ArXiv Preprint
Labeling articulated objects in unconstrained settings has a wide variety of
applications, including entertainment, neuroscience, psychology, ethology, and
many fields of medicine. Large offline labeled datasets exist only for the most
common articulated object categories (e.g., humans), and hand labeling these
landmarks within a video sequence is a laborious task. Learned landmark
detectors can help, but can be error-prone when trained from only a few
examples. Multi-camera systems that train fine-grained detectors have shown
significant promise in detecting such errors, allowing for self-supervised
solutions that only need a small percentage of the video sequence to be
hand-labeled. These approaches, however, rely on calibrated cameras and rigid
geometry, making them expensive, difficult to manage, and impractical in
real-world scenarios. In this paper, we address these bottlenecks by combining
a non-rigid 3D neural prior with deep flow to obtain high-fidelity landmark
estimates from videos with only two or three uncalibrated, handheld cameras.
With just a few annotations (representing 1–2% of the frames), we are able to
produce 2D results comparable to state-of-the-art fully supervised methods,
along with 3D reconstructions that are impossible with other existing
approaches. Our Multi-view Bootstrapping in the Wild (MBW) approach
demonstrates impressive results on standard human datasets, as well as tigers,
cheetahs, fish, colobus monkeys, chimpanzees, and flamingos from videos
captured casually in a zoo. We release the codebase for MBW as well as this
challenging zoo dataset, consisting of image frames of tail-end-distribution
categories with their corresponding 2D and 3D labels, generated with minimal
human intervention.
Mosam Dabhi, Chaoyang Wang, Tim Clifford, Laszlo Attila Jeni, Ian R. Fasel, Simon Lucey
2022-10-04