MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
Summary
This paper presents the first end-to-end neural framework for motion capture from monocular video that works with arbitrary skeleton structures. The key innovation is making pose-to-rotation prediction learnable by conditioning on a reference pose-rotation pair. The pair anchors the coordinate system, resolving fundamental ambiguities in mapping 3D joint positions to joint rotations. The system eliminates mesh intermediates and analytical inverse kinematics (IK), running inference 20x faster than mesh-based approaches while reducing rotation error from ~17° to ~10°. The method generalizes across humans, animals, and fictional characters without skeleton-specific training.
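The ambiguity in mapping positions to rotations can be illustrated with a toy single-bone example. The sketch below is not the paper's model; the helper names (`rot_x`, `rot_z`, `child_position`) and the two hypothetical rigs are assumptions made for illustration. It shows two standard sources of ambiguity: twist about the bone axis leaves the child joint position unchanged, and rigs with different rest offsets need different rotations to reach the same position.

```python
import numpy as np

def rot_x(theta):
    """Rotation matrix about the x-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_z(theta):
    """Rotation matrix about the z-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def child_position(parent_pos, rotation, rest_offset):
    """Forward kinematics for one bone: rotate the rest offset, then translate."""
    return parent_pos + rotation @ rest_offset

parent = np.zeros(3)
offset_a = np.array([1.0, 0.0, 0.0])  # hypothetical rig A: bone rests along +x
offset_b = np.array([0.0, 1.0, 0.0])  # hypothetical rig B: bone rests along +y

# Ambiguity 1: twisting about the bone axis does not move the child joint.
bend = rot_z(np.pi / 4)
twisted = bend @ rot_x(np.pi / 3)
pos_plain = child_position(parent, bend, offset_a)
pos_twisted = child_position(parent, twisted, offset_a)

# Ambiguity 2: a rig with a different rest offset needs a different rotation
# to place the child joint at the same position.
rot_b = rot_z(np.pi / 4 - np.pi / 2)
pos_b = child_position(parent, rot_b, offset_b)

print(np.allclose(pos_plain, pos_twisted), np.allclose(bend, twisted))
print(np.allclose(pos_plain, pos_b), np.allclose(bend, rot_b))
```

In both cases the joint positions match while the rotations differ, so positions alone do not determine rotations. A known pose-rotation pair for the target rig supplies the missing convention, which is the role the summary assigns to the reference pair.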
Key findings
- Reference pose-rotation pairs resolve coordinate system ambiguity, enabling learnable pose-to-rotation mapping where analytical IK fails
- End-to-end training allows pose representations to adapt for rotation objectives, improving accuracy over factorized pipelines
- Removing mesh intermediates improves robustness and speeds up inference 20x compared to mesh-based approaches
- The method achieves the best performance on unseen skeletons (6.54° rotation error), attributed to effective coordinate system anchoring
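Rotation errors reported in degrees, like those above, are commonly computed as the geodesic angle between predicted and ground-truth rotation matrices. The summary does not state which metric the paper uses, so the sketch below assumes the geodesic angle; the function names `geodesic_error_deg` and `mean_joint_error_deg` are hypothetical.

```python
import numpy as np

def geodesic_error_deg(r_pred, r_true):
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices."""
    r_rel = r_pred.T @ r_true
    cos_angle = (np.trace(r_rel) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def mean_joint_error_deg(pred, true):
    """Mean geodesic error over a sequence of per-joint rotation matrices."""
    return float(np.mean([geodesic_error_deg(p, t) for p, t in zip(pred, true)]))

def rot_z(theta):
    """Rotation matrix about the z-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Two joints: one predicted 10 degrees off, one predicted exactly.
true_rots = [rot_z(np.radians(30.0)), rot_z(np.radians(-15.0))]
pred_rots = [rot_z(np.radians(40.0)), rot_z(np.radians(-15.0))]

single_error = geodesic_error_deg(pred_rots[0], true_rots[0])
mean_error = mean_joint_error_deg(pred_rots, true_rots)
print(round(single_error, 3), round(mean_error, 3))
```

The metric depends only on the relative rotation, so it is independent of the world frame in which both rotations are expressed.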
How to implement
- Build real-time character animation tools for game engines that can animate any rigged 3D character from smartphone video, so indie developers can produce professional-quality motion capture without expensive hardware
- Develop automated animation pipelines for film/VFX studios that can retarget human performances to fantasy creatures or animals, reducing manual keyframing work for creature animation
- Create AR/VR applications that let users control virtual avatars of any species or fictional character using just their phone camera, enabling more diverse virtual embodiment experiences