Accepted to ICRA 2024. Info below will be updated as the conference date approaches.

HandyPriors: Physically Consistent Perception of Hand-Object Interactions with Differentiable Priors

Shutong Zhang¹,*, Yiling Qiao²,*, Guanglei Zhu¹,*, Eric Heiden³, Dylan Turpin¹,³, Jingzhou Liu¹,
Ming Lin², Miles Macklin³, Animesh Garg³

¹University of Toronto & Vector Institute ²University of Maryland College Park ³Nvidia

[Paper]

Abstract

Past work has proposed various heuristic objectives for modeling hand-object interaction. However, without a cohesive framework, these objectives often have a narrow scope of applicability and are limited in efficiency or accuracy. In this paper, we propose HandyPriors, a unified and general pipeline for pose estimation in human-object interaction scenes that leverages recent advances in differentiable physics and rendering. Our approach employs rendering priors to align with input images and segmentation masks, along with physics priors to mitigate penetration and relative sliding across frames. Furthermore, we present two alternatives for hand and object pose estimation. The optimization-based pose estimation achieves higher accuracy, while the filtering-based tracking, which uses the differentiable priors as dynamics and observation models, executes faster. We demonstrate that HandyPriors attains comparable or superior results in the pose estimation task, and that the differentiable physics module can predict contact information for pose refinement. We also show that our approach generalizes to other perception tasks, including robotic hand manipulation and human-object pose estimation in the wild.
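To make the combination of rendering and physics priors concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a 2D scene in which the object and a single fingertip are discs, a soft silhouette stands in for the differentiable renderer, a squared penetration depth stands in for the physics prior, and central finite differences stand in for the gradients that automatic differentiation would provide. All names, radii, weights, and step sizes are hypothetical.

```python
import numpy as np

# Pixel-centre coordinates of a 48x48 image covering the unit square.
N = 48
coords = (np.arange(N) + 0.5) / N
ys, xs = np.meshgrid(coords, coords, indexing="ij")
GRID = np.stack([xs, ys], axis=-1)

OBJ_R, HAND_R = 0.15, 0.10            # hypothetical disc radii
HAND_POS = np.array([0.30, 0.50])     # fingertip, held fixed in this toy
TRUE_POS = np.array([0.58, 0.50])     # object centre that produced the mask
INIT_POS = np.array([0.45, 0.55])     # initial guess, overlapping the fingertip


def soft_mask(center, tau=0.03):
    """Soft silhouette of the object disc; smooth in `center`."""
    dist = np.linalg.norm(GRID - center, axis=-1)
    return 1.0 / (1.0 + np.exp(-(OBJ_R - dist) / tau))


def render_loss(center, target):
    """Rendering term: mean squared difference to the observed mask."""
    return np.mean((soft_mask(center) - target) ** 2)


def penetration(center):
    """Overlap depth between the object disc and the fingertip disc."""
    return max(0.0, OBJ_R + HAND_R - np.linalg.norm(center - HAND_POS))


def total_loss(center, target, w_phys=10.0):
    """Rendering term plus weighted squared penetration depth."""
    return render_loss(center, target) + w_phys * penetration(center) ** 2


def numeric_grad(f, x, eps=1e-4):
    """Central-difference gradient, a stand-in for autodiff."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g


def optimize(init, target, steps=300, lr=0.05):
    """Plain gradient descent on the combined objective."""
    x = np.array(init, dtype=float)
    for _ in range(steps):
        x -= lr * numeric_grad(lambda c: total_loss(c, target), x)
    return x


target = soft_mask(TRUE_POS)          # plays the role of the segmentation mask
estimate = optimize(INIT_POS, target)
print("estimated centre:", estimate)
print("remaining penetration:", penetration(estimate))
```

In this toy the penetration term pushes the object out of the fingertip within the first few steps, after which the silhouette term pulls it onto the observed mask. A relative-sliding penalty across frames could be added as a further term in the same way, but is omitted here.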

Method

An overview of our optimization and filtering pipelines, which offer two alternatives for using the differentiable priors. Given the image and the estimates from previous frames, (a) the optimization pipeline first initializes the poses with pre-trained networks and then minimizes the rendering and physics losses computed by the differentiable operators; (b) the filtering pipeline takes simple observations and uses an Extended Kalman Filter (EKF) to update the state estimate. The EKF relies on differentiable physics and rendering to model the system, and it runs much faster than the optimization pipeline.
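The filtering alternative in (b) can be sketched with a generic EKF step. This is a minimal illustration under stated assumptions, not the authors' code: a constant-velocity model stands in for the differentiable physics step, a pinhole projection of the object centre plus its apparent radius stands in for the differentiable renderer, and central finite differences stand in for the Jacobians that automatic differentiation would supply. The functions `dynamics`, `observe`, `jacobian`, and `ekf_step`, along with every constant, are hypothetical.

```python
import numpy as np

DT, FOCAL, OBJ_R = 1.0 / 30.0, 500.0, 0.05   # frame time, focal length (px), radius (m)


def dynamics(x):
    """Stand-in for a differentiable physics step: constant velocity."""
    pos, vel = x[:3], x[3:]
    return np.concatenate([pos + DT * vel, vel])


def observe(x):
    """Stand-in for a differentiable renderer: projected centre and radius (px)."""
    X, Y, Z = x[:3]
    return np.array([FOCAL * X / Z, FOCAL * Y / Z, FOCAL * OBJ_R / Z])


def jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian; autodiff would provide this in practice."""
    J = np.zeros((f(x).size, x.size))
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        J[:, i] = (f(x + d) - f(x - d)) / (2 * eps)
    return J


def ekf_step(x, P, z, Q, R):
    """One EKF predict/update cycle for state x, covariance P, observation z."""
    # Predict: propagate the state and linearize the dynamics around it.
    F = jacobian(dynamics, x)
    x_pred = dynamics(x)
    P_pred = F @ P @ F.T + Q
    # Update: linearize the observation model around the prediction.
    H = jacobian(observe, x_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - observe(x_pred))
    P_new = (np.eye(x.size) - K @ H) @ P_pred
    return x_new, P_new


# Synthetic tracking run with a fixed seed and made-up noise levels.
rng = np.random.default_rng(0)
truth = np.array([0.10, -0.05, 1.00, 0.20, 0.10, -0.10])   # position, velocity
x = np.array([0.0, 0.0, 1.2, 0.0, 0.0, 0.0])               # rough initial guess
P = np.diag([0.1, 0.1, 0.1, 1.0, 1.0, 1.0])
Q = 1e-6 * np.eye(6)
noise_std = np.array([1.0, 1.0, 0.5])
R = np.diag(noise_std ** 2)

init_err = np.linalg.norm(x[:3] - truth[:3])
for _ in range(60):
    truth = dynamics(truth)
    z = observe(truth) + rng.normal(0.0, noise_std)
    x, P = ekf_step(x, P, z, Q, R)
final_err = np.linalg.norm(x[:3] - truth[:3])
print("position error before / after:", init_err, final_err)
```

Each frame costs one prediction and one update, each needing a single Jacobian evaluation, whereas the optimization pipeline iterates many gradient steps per frame. That difference is consistent with the speed advantage the caption describes.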

Results

Optimization Process of the Pose Estimation

Tracking

Contact Optimization

Human-Object Pose Optimization

PHOSA [1]

w/ Rendering term

w/ Rendering+Physics terms

Reference

[1] Zhang, Jason Y., Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa. "Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild." In ECCV, 2020.

BibTeX

@inproceedings{zhang2023handypriors,
                author         = {Zhang, Shutong and Qiao, Yi-Ling and Zhu, Guanglei and Heiden, Eric and Turpin, Dylan and Liu, Jingzhou and Lin, Ming and Macklin, Miles and Garg, Animesh},
                booktitle      = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
                title          = {HandyPriors: Physically Consistent Perception of Hand-Object Interactions with Differentiable Priors},
                year           = {2024},
                doi            = {10.1109/ICRA57147.2024.10610748},
                eprint         = {2311.16552},
                archivePrefix  = {arXiv},
              }
