Generalist robots must learn from diverse, large-scale human-object interaction (HOI) data to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of such data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant unsolved challenge. In this work, we therefore introduce 4DHOISolver, an efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem with sparse, human-in-the-loop contact point annotations while maintaining high spatio-temporal coherence and physical plausibility. Building on this framework, we present Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions. We further demonstrate the quality of our reconstructions by training an RL-based agent to imitate the recovered motions. However, a comprehensive benchmark of existing 3D foundation models shows that automatically predicting precise human-object contact correspondences remains unsolved, underscoring the immediate necessity of our human-in-the-loop strategy and posing an open challenge to the community.
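To make the optimization concrete, below is a minimal sketch of a contact-constrained objective in the spirit of 4DHOISolver. All names here (the `contact_pairs` format, the loss weights, the SDF-based penetration term) are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a contact-constrained 4D HOI objective (assumed form,
# not the authors' exact implementation).
import torch

def hoi_objective(human_verts, obj_verts, contact_pairs,
                  w_contact=1.0, w_smooth=0.1, w_pen=0.5, sdf=None):
    """human_verts: (T, Nh, 3) tracked human vertices over T frames.
       obj_verts:   (T, No, 3) object vertices after per-frame posing.
       contact_pairs: list of (frame, human_idx, obj_idx) sparse annotations.
    """
    # 1) Sparse contact term: annotated human/object points should coincide.
    contact_loss = torch.zeros((), device=human_verts.device)
    for t, hi, oi in contact_pairs:
        contact_loss = contact_loss + torch.norm(
            human_verts[t, hi] - obj_verts[t, oi])

    # 2) Temporal coherence: penalize frame-to-frame acceleration.
    accel = obj_verts[2:] - 2 * obj_verts[1:-1] + obj_verts[:-2]
    smooth_loss = accel.pow(2).sum(dim=-1).mean()

    # 3) Physical plausibility: penalize human vertices inside the object,
    #    using a signed-distance function of the object (negative = inside).
    pen_loss = torch.zeros((), device=human_verts.device)
    if sdf is not None:
        d = sdf(human_verts)  # (T, Nh) signed distances
        pen_loss = torch.relu(-d).mean()

    return w_contact * contact_loss + w_smooth * smooth_loss + w_pen * pen_loss
```

In practice, `obj_verts` would be parameterized by a per-frame 6-DoF object pose (and `human_verts` by body-model parameters such as SMPL) and the objective minimized with a gradient-based optimizer such as Adam.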
Pipeline: Our reconstruction pipeline consists of four stages: (1) automated 4D reconstruction; (2) human-in-the-loop contact annotation; (3) optimization with 4DHOISolver based on the annotations; and (4) physical imitation.
Our automated 4D reconstruction pipeline consists of three components: (a) human and object tracking, (b) 3D reconstruction, and (c) spatial alignment.
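As one possible instantiation of the spatial-alignment component (c), a standard choice is a least-squares similarity transform (Umeyama) that registers the reconstructed object into the metric camera/human frame given corresponding 3D points. This is a hypothetical stand-in sketch, not necessarily the paper's procedure.

```python
# Umeyama similarity alignment: a standard stand-in for spatial alignment,
# assuming 3D correspondences between the object and the target frame exist.
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity transform: find s, R, t with dst ~ s*R@src + t.
       src, dst: (N, 3) arrays of corresponding 3D points."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)           # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = U @ D @ Vt                             # rotation, reflection-corrected
    var_src = (src_c ** 2).sum() / len(src)    # variance of source points
    s = np.trace(np.diag(S) @ D) / var_src     # isotropic scale
    t = mu_d - s * R @ mu_s                    # translation
    return s, R, t
```

Correspondences could come, for example, from annotated contact points or from tracked 2D keypoints lifted to 3D.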
Annotation app: the first row shows the reference video, the second row displays the 3D-Human Joint annotations, and the third row presents the 3D-2D Projection annotations.