Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction

SJTU¹, Sii², FDU³, BJTU⁴, ZJU⁵

Abstract

Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant and unsolved challenge. In this work, we introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem by leveraging sparse, human-in-the-loop contact point annotations, while maintaining high spatio-temporal coherence and physical plausibility. Leveraging this framework, we introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions. Furthermore, we demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. At the same time, a comprehensive benchmark of existing 3D foundation models indicates that automatically predicting precise human-object contact correspondences remains an unsolved problem, underscoring the immediate necessity of our human-in-the-loop strategy while posing an open challenge to the community.

Method Details

Pipeline: Our reconstruction pipeline consists of four stages. First, we perform automated reconstruction. Second, the reconstructed results are annotated with sparse, human-in-the-loop contact points. Third, we apply 4DHOISolver to optimize the reconstruction based on these annotations. Finally, we conduct physical imitation.

Our automated 4D reconstruction pipeline consists of three components: (a) human and object tracking, (b) 3D reconstruction, and (c) spatial alignment.

Annotation app: the first row shows the reference video, the second row displays the 3D-Human Joint annotations, and the third row presents the 3D-2D Projection annotations.
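To make the idea of a 3D-2D projection annotation concrete, the sketch below shows how a 3D point in camera coordinates maps to a pixel and how the mismatch against an annotated pixel could be measured. It is only an illustration under stated assumptions: a standard pinhole camera without distortion, and hypothetical function names (`project_points`, `reprojection_error`) and intrinsics that do not come from the annotation app itself.

```python
import numpy as np

def project_points(points_cam, fx, fy, cx, cy):
    """Project Nx3 camera-frame points to Nx2 pixels (assumed pinhole model)."""
    points_cam = np.asarray(points_cam, dtype=float)
    z = points_cam[:, 2]  # depth along the optical axis, must be > 0
    u = fx * points_cam[:, 0] / z + cx
    v = fy * points_cam[:, 1] / z + cy
    return np.stack([u, v], axis=1)

def reprojection_error(points_cam, pixels, fx, fy, cx, cy):
    """Mean pixel distance between projected 3D points and annotated 2D points."""
    projected = project_points(points_cam, fx, fy, cx, cy)
    pixels = np.asarray(pixels, dtype=float)
    return float(np.mean(np.linalg.norm(projected - pixels, axis=1)))

# Usage with made-up intrinsics (assumed values, for illustration only).
points = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
print(project_points(points, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
# → [[320. 240.]
#    [570. 240.]]
```

A point on the optical axis lands on the principal point, and a point offset by 1 unit at depth 2 shifts by `fx * 0.5` pixels, which is the relationship an annotator implicitly supplies when clicking the 2D location of a 3D object point.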

Method Overview

Our pipeline. The optimizer first converts the human and object into 3D Gaussian points, then calculates a rendering loss by comparing the Gaussian-rendered image with the ground-truth image. This loss is backpropagated to update the object’s pose parameters and the human’s LBS parameters. We also calculate an HOI loss, which includes collision, depth, and contact losses. Finally, we refine the result by optimizing the contact regions.
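As a rough illustration of how a contact term can pull the object toward annotated contact points, the sketch below minimizes a squared-distance contact loss over the object's translation. It is not the 4DHOISolver implementation: the function names (`contact_loss`, `contact_loss_grad_t`, `fit_translation`), the squared-distance form of the loss, the fixed rotation, and plain gradient descent are all assumptions made for this example, and the rendering, collision, and depth terms are omitted.

```python
import numpy as np

def contact_loss(human_pts, object_pts, R, t):
    """Mean squared distance between paired human and object contact points.

    human_pts, object_pts: Nx3 arrays of corresponding points (assumed pairing).
    R: 3x3 object rotation, t: 3-vector object translation.
    """
    moved = object_pts @ R.T + t
    diff = moved - human_pts
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def contact_loss_grad_t(human_pts, object_pts, R, t):
    """Analytic gradient of contact_loss with respect to the translation t."""
    moved = object_pts @ R.T + t
    return 2.0 * np.mean(moved - human_pts, axis=0)

def fit_translation(human_pts, object_pts, R, steps=200, lr=0.1):
    """Gradient descent on t only; rotation is held fixed in this sketch."""
    t = np.zeros(3)
    for _ in range(steps):
        t = t - lr * contact_loss_grad_t(human_pts, object_pts, R, t)
    return t

# Usage: synthetic contact pairs where the object is offset from the hand.
rng = np.random.default_rng(0)
object_pts = rng.normal(size=(5, 3))
true_offset = np.array([0.1, -0.2, 0.3])
human_pts = object_pts + true_offset
t_fit = fit_translation(human_pts, object_pts, np.eye(3))
print(np.round(t_fit, 3), contact_loss(human_pts, object_pts, np.eye(3), t_fit))
```

In a full optimizer the same pattern would extend to rotation and to the human's parameters through automatic differentiation, with the contact term weighted against the other losses.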

Reconstruction Results

[Video panels: Input Video, Reconstruction, HOI Simulation]

[Video comparisons: four Before / After pairs]

BibTeX

@misc{wen2025efficientscalablemonocularhumanobject,
  title={Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction},
  author={Boran Wen and Ye Lu and Keyan Wan and Sirui Wang and Jiahong Zhou and Junxuan Liang and Xinpeng Liu and Bang Xiao and Dingbang Huang and Ruiyang Liu and Yong-Lu Li},
  year={2025},
  eprint={2512.00960},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.00960},
}
