Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction

SJTU1, Sii2, FDU3, BJTU4, STU5


Abstract

Generalist robots must learn from diverse, large-scale human-object interaction (HOI) data to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of such data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant, unsolved challenge. In this work, we introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem with sparse, human-in-the-loop contact point annotations while maintaining high spatio-temporal coherence and physical plausibility. Leveraging this framework, we introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions. We further demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. Finally, a comprehensive benchmark of existing 3D foundation models indicates that automatically predicting precise human-object contact correspondences remains an unsolved problem, underscoring the immediate necessity of our human-in-the-loop strategy while posing an open challenge to the community.

Method Details

Method Overview


Our pipeline. The optimizer first converts the human and object into 3D Gaussian points, then computes a rendering loss by comparing the Gaussian-rendered image with the ground-truth image. This loss is backpropagated to update the object’s pose parameters and the human’s linear blend skinning (LBS) parameters. We also compute an HOI loss comprising collision, depth, and contact terms. Finally, we refine the result by optimizing the contact regions.
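As a rough illustration of how such an objective might be assembled, the sketch below combines a rendering loss with weighted HOI terms and shows one plausible form of a contact term: the mean squared distance between annotated human contact points and their paired object points after applying the object pose. The function names, the pairing of points, the squared-distance form, and the weights are all assumptions made for illustration; the source does not specify the actual formulation used by 4DHOISolver.

```python
from typing import Dict, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def transform(point: Vec3, rotation: Mat3, translation: Vec3) -> Vec3:
    """Apply a rigid transform (rotation, then translation) to a 3D point."""
    x, y, z = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


def contact_loss(
    human_pts: Sequence[Vec3],
    object_pts: Sequence[Vec3],
    rotation: Mat3,
    translation: Vec3,
) -> float:
    """Hypothetical contact term: mean squared distance between paired points.

    human_pts[i] and object_pts[i] are assumed to be an annotated contact pair;
    object points are given in the object frame and moved by the object pose.
    """
    total = 0.0
    for h, o in zip(human_pts, object_pts):
        moved = transform(o, rotation, translation)
        total += sum((a - b) ** 2 for a, b in zip(h, moved))
    return total / len(human_pts)


def total_loss(
    render_loss: float,
    hoi_terms: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Rendering loss plus a weighted sum of HOI terms (weights are assumed)."""
    return render_loss + sum(weights[name] * value for name, value in hoi_terms.items())


IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
```

In a real optimizer these terms would be differentiable tensor operations so that gradients can flow back to the object pose and human parameters; plain Python floats are used here only to keep the sketch dependency-free.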

Interactive 3D Objects

Example 3D object models, by source: Self-recorded, Objaverse, Self-recorded, Inpaint.

Reconstruction Results

Video panels: Input Video, Reconstruction.

HOI Simulation

Four comparison pairs: Before, After.
