CVPR 2026

Human Interaction-Aware 3D Reconstruction from a Single Image

HUG3D reconstructs physically plausible, high-fidelity textured 3D humans from a single image, while handling perspective distortion, occlusion, and inter-human interactions.

Gwanghyun Kim1*, Junghun James Kim2*, Suh Yoon Jeon1*, Jason Park1, Se Young Chun1,2†

1Dept. of Electrical and Computer Engineering, Seoul National University
2INMC & IPAI, Seoul National University
*Equal contribution. †Corresponding author.

HUG3D motivating figure
Core challenges in monocular multi-human 3D reconstruction and HUG3D's holistic solution.

Abstract

Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of per-person reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group context, resolving occlusions and close-proximity ambiguities. Subsequently, the Human Group-Instance Geometric Reconstruction (HUG-GR) module optimizes the geometry with explicit, physics-based interaction priors that enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture. Together, these components form our complete framework, HUG3D. Extensive experiments show that HUG3D significantly outperforms existing single-human and multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image.
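The exact Pers2Ortho formulation is not detailed on this page, but the geometric distortion it removes is easy to illustrate. The sketch below is a minimal, hypothetical example rather than the actual module: it unprojects a perspective depth map into camera space with NumPy and then reads off orthographic coordinates, assuming per-pixel depth is known; the function name `pers2ortho_points` and the toy plane are illustrative only.

```python
import numpy as np

def pers2ortho_points(depth, K):
    """Unproject a perspective depth map into 3D camera space.

    depth: (H, W) metric depth along the optical axis.
    K:     (3, 3) perspective intrinsics.
    Under an orthographic camera, the returned points project by
    keeping (x, y) as-is: there is no division by z, which is what
    removes perspective foreshortening between near and far people.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))           # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                          # back-projected rays
    return rays * depth.reshape(-1, 1)                       # scale rays by depth

# Toy example: a fronto-parallel plane 2 m from a 60-degree-FoV camera.
H = W = 64
f = 0.5 * W / np.tan(np.deg2rad(60) / 2)
K = np.array([[f, 0, W / 2], [0, f, H / 2], [0, 0, 1.0]])
pts = pers2ortho_points(np.full((H, W), 2.0), K)
ortho_xy = pts[:, :2]       # orthographic image coordinates, in meters
print(ortho_xy.min(0), ortho_xy.max(0))
```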

Video Comparisons

Method Overview

HUG3D method overview
Overview of our HUG3D framework. Given a single perspective image, (1) the Canonical Perspective-to-Orthographic View Transform (Pers2Ortho) converts it into a consistent multi-view orthographic representation. (2) The Human Group-Instance Multi-View Diffusion (HUG-MVD) model completes occluded geometry and texture while maintaining plausible interactions. (3) The Textured Mesh Reconstruction stage refines the mesh with our physics-based Human Group-Instance Geometric Reconstruction (HUG-GR) module and generates high-fidelity textures.
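The page does not spell out which interaction priors HUG-GR uses; a standard physics-based choice for avoiding unrealistic overlaps is an interpenetration penalty that pushes one person's surface samples out of another person's signed distance field. The PyTorch sketch below is a generic version of that idea, not HUG3D's actual loss: `penetration_loss` is a hypothetical name, and a unit sphere stands in for a real per-person SDF.

```python
import torch

def penetration_loss(points, sdf_other):
    """Penalize surface samples of person A that lie inside person B.

    points:    (N, 3) surface samples from person A.
    sdf_other: callable mapping (N, 3) points to (N,) signed distances
               to person B's surface (negative means inside).
    Only penetrating samples (negative distance) contribute.
    """
    d = sdf_other(points)
    return torch.relu(-d).square().mean()

# Stand-in SDF: a unit sphere at the origin plays the role of person B.
def sphere_sdf(p, center=torch.zeros(3), radius=1.0):
    return (p - center).norm(dim=-1) - radius

# Toy optimization: push a random point cloud (person A) out of the
# sphere while a weak anchor term keeps it near its initial state.
pts = (torch.randn(256, 3) * 0.5).requires_grad_()
init = pts.detach().clone()
opt = torch.optim.Adam([pts], lr=1e-2)
for _ in range(200):
    loss = penetration_loss(pts, sphere_sdf) + 0.1 * (pts - init).square().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(penetration_loss(pts, sphere_sdf)))  # near zero after optimization
```

In a full reconstruction loop, such a penalty would be evaluated between every pair of nearby people and balanced against data terms (e.g., normal and silhouette losses), so that resolving collisions does not erase genuine inter-human contact.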