PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

CVPR 2026 Poster

Abstract

We present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing, two tasks that are inherently challenging because many plausible explanations exist for the same input. At the heart of our approach lies a novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely adopted low-level losses such as LPIPS, SSIM, or L1, it relies on deep visual understanding of images and the resulting generalized supervision signals. We show that our new loss can serve as a drop-in replacement for standard losses and improves visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view consistency and on in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles. We also extend our base model to disentangled 3D editing by swapping the encoder and fine-tuning the network. A segmentation map controls geometry, and either a text prompt or a reference image specifies appearance. We highlight the intuitive and powerful 3D editing capabilities through an interactive GUI.

Method

Overview of Our Method. Our framework supports 3D reconstruction from a single image and 3D editing from a segmentation map and a style input. Only the encoders of our ViT-based architecture need to be adapted between the two tasks. We train on multi-view datasets and extend them with single-view data to improve diversity without degrading 3D consistency. At the heart of our method is the supervision through DINOv2 and SAM 2.1. Our perceptual loss compares deep image representations instead of raw pixel values, which keeps the supervision from being driven by pixel-level differences.
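To make the idea of comparing deep image representations concrete, the following is a minimal sketch rather than the actual PercHead loss. It assumes that each frozen encoder maps an image to a flat feature vector, that features are compared with cosine distance, and that the per-encoder terms are combined by a weighted sum; none of these choices is stated on this page. The names `perceptual_loss`, `cosine_distance`, `toy_encoder_a`, and `toy_encoder_b` are hypothetical, and the two toy encoders are simple stand-ins for DINOv2 and SAM 2.1 so that the example runs without any model weights.

```python
import math
from typing import Callable, Optional, Sequence

Image = Sequence[float]                       # flattened image, stand-in for a tensor
Encoder = Callable[[Image], Sequence[float]]  # frozen feature extractor (hypothetical interface)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 and norm_b == 0.0:
        return 0.0  # two all-zero feature vectors are treated as identical
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # direction undefined for one vector: treat as dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def perceptual_loss(
    rendered: Image,
    target: Image,
    encoders: Sequence[Encoder],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of feature-space distances, one term per encoder.

    Assumption: the distance (cosine) and the weighting scheme are
    illustrative choices, not taken from the source.
    """
    if weights is None:
        weights = [1.0] * len(encoders)
    if len(weights) != len(encoders):
        raise ValueError("need exactly one weight per encoder")
    total = 0.0
    for encode, weight in zip(encoders, weights):
        # The images are compared only through their features, never pixel by pixel.
        total += weight * cosine_distance(encode(rendered), encode(target))
    return total


def toy_encoder_a(img: Image) -> Sequence[float]:
    """Toy stand-in for a pretrained encoder: averages neighboring pixel pairs."""
    return [(img[i] + img[i + 1]) / 2.0 for i in range(0, len(img) - 1, 2)]


def toy_encoder_b(img: Image) -> Sequence[float]:
    """Toy stand-in for a second encoder: differences of adjacent pixels."""
    return [img[i + 1] - img[i] for i in range(len(img) - 1)]


if __name__ == "__main__":
    rendered = [0.1, 0.5, 0.9, 0.3]
    target = [0.9, 0.1, 0.2, 0.8]
    print(perceptual_loss(rendered, target, [toy_encoder_a, toy_encoder_b]))
```

In a real training setup the toy encoders would be replaced by frozen pretrained networks and the computation would run on tensors with automatic differentiation, so that gradients flow to the rendered image only.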

Video

Reconstruction Results

Frame-by-Frame Video Reconstruction

Panels: Input, Reconstructed Input View, Reconstructed 3D View

Interactive 3D Editing UI

Processing in the video is sped up.

Disentangled Editing: Geometry via Segmentation & Style via Image

Disentangled Editing: Geometry via Segmentation & Style via Prompt

"female with curly hair"

"female with dark hair"

"female with red hair"

"female with gray hair"

"male kid"

"male adult"

"female adult"

"middle-aged female"

"male with no beard"

"male with gray beard"

"man with dark skin"

"serious looking female"

BibTeX


        @inproceedings{oroz2026perchead,
          title={{PercHead}: Perceptual Head Model for Single-Image {3D} Head Reconstruction \& Editing},
          author={Oroz, Antonio and Nie{\ss}ner, Matthias and Kirschstein, Tobias},
          booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          pages={4097--4108},
          year={2026}
        }