PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

CVPR 2026 (Highlight)

Technical University of Munich
PercHead reconstructs 3D heads from a single input image and enables disentangled 3D editing using semantic maps combined with image or text-based style inputs.

Abstract

We present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing, two tasks that are inherently challenging because many plausible explanations exist for the same input. At the heart of our approach lies a novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely adopted low-level losses such as LPIPS, SSIM, or L1, it relies on deep visual understanding of images and the generalized supervision signals that result. We show that our new loss can serve as a drop-in replacement for standard losses and improves visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view consistency and on in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles. We also extend our base model to disentangled 3D editing by swapping the encoder and fine-tuning the network: a segmentation map controls geometry, and either a text prompt or a reference image specifies appearance. We highlight the intuitive and powerful 3D editing capabilities through an interactive GUI.

Method

Overview of our method showing the reconstruction and editing pipelines.

Overview of Our Method. Our framework supports 3D reconstruction from a single image and 3D editing from a segmentation map and a style input. Switching between the two tasks only requires adapting the encoders of our ViT-based architecture. We train on multi-view datasets and extend them with single-view data to improve diversity without degrading 3D consistency. At the heart of our method is the supervision through DINOv2 and SAM 2.1. Our perceptual loss compares deep image representations instead of raw pixels, which helps us avoid supervision signals driven by pixel-level differences.
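To make the idea of comparing deep representations concrete, here is a minimal, dependency-free sketch of a feature-space loss. It is an illustration under stated assumptions, not the PercHead implementation: the two toy encoders merely stand in for frozen DINOv2 and SAM 2.1 backbones, and the mean-squared feature distance, the per-encoder weights, and all function names are hypothetical choices made for this example. In practice the encoders would be pretrained networks evaluated in a tensor library.

```python
from typing import Callable, List, Optional, Sequence

Image = Sequence[float]                      # a flattened toy "image"
Encoder = Callable[[Image], List[float]]     # frozen feature extractor


def feature_mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared distance between two feature vectors (assumed metric)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("feature vectors must be non-empty and equally long")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def perceptual_loss(
    rendered: Image,
    target: Image,
    encoders: Sequence[Encoder],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of feature distances over several frozen encoders."""
    if weights is None:
        weights = [1.0] * len(encoders)      # assumption: equal weighting
    if len(weights) != len(encoders):
        raise ValueError("need exactly one weight per encoder")
    total = 0.0
    for encoder, weight in zip(encoders, weights):
        # Both images pass through the same encoder; only features are compared.
        total += weight * feature_mse(encoder(rendered), encoder(target))
    return total


def toy_pooling_encoder(image: Image) -> List[float]:
    """Stand-in encoder: averages neighbouring pairs (coarse structure)."""
    return [(image[i] + image[i + 1]) / 2 for i in range(0, len(image) - 1, 2)]


def toy_edge_encoder(image: Image) -> List[float]:
    """Stand-in encoder: absolute neighbour differences (fine detail)."""
    return [abs(image[i + 1] - image[i]) for i in range(len(image) - 1)]


encoders = [toy_pooling_encoder, toy_edge_encoder]
target = [0.0, 1.0, 0.0, 1.0]       # alternating pattern, i.e. high-frequency detail
blurry = [0.5, 0.5, 0.5, 0.5]       # same average brightness, detail lost

loss_identical = perceptual_loss(target, target, encoders)
loss_blurry = perceptual_loss(blurry, target, encoders)
```

In this toy setup the pooling encoder cannot tell the blurry image from the target, while the edge encoder penalizes the missing detail, so the combined loss is zero for the identical pair and positive for the blurry one. Combining several encoders with different sensitivities is the design choice the sketch is meant to show; the actual features, distance, and weighting used by the method are not specified here.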

Video

Reconstruction Results

Input 01
Input 02
Input 03
Input 04
Input 05
Input 06
Input 07
Input 08
Input 09
Input 10
Input 11
Input 12

Frame-by-Frame Video Reconstruction

Input
Reconstructed Input View
Reconstructed 3D View

Interactive 3D Editing UI

Processing in the video is sped up.

Disentangled Editing: Geometry via Segmentation & Style via Image

Geometry Input 01 Style Input 01
Geometry Input 02 Style Input 02
Geometry Input 03 Style Input 03
Geometry Input 04 Style Input 04
Geometry Input 05 Style Input 05
Geometry Input 06 Style Input 06
Geometry Input 07 Style Input 07
Geometry Input 08 Style Input 08
Geometry Input 09 Style Input 09
Geometry Input 10 Style Input 10
Geometry Input 11 Style Input 11
Geometry Input 12 Style Input 12

Disentangled Editing: Geometry via Segmentation & Style via Prompt

Geometry Input 01: "female with curly hair"
Geometry Input 02: "female with dark hair"
Geometry Input 03: "female with red hair"
Geometry Input 04: "female with gray hair"
Geometry Input 05: "male kid"
Geometry Input 06: "male adult"
Geometry Input 07: "female adult"
Geometry Input 08: "middle-aged female"
Geometry Input 09: "male with no beard"
Geometry Input 10: "male with gray beard"
Geometry Input 11: "man with dark skin"
Geometry Input 12: "serious looking female"

BibTeX


      @misc{oroz2025percheadperceptualheadmodel,
        title={PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction \& Editing},
        author={Antonio Oroz and Matthias Nießner and Tobias Kirschstein},
        year={2025},
        eprint={2511.02777},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2511.02777}, 
      }