Overview of Our Method. Our framework supports 3D Reconstruction from a single image and 3D Editing from a segmentation map and style input. Both tasks share a 3D ViT decoder that lifts 2D features via iterative cross-attention, differing only in the encoder. The reconstruction model uses a dual-branch encoder with DINOv2 and a task-specific ViT; the editing model uses a segmentation ViT and injects a global CLIP style token. Outputs are rendered via Gaussian Splatting and refined with a 2D CNN, with supervision from DINOv2 and SAM2.1.
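To make the data flow concrete, below is a minimal PyTorch sketch of the reconstruction path. All names and sizes are our own assumptions, not the paper's (Decoder3DViT, DualBranchEncoder, the 3D token count, and the 14-value Gaussian parameterization are hypothetical): learnable 3D tokens repeatedly cross-attend to fused 2D encoder features and are decoded into Gaussian Splatting primitives. Simple strided convolutions stand in for the real DINOv2 and task-specific ViT branches.

import torch
import torch.nn as nn

class Decoder3DViT(nn.Module):
    """Shared 3D ViT decoder (illustrative): learnable 3D tokens lift
    2D encoder features via iterative cross-attention."""
    def __init__(self, dim=768, num_3d_tokens=1024, num_blocks=6, heads=12):
        super().__init__()
        self.tokens_3d = nn.Parameter(torch.randn(1, num_3d_tokens, dim) * 0.02)
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "xattn": nn.MultiheadAttention(dim, heads, batch_first=True),
                "sattn": nn.MultiheadAttention(dim, heads, batch_first=True),
                "norm1": nn.LayerNorm(dim),
                "norm2": nn.LayerNorm(dim),
                "mlp": nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                     nn.GELU(), nn.Linear(4 * dim, dim)),
            }) for _ in range(num_blocks)
        ])
        # Head predicting per-token Gaussian Splatting parameters
        # (hypothetical 14-dim layout: xyz, scale, rotation, opacity, color).
        self.to_gaussians = nn.Linear(dim, 14)

    def forward(self, feats_2d):                      # feats_2d: (B, N_patches, dim)
        x = self.tokens_3d.expand(feats_2d.size(0), -1, -1)
        for blk in self.blocks:                       # iterative cross-attention lifting
            q = blk["norm1"](x)
            x = x + blk["xattn"](q, feats_2d, feats_2d, need_weights=False)[0]
            q = blk["norm2"](x)
            x = x + blk["sattn"](q, q, q, need_weights=False)[0]
            x = x + blk["mlp"](x)
        return self.to_gaussians(x)                   # (B, num_3d_tokens, 14)

class DualBranchEncoder(nn.Module):
    """Reconstruction encoder (illustrative): a DINOv2 branch fused with a
    task-specific ViT branch; placeholders stand in for both."""
    def __init__(self, dim=768):
        super().__init__()
        self.dino_branch = nn.Conv2d(3, dim, kernel_size=14, stride=14)  # placeholder for DINOv2
        self.task_branch = nn.Conv2d(3, dim, kernel_size=14, stride=14)  # placeholder task ViT
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, img):                           # img: (B, 3, H, W)
        f1 = self.dino_branch(img).flatten(2).transpose(1, 2)
        f2 = self.task_branch(img).flatten(2).transpose(1, 2)
        return self.fuse(torch.cat([f1, f2], dim=-1)) # (B, N_patches, dim)

img = torch.randn(1, 3, 518, 518)
gaussians = Decoder3DViT()(DualBranchEncoder()(img))
print(gaussians.shape)  # torch.Size([1, 1024, 14])

In this sketch, the editing model would swap DualBranchEncoder for a segmentation ViT and append a global CLIP style token to feats_2d; the 2D CNN refinement and the DINOv2/SAM2.1 supervision sit outside the decoder and are omitted here.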
@misc{oroz2025percheadperceptualheadmodel,
  title={PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction \& Editing},
  author={Antonio Oroz and Matthias Nießner and Tobias Kirschstein},
  year={2025},
  eprint={2511.02777},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.02777},
}