Overview of Our Method. Our framework supports 3D reconstruction from a single image and 3D editing from a segmentation map and a style input. Only the encoders of our ViT-based architecture need to be adapted between the two tasks. We train on multi-view datasets and extend them with single-view data to improve diversity without degrading 3D consistency. At the heart of our method is supervision through DINOv2 and SAM 2.1: our perceptual loss compares deep image representations rather than raw pixel values, which reduces our reliance on pixel-level differences.
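To make the idea of a feature-space loss concrete, here is a minimal sketch. It is not the PercHead implementation. `toy_encoder` is a hypothetical stand-in for a frozen backbone such as DINOv2 or SAM 2.1. The mean squared feature distance and the equal weighting across encoders are also assumptions made for illustration.

```python
from functools import partial
from typing import Callable, List, Sequence

Image = List[List[float]]  # grayscale H x W grid, a stand-in for a rendered image
Encoder = Callable[[Image], List[float]]


def toy_encoder(image: Image, patch: int = 4) -> List[float]:
    """Hypothetical stand-in for a frozen backbone: one mean intensity per patch."""
    height, width = len(image), len(image[0])
    features = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            total = sum(
                image[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            )
            features.append(total / (patch * patch))
    return features


def pixel_l1(pred: Image, target: Image) -> float:
    """Mean absolute difference over raw pixels."""
    diffs = [
        abs(p - t)
        for row_p, row_t in zip(pred, target)
        for p, t in zip(row_p, row_t)
    ]
    return sum(diffs) / len(diffs)


def perceptual_loss(pred: Image, target: Image, encoders: Sequence[Encoder]) -> float:
    """Mean squared feature distance, averaged over all encoders (assumed weighting)."""
    total = 0.0
    for encode in encoders:
        f_pred, f_target = encode(pred), encode(target)
        total += sum((a - b) ** 2 for a, b in zip(f_pred, f_target)) / len(f_pred)
    return total / len(encoders)


# Usage: a checkerboard and the same pattern shifted by one pixel.
board = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
shifted = [[float((x + y + 1) % 2) for x in range(8)] for y in range(8)]
encoders = [partial(toy_encoder, patch=4), partial(toy_encoder, patch=2)]

print(pixel_l1(board, shifted))                   # 1.0: every pixel differs
print(perceptual_loss(board, shifted, encoders))  # 0.0: patch statistics are identical
```

With a real backbone the feature distance would generally not vanish for a shifted image. The toy case only illustrates why comparing representations can be less sensitive to per-pixel misalignment than a pixel-wise loss.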
Processing in the video is sped up.
@misc{oroz2025percheadperceptualheadmodel,
title={PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction \& Editing},
author={Antonio Oroz and Matthias Nießner and Tobias Kirschstein},
year={2025},
eprint={2511.02777},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.02777},
}