
CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models

Felix Taubner1,2        Ruihang Zhang1       Mathieu Tuli3       David B. Lindell1,2
1University of Toronto        2Vector Institute        3LG Electronics

TL;DR: Our model can create realistic 4D avatars using any number of reference images.

Overview


Our model works in two stages: First, a morphable multi-view diffusion model (MMDM) generates a large number of images spanning different views and expressions, conditioned on the reference images. Then, we fit a 4D avatar to the generated and reference images. This avatar can be controlled via a 3DMM and rendered in real time.
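
For readers who prefer pseudocode, the sketch below outlines this two-stage pipeline in Python. Every function name (track_flame, sample_views_poses_expressions, mmdm_generate, fit_gaussian_avatar) is a hypothetical placeholder for the component named in the text, not part of a released API.

# High-level sketch of the two-stage pipeline; every function below is a
# hypothetical placeholder for the component named in the text above.

def create_avatar(reference_images):
    # Stage 1: the MMDM generates many images across views, poses, and
    # expressions, conditioned on the reference images and their tracked 3DMMs.
    tracked_3dmms  = [track_flame(img) for img in reference_images]   # FlowFace
    target_3dmms   = sample_views_poses_expressions(tracked_3dmms)
    generated_imgs = mmdm_generate(reference_images, tracked_3dmms, target_3dmms)

    # Stage 2: fit a deformable 3D Gaussian splatting avatar (GaussianAvatars)
    # to the reference and generated images.
    avatar = fit_gaussian_avatar(reference_images + generated_imgs,
                                 tracked_3dmms + target_3dmms)
    return avatar  # controllable via FLAME parameters, renders in real time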

Interactive Viewer

Click on the images below to inspect 4D avatars in your browser, powered by our modified version of Brush. Note that this is experimental and quality may be reduced.
Usage of this viewer may result in crashes. Please ensure that your available VRAM is greater than 3GB. Do not open this website in multiple tabs.


Method


Overview of CAP4D. (a) The method takes as input an arbitrary number of reference images \(\mathbf{I}_\text{ref}\) that are encoded into the latent space of a variational autoencoder. An off-the-shelf face tracker (FlowFace) estimates a 3DMM (FLAME), \(\mathbf{M}_\text{ref}\), for each reference image, from which we derive conditioning signals that describe camera view direction, \(\mathbf{V}_\text{ref}\), head pose \(\mathbf{P}_\text{ref}\), and expression \(\mathbf{E}_\text{ref}\). We associate additional conditioning signals with each input noisy latent image based on the desired generated viewpoints, poses, and expressions. The MMDM generates images through a stochastic input–output conditioning procedure that randomly samples reference images and generated images during each step of the iterative image generation process. (b) The generated and reference images are used with the tracked and sampled 3DMMs to reconstruct a 4D avatar based on a deformable 3D Gaussian splatting representation (GaussianAvatars).
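
To make the stochastic input–output conditioning concrete, the runnable PyTorch sketch below assembles one window of reference and generated latents for a single denoising step. The window size of eight, the reference/generated split, and the helper name sample_io_window are illustrative assumptions, not the paper's implementation.

import torch

# Minimal sketch of the stochastic input-output conditioning procedure: at each
# denoising step, a window of slots is filled with a random mix of clean
# reference latents and noisy latents of the images being generated. The window
# size, the reference/generated split, and the helper name are assumptions.

def sample_io_window(ref_latents, gen_latents, window=8, n_ref=2, generator=None):
    """Randomly assemble one window for a single denoising step.

    ref_latents: (N_ref, C, H, W) clean latents of the reference images
    gen_latents: (N_gen, C, H, W) current noisy latents of images to generate
    Returns the stacked window, a mask marking reference slots, and the indices
    of the sampled generated images (so denoised slots can be written back).
    """
    n_ref = min(n_ref, ref_latents.shape[0])
    n_gen = window - n_ref
    ref_idx = torch.randperm(ref_latents.shape[0], generator=generator)[:n_ref]
    gen_idx = torch.randperm(gen_latents.shape[0], generator=generator)[:n_gen]
    window_latents = torch.cat([ref_latents[ref_idx], gen_latents[gen_idx]], dim=0)
    is_ref = torch.zeros(window, dtype=torch.bool)
    is_ref[:n_ref] = True                       # reference slots stay noise-free
    return window_latents, is_ref, gen_idx

# Dummy example: 3 reference images, 100 target images, 4x64x64 latents.
refs = torch.randn(3, 4, 64, 64)
gens = torch.randn(100, 4, 64, 64)
win, is_ref, gen_idx = sample_io_window(refs, gens)
print(win.shape, int(is_ref.sum()), gen_idx.tolist())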

Morphable Multi-view Diffusion Model (MMDM)


MMDM architecture. Our model is initialized from Stable Diffusion 2.1, and we adapt the architecture for multi-view generation following CAT3D. We use a pre-trained image encoder to map the input images into the latent space, and we use the latent diffusion model to process eight images in parallel. We replace the 2D attention layers after the 2D residual blocks with 3D attention to share information between frames. The model is conditioned using images that provide information such as head pose (\(\mathbf{P}_\text{ref/gen}\)), expression (\(\mathbf{E}_\text{ref/gen}\)), and camera view (\(\mathbf{V}_\text{ref/gen}\)). These images are derived from the 3DMM and concatenated with the latent images. The denoised latent image is decoded using a pre-trained decoder.
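
The snippet below is a minimal, runnable PyTorch sketch of the two changes described above: concatenation of the conditioning images with the latent images, and attention shared across all frames in the window. The channel counts, feature dimension, and module layout are assumptions chosen for illustration, not the released architecture.

import torch
import torch.nn as nn

# Illustrative sketch (not the released code) of the two changes described above:
# (1) conditioning maps (pose P, expression E, view V) concatenated channel-wise
#     with each latent image, and (2) self-attention shared across all frames in
#     the window ("3D attention") instead of per-frame 2D attention.
# Channel counts and the module layout are assumptions for illustration only.

class FrameAttention3D(nn.Module):
    """Self-attention over tokens pooled from every frame in the window."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, N, C, H, W)
        b, n, c, h, w = x.shape
        tokens = x.permute(0, 1, 3, 4, 2).reshape(b, n * h * w, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        out = (tokens + out).reshape(b, n, h, w, c).permute(0, 1, 4, 2, 3)
        return out

# Dummy example: 8 frames, 4 latent channels plus 9 conditioning channels.
latents = torch.randn(1, 8, 4, 32, 32)         # noisy latent images
cond    = torch.randn(1, 8, 9, 32, 32)         # P, E, V maps rendered from the 3DMM
x = torch.cat([latents, cond], dim=2)          # channel-wise concatenation
proj = nn.Conv3d(13, 64, kernel_size=1)        # project to the feature dimension
x = proj(x.transpose(1, 2)).transpose(1, 2)    # (1, 8, 64, 32, 32)
print(FrameAttention3D(64)(x).shape)           # torch.Size([1, 8, 64, 32, 32])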

Gallery

We show various avatars generated using CAP4D in different settings: avatars from a few reference images, avatars from a single reference image, and more challenging settings such as avatars from images generated with text prompts and avatars from artwork. Note that while the MMDM inherits weights from Stable Diffusion, we do not train the MMDM on non-photoreal images. Please click on the arrow buttons to the sides to view all results.

Reference images are shown in the top row, images generated by the MMDM in the middle row, and the final 4D avatar in the bottom row.

Baseline Comparisons

We conduct experiments on the cross-reenactment and self-reenactment tasks. For quantitative results, we refer to our paper.

Self-reenactment results. We show more qualitative results from our self-reenactment evaluation with varying numbers of reference frames. The top row shows single-image reconstructions, the second row few-image (10) reconstructions, and the last row many-image (100) reconstructions. Our 4D avatar can leverage additional reference images to produce details that are not visible in the first reference image. Our results are significantly better than those of previous methods, especially when the view direction differs greatly from the reference image.

Cross-reenactment results. We generate an avatar based on a single image from the FFHQ dataset. The camera orbits around the head to allow a better assessment of 3D structure. Our method consistently produces 4D avatars of higher visual quality and 3D consistency even across challenging view deviations. Our avatar can also model realistic view-dependent lighting changes.

More Results

Effect of reference image quantity

CAP4D generates realistic avatars from a single reference image. The model can leverage additional reference images to recover details and geometry that are not visible in the first view, resulting in an overall improved reconstruction of the reference identity. We provide a side-by-side comparison of single-image, few-image, and many-image reconstructions below. The differences are subtle; however, notice the freckles and birthmarks that appear with more reference images.

Editing of appearance and lighting

We can edit our avatars by applying off-the-shelf image editing models to the reference image. Here, we demonstrate appearance editing (Stable-Makeup) and relighting (IC-Light).

4D animation from audio

The generated avatar is controlled via the FLAME 3DMM, so we can leverage off-the-shelf speech-driven animation models such as CodeTalker to animate it from input audio.
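
As a hedged sketch, audio-driven animation amounts to predicting a per-frame stream of FLAME parameters from speech and feeding it to the avatar. The tensors below only illustrate the expected driving signal; the commented render loop and avatar.render are hypothetical placeholders, not real APIs.

import torch

# Hedged sketch: a speech-driven model (e.g. CodeTalker) predicts per-frame
# FLAME parameters, which then drive the fitted avatar. The tensors below are
# zero-filled placeholders showing the expected shapes of that driving signal.

fps, seconds = 25, 4
num_frames = fps * seconds

expression = torch.zeros(num_frames, 100)   # FLAME expression coefficients
jaw_pose   = torch.zeros(num_frames, 3)     # FLAME jaw rotation (axis-angle)

# for t in range(num_frames):               # hypothetical rendering loop
#     frame = avatar.render(expression=expression[t], jaw_pose=jaw_pose[t])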

BibTeX


@article{taubner2024cap4d,
  title={{CAP4D}: Creating Animatable {4D} Portrait Avatars with Morphable Multi-View Diffusion Models}, 
  author={Felix Taubner and Ruihang Zhang and Mathieu Tuli and David B. Lindell},
  journal={arXiv preprint},
  year={2024}
}