
MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars

Felix Taubner1,2        Ruihang Zhang1       Mathieu Tuli3       Sherwin Bahmani1,2       David B. Lindell1,2
1University of Toronto        2Vector Institute        3LG Electronics
SIGGRAPH Asia 2025
arXiv · Code (coming soon)

TL;DR: Our model can create realistic 360-degree 4D portrait avatars from a single reference image.

Overview


MVP4D works in two stages: given a reference image and an input animation, a morphable multi-view video diffusion model (MMVDM) generates a collection of videos of the subject from different viewpoints. The generated videos are then distilled into a 4D representation, which can be rendered and viewed in real time.
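
To make the data flow concrete, below is a minimal Python sketch of the two-stage pipeline. The names (MorphableMultiViewVideoDiffusion, distill_to_4d) and the argument layout are hypothetical placeholders, since the official code has not yet been released; only the stages mirror the description above.

def build_avatar(reference_image, driving_animation):
    # Stage 1: the morphable multi-view video diffusion model (MMVDM)
    # generates videos of the subject from several viewpoints.
    mmvdm = MorphableMultiViewVideoDiffusion()          # hypothetical model class
    multi_view_videos = mmvdm.generate(
        reference=reference_image,                      # single reference image
        animation=driving_animation,                    # head pose + expression sequence
    )

    # Stage 2: distill the generated videos into a 4D representation
    # that supports real-time rendering and viewing.
    return distill_to_4d(multi_view_videos)             # hypothetical distillation step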

Interactive Viewer

Reference image(s) are shown alongside the interactive avatar. The viewer currently requires Chrome (Desktop) version 130 or later; unsupported browsers display a pre-rendered video instead.

Method

method figure

Overview of MVP4D. (a) The method takes as input a single reference image, which is encoded into the latent space of a variational autoencoder. An off-the-shelf face tracker (FlowFace) estimates 3DMM (FLAME) parameters for the reference image, from which we derive conditioning signals describing camera pose, head pose, and expression. Additional conditioning signals are associated with each noisy input latent according to the desired viewpoint, pose, and expression of the frame to be generated. A multi-view video diffusion transformer denoises the latent multi-view video, and the variational autoencoder then decodes the denoised latents into a series of multi-view image frames. The resulting multi-view video is distilled into a 4D representation for real-time visualization.
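
The data flow in the figure can also be summarized in a short sketch. Everything below is assumed rather than released: vae, tracker, and dit are hypothetical stand-ins for the variational autoencoder, the FlowFace tracker, and the multi-view video diffusion transformer, and the call signatures and tensor shapes are illustrative only.

import torch

def generate_multiview_video(vae, tracker, dit, reference_rgb,
                             target_cams, target_poses, target_expr,
                             views=8, frames=16, steps=50):
    # Encode the reference image into the VAE latent space.
    ref_latent = vae.encode(reference_rgb)              # e.g. (C, H, W)

    # Estimate FLAME (3DMM) parameters for the reference image and derive
    # its camera-pose / head-pose / expression conditioning signals.
    ref_cond = tracker(reference_rgb)

    # Conditioning for the desired viewpoint, pose, and expression of
    # every generated frame, one signal per noisy latent.
    cond = torch.cat([target_cams, target_poses, target_expr], dim=-1)

    # Iteratively denoise the latent multi-view video with the diffusion
    # transformer, conditioned on the reference and the target signals.
    latents = torch.randn(views, frames, *ref_latent.shape)
    for t in reversed(range(steps)):
        latents = dit(latents, timestep=t,
                      reference=(ref_latent, ref_cond), cond=cond)

    # Decode the denoised latents into multi-view RGB frames; these are
    # subsequently distilled into the 4D representation.
    return vae.decode(latents)

Note the design choice this mirrors: the conditioning is attached per noisy latent, so each generated view and frame can carry its own camera pose, head pose, and expression.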

Gallery

We show various avatars generated using MVP4D in different settings: 360-degree avatars from a single reference image, 120-degree avatars from a single reference image, and more challenging settings such as avatars created from text-prompted images. Please click the arrow buttons at the sides to view all results.

Reference images are shown in the left column, videos generated by the MMVDM in the middle column, and the final 4D avatars in the right column.

360-degree avatars

120-degree avatars

Baseline Comparisons

We conduct experiments on the cross-reenactment and self-reenactment tasks. For quantitative results, please refer to our paper.

Self-reenactment results for forward-facing avatars. We show more qualitative results from our self-reenactment evaluation on the NeRSemble dataset. We compare multi-view diffusion models (top row) and 4D avatars (bottom row). CAP4D's MMDM produces sequences with significant temporal flickering, whereas our MMVDM generates smooth and consistent videos. Our resulting 4D avatar is more detailed and realistic than those of previous methods, especially for dynamic details such as wrinkles and challenging structures such as hair.

Self-reenactment results for 360-degree avatars. We show more qualitative results from our self-reenactment evaluation on the RenderMe-360 dataset. Our MMVDM generates significantly better 360-degree views than previous methods, leading to a much more robust 4D avatar that is viewable from all directions.

Cross-reenactment results. We generate an avatar from a single image from the FFHQ dataset. The camera orbits around the head to allow a better assessment of 3D structure. Our method consistently produces 4D avatars with higher visual quality, better 3D consistency, and more realistic motion, even across challenging view deviations. Our avatars can also model challenging structures such as hair and glasses.

More Results

Speech-driven 4D animation

We can use off-the-shelf speech-driven portrait video animation models such as Hallo3 to animate MVP4D avatars from input speech audio, as sketched below. Volume on!
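
As a rough sketch of this pipeline (the names hallo3_generate, flowface_track, and avatar_4d.render below are hypothetical wrappers; neither project exposes this exact API), the audio first drives a 2D portrait video, which is then tracked and replayed through the 4D avatar:

def animate_from_speech(avatar_4d, reference_rgb, audio_path):
    # 1. An off-the-shelf audio-driven animator (e.g. Hallo3) produces a
    #    2D driving video from the reference image and the speech audio.
    driving_video = hallo3_generate(reference_rgb, audio_path)   # hypothetical wrapper

    # 2. Track the driving video to recover a FLAME head-pose and
    #    expression sequence (e.g. with FlowFace).
    flame_sequence = flowface_track(driving_video)               # hypothetical wrapper

    # 3. Replay the tracked animation through the distilled 4D avatar.
    return avatar_4d.render(flame_sequence)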

Avatars from text prompts

Using commercially available image generation models, we can generate 4D avatars from text descriptions.

Columns: reference image (left), videos generated with the MMVDM (middle), and the final MVP4D avatar (right).

BibTeX


@misc{taubner2025mvp4d,
  title={{MVP4D}: Multi-View Portrait Video Diffusion for Animatable {4D} Avatars}, 
  author={Felix Taubner and Ruihang Zhang and Mathieu Tuli and Sherwin Bahmani and David B. Lindell},
  year={2025},
  eprint={2510.12785},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.12785}, 
}