
MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars

Felix Taubner1,2        Ruihang Zhang1       Mathieu Tuli3       Sherwin Bahmani1,2       David B. Lindell1,2
1University of Toronto        2Vector Institute        3LG Electronics
SIGGRAPH Asia 2025
arXiv · Code (coming soon)

TL;DR: Our model can create realistic 360-degree 4D portrait avatars from a single reference image.

Overview


MVP4D works in two stages: given a reference image and an input animation, a morphable multi-view video diffusion model (MMVDM) generates a collection of videos of the subject from different viewpoints. The generated videos are then distilled into a 4D representation, which can be rendered and viewed in real time.
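
To make the data flow concrete, below is a minimal Python sketch of the two-stage pipeline. The names (MorphableMultiViewVideoDiffusion, distill_to_4d) and the argument layout are hypothetical placeholders, since the official code has not yet been released; only the stages mirror the description above.

def build_avatar(reference_image, driving_animation):
    # Stage 1: the morphable multi-view video diffusion model (MMVDM)
    # generates videos of the subject from several viewpoints.
    mmvdm = MorphableMultiViewVideoDiffusion()          # hypothetical model class
    multi_view_videos = mmvdm.generate(
        reference=reference_image,                      # single reference image
        animation=driving_animation,                    # head pose + expression sequence
    )

    # Stage 2: distill the generated videos into a 4D representation
    # that supports real-time rendering and viewing.
    return distill_to_4d(multi_view_videos)             # hypothetical distillation step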

Interactive Viewer

Reference image(s) are shown alongside the interactive avatar. The viewer currently requires Chrome (Desktop) version 130 or later; unsupported browsers display a pre-rendered video instead.

Method

method figure

Overview of MVP4D. (a) The method takes as input a single reference image, which is encoded into the latent space of a variational autoencoder. An off-the-shelf face tracker (FlowFace) estimates 3DMM (FLAME) parameters for the reference image, from which we derive conditioning signals describing camera pose, head pose, and expression. Additional conditioning signals are associated with each noisy input latent according to the desired viewpoint, pose, and expression of the frame to be generated. A multi-view video diffusion transformer denoises the latent multi-view video, and the variational autoencoder then decodes the denoised latents into a series of multi-view image frames. The resulting multi-view video is distilled into a 4D representation for real-time visualization.
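
The data flow in the figure can also be summarized in a short sketch. Everything below is assumed rather than released: vae, tracker, and dit are hypothetical stand-ins for the variational autoencoder, the FlowFace tracker, and the multi-view video diffusion transformer, and the call signatures and tensor shapes are illustrative only.

import torch

def generate_multiview_video(vae, tracker, dit, reference_rgb,
                             target_cams, target_poses, target_expr,
                             views=8, frames=16, steps=50):
    # Encode the reference image into the VAE latent space.
    ref_latent = vae.encode(reference_rgb)              # e.g. (C, H, W)

    # Estimate FLAME (3DMM) parameters for the reference image and derive
    # its camera-pose / head-pose / expression conditioning signals.
    ref_cond = tracker(reference_rgb)

    # Conditioning for the desired viewpoint, pose, and expression of
    # every generated frame, one signal per noisy latent.
    cond = torch.cat([target_cams, target_poses, target_expr], dim=-1)

    # Iteratively denoise the latent multi-view video with the diffusion
    # transformer, conditioned on the reference and the target signals.
    latents = torch.randn(views, frames, *ref_latent.shape)
    for t in reversed(range(steps)):
        latents = dit(latents, timestep=t,
                      reference=(ref_latent, ref_cond), cond=cond)

    # Decode the denoised latents into multi-view RGB frames; these are
    # subsequently distilled into the 4D representation.
    return vae.decode(latents)

Note the design choice this mirrors: the conditioning is attached per noisy latent, so each generated view and frame can carry its own camera pose, head pose, and expression.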

Gallery

We show various avatars generated using MVP4D in different settings: 360-degree avatars from a single reference image, 120-degree avatars from a single reference image, and more challenging settings such as avatars created from text-prompted images. Please click the arrow buttons at the sides to view all results.

Reference images are shown in the left column, videos generated by the MMVDM in the middle column, and the final 4D avatars in the right column.

360-degree avatars

120-degree avatars

Baseline Comparisons

We conduct experiments on the cross-reenactment and self-reenactment tasks. For quantitative results, please refer to our paper.

Self-reenactment results for forward-facing avatars. We show more qualitative results from our self-reenactment evaluation on the NeRSemble dataset. We compare multi-view diffusion models (top row) and 4D avatars (bottom row). CAP4D's MMDM produces sequences with significant temporal flickering, whereas our MMVDM generates smooth and consistent videos. Our resulting 4D avatar is more detailed and realistic than those of previous methods, especially for dynamic details such as wrinkles and challenging structures such as hair.

Self-reenactment results for 360-degree avatars. We show more qualitative results from our self-reenactment evaluation on the RenderMe-360 dataset. Our MMVDM generates significantly better 360-degree views than previous methods, leading to a much more robust 4D avatar that is viewable from all directions.

Cross-reenactment results. We generate an avatar from a single image from the FFHQ dataset. The camera orbits around the head to allow a better assessment of 3D structure. Our method consistently produces 4D avatars with higher visual quality, better 3D consistency, and more realistic motion, even across challenging view deviations. Our avatars can also model challenging structures such as hair and glasses.

More Results

Speech-driven 4D animation

We can use off-the-shelf speech-driven portrait video animation models such as Hallo3 to animate MVP4D avatars from input speech audio, as sketched below. Volume on!
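
As a rough sketch of this pipeline (the names hallo3_generate, flowface_track, and avatar_4d.render below are hypothetical wrappers; neither project exposes this exact API), the audio first drives a 2D portrait video, which is then tracked and replayed through the 4D avatar:

def animate_from_speech(avatar_4d, reference_rgb, audio_path):
    # 1. An off-the-shelf audio-driven animator (e.g. Hallo3) produces a
    #    2D driving video from the reference image and the speech audio.
    driving_video = hallo3_generate(reference_rgb, audio_path)   # hypothetical wrapper

    # 2. Track the driving video to recover a FLAME head-pose and
    #    expression sequence (e.g. with FlowFace).
    flame_sequence = flowface_track(driving_video)               # hypothetical wrapper

    # 3. Replay the tracked animation through the distilled 4D avatar.
    return avatar_4d.render(flame_sequence)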

Avatars from text prompts

Using commercially available image generation models, we can generate 4D avatars from text descriptions.

Columns: reference image (left), videos generated with the MMVDM (middle), and the final MVP4D avatar (right).

BibTeX


@misc{taubner2025mvp4d,
  title={{MVP4D}: Multi-View Portrait Video Diffusion for Animatable {4D} Avatars}, 
  author={Felix Taubner and Ruihang Zhang and Mathieu Tuli and Sherwin Bahmani and David B. Lindell},
  year={2025},
  eprint={2510.12785},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.12785}, 
}