3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow

Felix Taubner, Prashant Raina, Mathieu Tuli, Eu Wern Teh, Chul Lee, Jinmiao Huang
LG Electronics
CVPR 2024

Using a two-stage pipeline consisting of dense 2D alignment (center) and 3D model fitting (right), our face tracker accurately tracks faces across challenging poses and expressions.

Abstract

When working with 3D facial data, improving fidelity and avoiding the uncanny valley effect is critically dependent on accurate 3D facial performance capture. Because such methods are expensive and due to the widespread availability of 2D videos, recent methods have focused on how to perform monocular 3D face tracking. However, these methods often fall short in capturing precise facial movements due to limitations in their network architecture, training, and evaluation processes. Addressing these challenges, we propose a novel face tracker, FlowFace, that introduces an innovative 2D alignment network for dense per-vertex alignment. Unlike prior work, FlowFace is trained on high-quality 3D scan annotations rather than weak supervision or synthetic data. Our 3D model fitting module jointly fits a 3D face model from one or many observations, integrating existing neutral shape priors for enhanced identity and expression disentanglement and per-vertex deformations for detailed facial feature reconstruction. Additionally, we propose a novel metric and benchmark for assessing tracking accuracy. Our method exhibits superior performance on both custom and publicly available benchmarks. We further validate the effectiveness of our tracker by generating high-quality 3D data from 2D videos, which leads to performance gains on downstream tasks.

Method

pipeline overview

Our face tracker works in two stages: First, a dense, per-vertex 2D alignment is predicted from the input images. Then, a 3D reconstruction is obtained by optimizing a 3D head and camera model to fit the alignment.
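As a rough illustration of how the two stages connect, the following minimal Python sketch wires a dense alignment network into a joint fitting step. The function names, arguments, and tensor shapes are assumptions for illustration only, not the released code.

    def track_faces(frames, align_fn, fit_fn):
        """Two-stage tracking sketch (hypothetical interface, not the released API).

        frames:   (T, 3, H, W) video tensor.
        align_fn: stage 1 - dense per-vertex 2D alignment network.
        fit_fn:   stage 2 - 3D model fitting optimizer.
        """
        # Stage 1: per-vertex screen-space positions (T, V, 2) and confidences (T, V).
        positions, confidences = align_fn(frames)
        # Stage 2: jointly fit FLAME parameters, per-vertex offsets and camera
        # intrinsics to all T observations at once.
        return fit_fn(positions, confidences)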

2D Alignment

2D alignment module
Our 2D alignment model predicts the screen-space position and confidence of every vertex of the FLAME head model. As an intermediate step, the network predicts a pixel-wise mapping between the UV space of the head mesh and image space, which we call the UV-to-image flow. First, deep features are extracted from the input image with a SegFormer backbone. In UV space, a positional embedding is generated for every pixel; these embeddings are static, meaning they are learned during training but remain fixed for any input image at inference time. The UV embedding map and the image feature map are then fed into the RAFT optical flow network, which predicts the UV-to-image flow together with its confidence. The UV-to-image flow maps every pixel in UV space to a position in image space. Using each vertex's UV coordinate, the image-space position and confidence are sampled from this mapping for every vertex of the head mesh. This architecture lets low-level image features flow into the alignment prediction, yielding high positional accuracy and temporal consistency. Furthermore, we train the network on high-quality 3D datasets for strong 3D consistency.
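To make the sampling step concrete, here is a minimal PyTorch sketch of how per-vertex positions and confidences could be read off the predicted UV-to-image flow via bilinear sampling. The tensor shapes, the [0, 1] UV convention, and the function name are assumptions for illustration; this is not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def sample_vertex_alignment(flow_uv, conf_uv, vertex_uvs):
        """flow_uv: (B, 2, H, W) image-space (x, y) per UV pixel,
        conf_uv: (B, 1, H, W) confidence per UV pixel,
        vertex_uvs: (V, 2) FLAME vertex UV coordinates in [0, 1]."""
        B = flow_uv.shape[0]
        # grid_sample expects sampling locations in [-1, 1].
        grid = vertex_uvs[None, :, None, :] * 2.0 - 1.0          # (1, V, 1, 2)
        grid = grid.expand(B, -1, -1, -1)                        # (B, V, 1, 2)
        # Bilinearly sample the flow and confidence at each vertex's UV coordinate.
        pos = F.grid_sample(flow_uv, grid, align_corners=False)  # (B, 2, V, 1)
        conf = F.grid_sample(conf_uv, grid, align_corners=False) # (B, 1, V, 1)
        positions = pos.squeeze(-1).permute(0, 2, 1)             # (B, V, 2)
        confidences = conf.squeeze(-1).squeeze(1)                # (B, V)
        return positions, confidences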

3D Model Fitting

3D model fitting
With the 2D alignment of each vertex (image-space position and confidence), a 3D model can be fitted to the observations. In the 3D model fitting stage, the parameters of our head and camera models are optimized with respect to an energy function (similar to bundle adjustment). We use the FLAME 3D morphable model as our head model; its learned shape and expression spaces serve as the geometry prior for the underconstrained monocular 3D tracking problem. We additionally allow per-vertex deformations, which is possible thanks to the dense alignment data. The parameters optimized during the 3D model fitting stage are the head pose, FLAME parameters, per-vertex deformations, and camera intrinsics. The objective energy function is based on the 2D reprojection error of the 3D vertices, weighted by their confidence. We further regularize the energy with a neutral shape prediction from MICA, the FLAME parameters, and the acceleration of each vertex. This formulation easily extends to multiple views, and thanks to the memory efficiency of the alignment data, many frames (>1000) can be fitted simultaneously.
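The energy described above can be sketched as a single confidence-weighted objective. The PyTorch-style sketch below is illustrative only: the weights, the generic camera projection callable, and the simplified regularizer forms are assumptions, and the paper's exact terms may differ.

    def fitting_energy(verts_3d, cam_project, obs_pos, obs_conf,
                       shape_params, mica_shape, expr_params,
                       w_reproj=1.0, w_shape=1e-3, w_expr=1e-4, w_accel=1e-2):
        """verts_3d: (T, V, 3) posed FLAME vertices incl. per-vertex offsets, T >= 3.
        cam_project: callable mapping (T, V, 3) -> (T, V, 2) image coordinates.
        obs_pos: (T, V, 2) predicted 2D alignment, obs_conf: (T, V) confidences.
        shape_params / mica_shape: FLAME identity coefficients and MICA prediction.
        expr_params: (T, n_expr) FLAME expression coefficients."""
        # Confidence-weighted 2D reprojection term.
        proj = cam_project(verts_3d)
        e_reproj = (obs_conf.unsqueeze(-1) * (proj - obs_pos) ** 2).mean()
        # Neutral shape prior: keep the identity close to the MICA prediction.
        e_shape = ((shape_params - mica_shape) ** 2).mean()
        # Keep FLAME expression coefficients small.
        e_expr = (expr_params ** 2).mean()
        # Temporal smoothness: penalize per-vertex acceleration across frames.
        accel = verts_3d[2:] - 2.0 * verts_3d[1:-1] + verts_3d[:-2]
        e_accel = (accel ** 2).mean()
        return (w_reproj * e_reproj + w_shape * e_shape
                + w_expr * e_expr + w_accel * e_accel)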

Qualitative Comparison

Videos

Qualitative results on our benchmark based on the Multiface dataset. Our model produces more accurate 3D reconstructions that are better aligned with the input video than the photometric head tracker MPT, as well as DECA, HRN, and 3DDFA-v2.

Images

Single image reconstruction on images from the FFHQ dataset. Our method demonstrates better overall alignment and reconstruction. (a) input image, (b) our 2D alignment, (c, d) our 3D reconstruction, (e) HRN, (f) DECA, (g) PRNet, (h) 3DDFA-v2.

More Results

Our head tracker applied to videos from the CelebV-HQ dataset. Despite being trained only on images captured in controlled lab environments, our model generalizes well to videos captured in-the-wild.

Video Presentation

BibTeX

@InProceedings{taubner2024flowface,
    author    = {Taubner, Felix and Raina, Prashant and Tuli, Mathieu and Teh, Eu Wern and Lee, Chul and Huang, Jinmiao},
    title     = {3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {1227-1237}
}