Audio-driven 3D facial animation is a familiar topic in computer vision. However, due to the lack of available 3D datasets, models, and standard evaluation metrics, current 3D facial animations still fall short of natural human speaking behaviour.

In a new paper accepted at CVPR 2019, researchers from the Max Planck Institute for Intelligent Systems introduce a unique 4D (moving 3D images) face dataset and a model learned from it, VOCA (Voice Operated Character Animation). VOCA is a simple, generic voice-driven facial animation framework that can be applied to various facial shapes and generalizes well across speech sources, languages, and 3D face templates. Both the dataset and the VOCA model have been open-sourced on GitHub.

From the Max Planck Institute for Intelligent Systems' summary:

“We introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation) takes any speech signal as input — even speech in languages other than English — and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation.” (MPIIS).

The Max Planck Institute for Intelligent Systems responded to Synced questions regarding VOCA.

How would you describe VOCA?

VOCA (Voice Operated Character Animation) is a simple and generic speech driven facial animation framework that works across a range of identities. VOCA takes any speech signal and a static 3D head mesh as input and outputs a realistic facial animation. VOCA leverages recent advances in speech processing and 3D face modeling in order to generalize to new subjects. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. VOCA is trained on a self-captured multi-subject 4D face dataset (VOCASET).
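At a high level, this input/output contract can be sketched as a function from per-frame speech features and a static template mesh to a sequence of deformed meshes. The sketch below is illustrative only, not the authors' code; the function names and array shapes are assumptions:

```python
import numpy as np

def voca_animate(audio_features, template_vertices, predict_offsets):
    """Sketch of VOCA's input/output contract (illustrative, not the real model).

    audio_features:    (T, F) per-frame speech features
    template_vertices: (V, 3) static 3D head mesh of the subject
    predict_offsets:   trained regressor mapping one frame's features to
                       (V, 3) vertex displacements
    Returns (T, V, 3): one deformed mesh per animation frame.
    """
    frames = []
    for t in range(audio_features.shape[0]):
        offsets = predict_offsets(audio_features[t])  # speech-driven motion
        frames.append(template_vertices + offsets)    # deform the template
    return np.stack(frames)
```

Because the output is always "template plus offsets", the same learned motion can be applied to any head mesh that shares the template's topology.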

Could you identify the key technologies in this research?

VOCA demonstrates how to combine different building blocks to obtain a simple and generic speech driven facial animation framework:

1) Using DeepSpeech — i.e. a pre-trained speech-to-text model — as an audio feature extractor provides robustness w.r.t. different audio sources due to its large training corpus (hundreds of hours of speech).

2) Conditioning on speaker style enables training across subjects (i.e. without conditioning, training across subjects regresses an implausible average motion), and synthesizing combinations of speaker styles during test time.

3) Factoring identity from facial motion allows us to animate a wide range of adult faces.

4) Using the same mesh topology as the publicly available FLAME head model allows us to reconstruct subject-specific 3D head templates from a scan or an image. The FLAME topology further enables us to edit identity-dependent shape and head pose during animation.
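Point 2) amounts to appending a one-hot speaker-identity code to each window of audio features before regressing vertex offsets, so the network can separate shared lip motion from per-subject style. A minimal sketch of that conditioning step, with illustrative names and shapes (not the authors' implementation):

```python
import numpy as np

def condition_on_style(audio_window, subject_id, num_subjects):
    """Append a one-hot speaker-style code to a window of audio features.

    audio_window: (W, F) speech features for one animation frame
    subject_id:   index of the training subject whose style to imitate
    Returns a flat vector for the offset-regression network. At test time,
    replacing the one-hot code with a blend (e.g. 0.5/0.5 over two subjects)
    synthesizes combinations of speaking styles.
    """
    one_hot = np.zeros(num_subjects)
    one_hot[subject_id] = 1.0
    return np.concatenate([audio_window.ravel(), one_hot])
```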

What impact might this research bring to the research community?

VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance.

Can you identify any bottlenecks in the research?

While VOCA can be used to realistically animate a wide range of adult faces from speech, it still lacks some of the details needed for conversational realism. Upper-face motions (i.e. eyes and eyebrows) are not strongly correlated with the audio. The causal factor is emotion, which is absent in the data due to the inherent difficulty of simulating emotional speech in a controlled capture environment. Thus, VOCA learns the causal facial motions from speech, which are mostly present in the lower face.

Non-verbal communication cues, such as head motion, are weakly correlated with the audio signal and hence are not modeled well by audio-driven techniques. VOCA offers animators and developers the possibility to include head motion, but does not infer it from data. A speech-independent model for head motion could be used to generate realistic results. Application-specific techniques such as dyadic interactions between animated assistants and humans require attention mechanisms that consider spatial features such as eye tracking.
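Since VOCA leaves head motion to the animator, "including head motion" concretely means rotating the animated mesh about a joint. FLAME's actual pose space uses linear blend skinning with learned joints and per-vertex weights; the rigid rotation below is only a toy illustration of that control, with all names and the single-axis rotation being assumptions:

```python
import numpy as np

def rotate_about_joint(vertices, joint, angle_rad):
    """Toy head-pose edit: rigidly rotate mesh vertices about a joint's y-axis.

    vertices: (V, 3) mesh vertices, joint: (3,) pivot position.
    A real FLAME-style pose edit would blend per-joint rotations with
    skinning weights instead of moving every vertex rigidly.
    """
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])          # rotation about the y-axis
    return (vertices - joint) @ R.T + joint  # rotate about the pivot
```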

Can you predict any potential future developments related to this research?

Increasing realism by adding emotion and non-verbal cues such as head motion and eye gaze is a future line of research.

VOCA aims at animating faces from audio. A future line of research is to learn richer conversational models with expressive bodies, i.e. considering not only the face but also modeling body language.

The paper Capture, Learning, and Synthesis of 3D Speaking Styles is here. For more project details, please visit the project page.