“The estimation of 3D face shape from a single image must be robust to variations in lighting, head pose, expression, facial hair, makeup, and occlusions. Robustness requires a large training set of in-the-wild images, which by construction, lack ground truth 3D shape.” (MPIIS).

In a new paper accepted at CVPR 2019, researchers from the Max Planck Institute for Intelligent Systems introduce RingNet, an end-to-end trainable network that learns to compute 3D face shape from a single face image without 3D supervision. The researchers also built a new benchmark dataset and a 3D reconstruction benchmark challenge, NoW, both of which have been open-sourced on GitHub.

The Max Planck Institute responded to Synced's questions regarding their new paper, RingNet, and the open challenge.

How would you describe RingNet?

RingNet is an end-to-end trainable network that enforces shape consistency across face images of the same subject under varying viewing angles, lighting conditions, resolution, and occlusion. It learns 3D face geometry from 2D images, yet needs only a single image for inference.
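The shape-consistency idea can be illustrated with a small sketch: shape codes predicted from different images of the same subject should be closer to each other than to the code of a different subject, by a margin. This is a rough, hedged illustration only; the function name `shape_consistency_loss`, the margin value, and the exact distance formulation are assumptions, not the paper's actual loss.

```python
import numpy as np

def shape_consistency_loss(same_shapes, other_shape, margin=0.5):
    """Hinge-style consistency loss (illustrative sketch, not RingNet's exact loss).

    same_shapes: (R, D) array of shape codes from R images of one subject.
    other_shape: (D,) shape code from a different subject.
    Penalizes cases where two codes of the same subject are farther apart
    than a same-subject code is from the other subject's code, plus a margin.
    """
    loss = 0.0
    R = len(same_shapes)
    for i in range(R):
        for j in range(R):
            if i == j:
                continue
            d_same = np.sum((same_shapes[i] - same_shapes[j]) ** 2)
            d_diff = np.sum((same_shapes[i] - other_shape) ** 2)
            loss += max(0.0, d_same - d_diff + margin)
    return loss / (R * (R - 1))
```

With identical same-subject codes and a distant other-subject code, the loss is zero; it grows as same-subject codes drift apart or toward the other subject.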

Why does this research matter?

The idea behind RingNet is quite general, even though it is demonstrated only on faces. One could potentially apply it to other 3D reconstruction tasks. In this work, our researchers also introduce a 3D reconstruction benchmark challenge, NoW, along with an evaluation metric, to provide the research community with quantitative feedback that has been lacking in this field. The aim is to encourage other researchers to participate in the challenge and go beyond visual comparisons.

Since RingNet reconstructs a 3D face, including the neck and full head, from a single image, the technique could be used in the animation industry or in various face apps. Many interesting applications could come from combining RingNet with the VOCA project (a voice-driven facial animation model): for example, using RingNet to prepare a template mesh for VOCA and then animating it with audio, i.e., producing a talking head from a single face image.

Could you describe the Challenge NoW in more detail?

The goal of this benchmark is to measure the accuracy and robustness of 3D face reconstruction methods under variations in viewing angle, lighting, and common occlusions, using a standard evaluation metric.

The NoW dataset, introduced to run the challenge, contains 2,054 2D images of 100 subjects captured with an iPhone X, along with a separate 3D head scan for each subject. The head scans serve as ground truth for the evaluation. The subjects were selected to cover variations in age, BMI, and sex (55 female, 45 male).

The challenge for all categories is to reconstruct a neutral 3D face given a single monocular image. Note that facial expressions are present in several images, so methods must disentangle identity and expression in order for the quality of the predicted identity to be evaluated.
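To make the evaluation setup concrete, a common way to score reconstructions against ground-truth scans is a scan-to-mesh distance. The sketch below is an illustration only, not NoW's official metric (which also involves rigid alignment of the prediction to the scan); the function name `scan_to_mesh_error` and the nearest-vertex approximation are assumptions.

```python
import numpy as np

def scan_to_mesh_error(scan_points, predicted_vertices):
    """For each ground-truth scan point, compute the distance to the nearest
    vertex of the predicted mesh (a vertex-level approximation of the true
    point-to-surface distance), then summarize the error distribution."""
    # Brute-force nearest-neighbor search; fine for illustration-sized inputs.
    diffs = scan_points[:, None, :] - predicted_vertices[None, :, :]
    dists = np.linalg.norm(diffs, axis=2).min(axis=1)
    return {"median": float(np.median(dists)),
            "mean": float(np.mean(dists)),
            "std": float(np.std(dists))}
```

Reporting the median alongside the mean makes the score robust to a few badly reconstructed regions, which is why such summary statistics are typical for reconstruction benchmarks.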

Can you identify any bottlenecks in the research?

A bottleneck in this research area is that people tend to rely only on 2D landmarks, which constrains the quality of the 3D reconstruction to some extent. Using dense correspondences should push the limit to a new level.

Why do we need 3D when 2D is already looking good?

People may find some 2D face animations (like the Obama lip-sync work, Kumar et al., 2017) already quite realistic. Although such results can look good when learned from huge datasets, we lack the ability to manipulate these 2D models accurately. Additionally, looking good is not enough; we need to understand what is really going on. We live in a 3D world, and that is what lies behind every 2D picture and movie frame. Without 3D information, we cannot ask a GAN trained only on 2D images and landmarks to maintain the face shape in each frame as the head rotates. A 3D model also provides richer correspondence than 2D, for example regarding each pixel's relevance. Nowadays, face tracking by 2D landmark localization performs quite well, but landmarks alone cannot provide dense correspondence between frames. This is the key motivation for making 3D reconstruction more accurate and robust.

The paper Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision is here.