Imagine the lips forming the Mona Lisa’s famous smile were to part, and she began “speaking” to you. This is not some sci-fi fantasy or a 3D face animation. It is an effect achieved by researchers from Samsung AI lab and Skolkovo Institute of Science and Technology, who used adversarial learning to generate a photorealistic talking head model.

AI techniques have already been used to generate realistic video of people like former US President Barack Obama and movie star Scarlett Johansson, enabled in large part by the abundance of available visual data on these individuals. The new research, however, shows it is also possible to generate realistic content when source images are rare. The researchers applied their Few-Shot Adversarial Learning technique to one of the most widely recognized humans in history, who is known through a single image: Lisa Gherardini, the subject of Leonardo da Vinci’s classic 16th-century portrait.

The new Few-Shot Adversarial Learning method is trained on existing talking head datasets. The model extracts face landmarks from video sequences in these datasets, transforms these landmarks into a set of realistic-looking images of the target person (for example, the Mona Lisa), then combines the images to synthesize a video or GIF in which the target person is animated as if in speech.

Talking head image synthesis
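The pipeline described above can be sketched in a few lines of Python. This is only an illustration of the data flow, not the researchers’ code: the names `animate_target` and `toy_generator`, the landmark format, and the file name are all hypothetical, and the trained generator is replaced by a placeholder that simply records what it was conditioned on.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: one frame's landmarks are a list of (x, y) points,
# and a "frame" is a plain dict standing in for an image.
Landmarks = List[Tuple[float, float]]
Frame = dict


def animate_target(
    driving_landmarks: Sequence[Landmarks],
    target_photos: Sequence[str],
    generate_frame: Callable[[Landmarks, Sequence[str]], Frame],
) -> List[Frame]:
    """Render one output frame per driving landmark set, preserving order."""
    return [generate_frame(lm, target_photos) for lm in driving_landmarks]


def toy_generator(landmarks: Landmarks, target_photos: Sequence[str]) -> Frame:
    # Placeholder for a trained generator: it only records its conditioning.
    return {"identity": target_photos[0], "pose": landmarks}


# Landmarks taken from two frames of a driving video (made-up values).
driving = [[(0.10, 0.20), (0.30, 0.40)], [(0.15, 0.20), (0.30, 0.45)]]

# A single source photo is enough in the one-shot case.
frames = animate_target(driving, ["mona_lisa.jpg"], toy_generator)
print(len(frames), frames[0]["identity"])
```

The resulting list of frames would then be encoded as a video or GIF. The identity comes from the target photos and the motion comes from the driving landmarks.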

Even one-shot learning is possible, as shown by the Mona Lisa experiment and in the images below. Of course, more training frames still produce higher realism.

Comparison of multi-shot training results

This few-shot learning superpower, however, does not come easily: extensive pretraining (meta-learning) on a large corpus of talking head videos is required.

Meta-learning process

As illustrated above, meta-learning begins with an embedder network that translates head images into embedding vectors. These embeddings are then used to predict the generator’s adaptive parameters. With those parameters set, the generator maps input face landmarks to output frames through a set of convolutional layers. Finally, an objective function combining perceptual and adversarial losses (the latter implemented via a conditional projection discriminator) compares the resulting image with the ground truth image.
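To make the order of these steps concrete, here is a toy forward pass in plain Python. Every function is a hypothetical stand-in: the real embedder, generator, and discriminator are convolutional networks, whereas this sketch uses short lists of numbers. Averaging the frames, the scale-and-shift form of the adaptive parameters, and the mean absolute error are assumptions made for illustration. The adversarial term is omitted entirely.

```python
from typing import List, Tuple

Vector = List[float]


def embed(frames: List[Vector]) -> Vector:
    """Toy embedder: average K frames element-wise into one identity vector."""
    k = len(frames)
    return [sum(column) / k for column in zip(*frames)]


def predict_adaptive_params(embedding: Vector) -> Tuple[Vector, Vector]:
    """Toy mapping from the embedding to per-channel (scale, shift) values."""
    scale = [1.0 + e for e in embedding]
    shift = [0.5 * e for e in embedding]
    return scale, shift


def generate(landmarks: Vector, scale: Vector, shift: Vector) -> Vector:
    """Toy generator: modulate landmark features with the adaptive parameters."""
    return [s * x + b for x, s, b in zip(landmarks, scale, shift)]


def reconstruction_loss(output: Vector, truth: Vector) -> float:
    """Mean absolute error, standing in for the perceptual term only."""
    return sum(abs(o - t) for o, t in zip(output, truth)) / len(truth)


# Made-up data: K = 2 source frames of one person, plus a held-out frame
# of the same person given as landmarks (input) and pixels (ground truth).
source_frames = [[0.2, 0.4, 0.6], [0.4, 0.4, 0.2]]
target_landmarks = [1.0, 0.0, -1.0]
ground_truth = [1.0, 0.5, -1.0]

identity = embed(source_frames)                    # step 1: embedding vector
scale, shift = predict_adaptive_params(identity)   # step 2: adaptive parameters
output = generate(target_landmarks, scale, shift)  # step 3: landmarks -> frame
loss = reconstruction_loss(output, ground_truth)   # step 4: compare with truth
print(f"loss = {loss:.3f}")
```

In actual training, the loss would be backpropagated through all of the networks. Here it is only computed, to show which quantities feed into which step.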

Two talking head video datasets (VoxCeleb1 and VoxCeleb2) were used for model testing. A quantitative comparison of different methods under several few-shot settings, along with the corresponding generated results for both datasets, is shown below.

Quantitative comparison of different methods

Source and ground truth images and the generated results comparing methods on the VoxCeleb1 dataset

Source and ground truth images and generated results comparing methods on the VoxCeleb2 dataset

For more details, see the paper Few-Shot Adversarial Learning of Realistic Neural Talking Head Models on arXiv.