Generative adversarial networks (GANs) have become AI researchers’ “go-to” technique for generating photo-realistic synthetic images. Now, DeepMind researchers say that there may be a better option.

In a new paper, the Google-owned research company introduces its VQ-VAE-2 model for large-scale image generation. The model is said to yield results competitive with the state-of-the-art generative model BigGAN in synthesizing high-resolution images, while delivering broader diversity and overcoming some inherent shortcomings of GANs.

“We use a hierarchical VQVAE which compresses images into a latent space which is about 50x smaller for ImageNet and 200x smaller for FFHQ Faces. The PixelCNN only models the latents, allowing it to spend its capacity on the global structure and most perceivable features,” tweeted DeepMind Researcher Aäron van den Oord, a co-first author of the paper.

BigGAN was introduced last year by DeepMind. Widely regarded as the most powerful model for image generation, BigGAN has become a favourite across academia: it tripled the Inception Score (to 166.3) over the previous state-of-the-art result and improved the Fréchet Inception Distance (FID), where lower is better, from 18.65 to 9.6. This February, DeepMind introduced BigGAN-Deep, which outperforms its predecessor.

DeepMind admits the GAN-based image generation technique is not flawless: it can suffer from mode collapse (the generator produces only a limited variety of samples), lack of diversity (generated samples do not fully capture the diversity of the true data distribution), and evaluation challenges.

These issues prompted DeepMind to explore the use of Variational AutoEncoders (VAEs), an unsupervised learning approach that trains the model to learn representations from datasets. In their NIPS 2017 paper Neural Discrete Representation Learning, DeepMind researchers introduced VQ-VAE, or Vector Quantised Variational AutoEncoder, a VAE variant comprising an encoder that transforms image data into discrete rather than continuous latent variables (representations), and a decoder that reconstructs images from these variables.
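To make the idea of discrete latents concrete, here is a minimal sketch of the vector quantisation step: each vector produced by the encoder is replaced by the index of its nearest entry in a codebook. The function names, the three-entry codebook and the 2 × 2 grid of encoder outputs are all hypothetical values chosen for illustration, not taken from the paper or its code.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`.

    Nearness is measured by squared Euclidean distance.
    """
    best_index = min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vector, codebook[k])),
    )
    return best_index, codebook[best_index]


def quantize_grid(encoder_output, codebook):
    """Map an H x W grid of encoder vectors to an H x W grid of code indices."""
    return [[quantize(vec, codebook)[0] for vec in row] for row in encoder_output]


# Hypothetical toy values: 3 codebook entries of dimension 2, and a 2 x 2 grid.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
encoder_output = [
    [[0.1, -0.2], [0.9, 1.2]],
    [[-0.8, 1.7], [0.4, 0.4]],
]

indices = quantize_grid(encoder_output, codebook)
print(indices)  # [[0, 1], [2, 0]]
```

The grid of integers is the discrete representation. A decoder would look each index up in the codebook again and reconstruct the image from the resulting vectors.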

The first innovation introduced in DeepMind’s new paper comes from a simple tactic: remove the majority of unimportant image information during training without reducing the quality of image generation. DeepMind researchers said the idea was inspired by a longstanding image file format that everyone will be familiar with, JPEG, which achieves 10:1 image compression with little perceptible loss in image quality. The DeepMind neural network-based encoder compresses a 256 × 256 image to a 64 × 64 vector representation (each side reduced by a factor of four) and a 32 × 32 representation (each side reduced by a factor of eight).
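The grid sizes above follow from simple arithmetic, sketched below. The helper name is hypothetical. The calculation counts grid positions only and says nothing about how many bits each position needs, so it is not meant to reproduce the compression ratios quoted in the tweet earlier.

```python
def latent_grid(image_side, downsample_factor):
    """Return (side, number of positions) of a square latent grid."""
    side = image_side // downsample_factor
    return side, side * side


image_side = 256
bottom_side, bottom_positions = latent_grid(image_side, 4)  # 64 x 64 grid
top_side, top_positions = latent_grid(image_side, 8)        # 32 x 32 grid

pixel_positions = image_side * image_side            # 65,536 pixel positions
latent_positions = bottom_positions + top_positions  # 4,096 + 1,024 = 5,120

print(bottom_side, top_side)               # 64 32
print(pixel_positions / latent_positions)  # 12.8
```

In other words, the two latent grids together hold roughly one thirteenth as many positions as the image has pixels.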

These two layers of representation inform the researchers’ second innovation: a hierarchical framework. The 64 × 64 vector representation captures the image’s local information, such as texture, while the 32 × 32 representation targets global information such as object shape and geometry. The decoder then reconstructs an image from the two representations.

The image generation stage is also split by layer: a PixelCNN model with multi-headed self-attention layers models global information, and a second PixelCNN model with a deep residual conditioning stack models local features. Moreover, DeepMind’s hierarchical framework is not limited to two layers: to generate larger images (for example, 1024 × 1024), additional layers can be added as required.
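The order of generation can be sketched as follows: sample the coarse grid of codes first, then sample the fine grid conditioned on it. This is only an illustration of the data flow. The two functions are random stand-ins for the PixelCNN models (they are not autoregressive and learn nothing), and the codebook size of 512 and the parity-based conditioning rule are assumptions made up for the example.

```python
import random

CODEBOOK_SIZE = 512  # assumed number of discrete codes per level (illustrative)


def sample_top_codes(side, rng):
    """Stand-in for the top-level prior: one code index per grid position."""
    return [[rng.randrange(CODEBOOK_SIZE) for _ in range(side)] for _ in range(side)]


def sample_bottom_codes(top_codes, scale, rng):
    """Stand-in for the bottom-level prior, conditioned on the top-level codes.

    Each bottom position reads the top code that covers it. The toy
    conditioning rule keeps the bottom code's parity equal to that top
    code's parity.
    """
    side = len(top_codes) * scale
    bottom = []
    for row in range(side):
        bottom_row = []
        for col in range(side):
            top_code = top_codes[row // scale][col // scale]
            even_offset = rng.randrange(0, CODEBOOK_SIZE, 2)
            bottom_row.append((top_code + even_offset) % CODEBOOK_SIZE)
        bottom.append(bottom_row)
    return bottom


rng = random.Random(0)
top = sample_top_codes(32, rng)            # global structure first
bottom = sample_bottom_codes(top, 2, rng)  # local detail second, given the top codes
print(len(top), len(bottom))               # 32 64
```

In the real system, the decoder would then turn the two grids of codes into pixels in a single pass.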

Researchers used ImageNet and FFHQ as the datasets in their experiments. Trained on ImageNet 256 × 256 images, VQ-VAE-2 generated comparably high-fidelity images and delivered higher diversity than BigGAN. On FFHQ 1024 × 1024 high-resolution face data, VQ-VAE-2 generated realistic facial images while still covering some features represented only sparsely in the training dataset. The paper also discusses other evaluation metrics for testing VQ-VAE-2 performance.

VQ-VAE samples (left) and BigGAN-Deep samples (right) trained on ImageNet.

VQ-VAE generated facial images.

DeepMind Researcher Oriol Vinyals tweeted, “Surprising how simple ideas can yield such a good generative model! -Mean Squared Error loss on pixels -Non-autoregressive image decoder -Discrete latents w/ straight through estimator.” Vinyals also contributed to the creation of BigGAN and PixelCNN.
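For readers who want to see how the three ingredients in the tweet fit together, the VQ-VAE training objective is commonly written as below. The notation is assumed here rather than quoted from the article: E is the encoder, D the decoder, e the codebook vector nearest to E(x), sg the stop-gradient operator, and β a weighting hyperparameter. Treat it as a sketch of the standard formulation, not a full account of how the new model is trained.

```latex
\mathcal{L}(x, D(e)) = \lVert x - D(e) \rVert_2^2
  + \lVert \operatorname{sg}[E(x)] - e \rVert_2^2
  + \beta \, \lVert \operatorname{sg}[e] - E(x) \rVert_2^2

z_q = E(x) + \operatorname{sg}[\, e - E(x) \,]
```

The first term is the mean squared error on pixels. The second pulls the codebook vectors towards the encoder outputs, and the third keeps the encoder outputs close to the codes they select. The second line is one common way to express the straight-through estimator: the forward pass uses the discrete code e, while gradients flow back to the encoder as if quantisation were the identity.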

Read the paper Generating Diverse High-Fidelity Images with VQ-VAE-2 on arXiv. The project has been open-sourced on GitHub.