8x8 to 32x32 reconstructions using EnhanceGAN

Earlier this month, Google released a paper detailing a method for 16x image super-resolution (by pixel count) from minute amounts of pixel data (8x8 images). By combining a ResNet deep convolutional network with their PixelRNN architecture, the latter of which uses layered Long Short-Term Memory units to generate images pixel by pixel, the researchers were able to both upscale and enhance extremely low-res pictures of celebrities and bedrooms to 32x32 pixels, adding detail where there was none originally. Though it can’t be considered upscaling in the traditional sense, as there is often too little perceptual information to expect pixel-perfect reconstructions, predicting likely high-resolution images from low-resolution pixel data has many obvious potential applications, not to mention the broader potential for generative models dealing with sundry low-dimensional or corrupted data.
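To see why pixel-perfect reconstruction is too much to expect, here is a minimal NumPy sketch (not part of EnhanceGAN itself). It assumes the low-res image is made by averaging 4x4 blocks, which is only one of several possible downsampling schemes, and builds two different 32x32 images that collapse to exactly the same 8x8 image. The function name `downsample` is my own.

```python
import numpy as np

def downsample(img, factor=4):
    """Average non-overlapping factor x factor blocks (assumed downsampling scheme)."""
    h, w = img.shape[:2]
    blocks = img.reshape(h // factor, factor, w // factor, factor, -1)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(0)
high_a = rng.random((32, 32, 3))

# Build a perturbation whose mean over every 4x4 block is zero,
# so it is invisible after block averaging.
noise = rng.normal(scale=0.05, size=(32, 32, 3))
noise -= np.repeat(np.repeat(downsample(noise), 4, axis=0), 4, axis=1)
high_b = high_a + noise  # a different 32x32 image (values may leave [0, 1])

low_a, low_b = downsample(high_a), downsample(high_b)
print(high_a.size // low_a.size)     # ratio of pixel counts between 32x32 and 8x8
print(np.allclose(low_a, low_b))     # do both images share one low-res version?
```

Because many high-res images share one low-res image, a model can only predict a plausible candidate, not recover the original.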

Curious as to whether an adversarial network could produce similar results, I have spent this month working on EnhanceGAN, a three-stage deep deconvolutional network for image super-resolution and enhancement, written in Torch7.

Architecture

EnhanceGAN has three stages of training, similar to the layer-wise training of a deep belief network, with combined adversarial nets and autoencoders in place of Restricted Boltzmann Machines. In simpler terms, I trained three separate models, with Model A taking in 8x8 images from the CelebA dataset and outputting its best 32x32 guess. Model B receives the output from a saved version of Model A and trains as a traditional autoencoder, compressing its 32x32 input images and then decoding them back to their original size. The effect is that Model A lays out the rough, broad strokes of a celeb profile, and Model B then refines the shape and details into something more convincing.

Finally, Model C, consisting of six residual blocks, is placed atop Models A and B, and all three are trained jointly in a final stage of optimization. The residual net sharpens and enhances facial, lighting and environmental detail, making the generated images more photorealistic. All three stages use the ground truth 32x32 images as their targets.
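The training schedule can be sketched as a small table of which models run and which receive updates in each stage. This is an illustrative NumPy sketch, not the Torch7 code: the three models are trivial stand-ins (nearest-neighbour upscaling and identity functions), the names `MODELS`, `STAGES` and `forward` are my own, and freezing Model A during stage two is my reading of "a saved version of Model A".

```python
import numpy as np

def upsample4(x):
    """Nearest-neighbour 4x upscaling, a stand-in for learned deconvolutions."""
    return np.repeat(np.repeat(x, 4, axis=0), 4, axis=1)

# Placeholder models with the input/output shapes described in the post.
MODELS = {
    "A": upsample4,      # 8x8 -> 32x32 first guess
    "B": lambda x: x,    # 32x32 -> 32x32 autoencoder (identity stand-in)
    "C": lambda x: x,    # 32x32 -> 32x32 residual refinement (identity stand-in)
}

# Which models are chained, and which of them are updated, in each stage.
# Every stage is trained against the ground truth 32x32 image.
STAGES = [
    {"name": "stage one",   "chain": ["A"],           "trainable": ["A"]},
    {"name": "stage two",   "chain": ["A", "B"],      "trainable": ["B"]},
    {"name": "stage three", "chain": ["A", "B", "C"], "trainable": ["A", "B", "C"]},
]

def forward(x8, chain):
    """Run the low-res input through the listed models in order."""
    out = x8
    for name in chain:
        out = MODELS[name](out)
    return out

x8 = np.random.default_rng(0).random((8, 8, 3))
shapes = {s["name"]: forward(x8, s["chain"]).shape for s in STAGES}
print(shapes)
```

Every stage produces a 32x32 image, which is what lets all three share the same ground truth target.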

Stage One and Two

Diagram of EnhanceGAN’s stage one and two models

The first two stages/layers share the same architecture except for their input dimensions (thus, the stage two encoder involves more deconvolutions). Deconvolutional layers encode and decode by backpropagating error based on a “perceptual loss” against the ground truth image. To achieve this, I took inspiration from SRGAN and A Neural Algorithm of Artistic Style and trained the autoencoder using mean squared error between features extracted by a VGG19 network with publicly available pretrained weights. Perceptual loss (loss based on feature detection instead of pixel-by-pixel reconstruction error) has been shown to produce more realistic image reconstructions, and my results bear that out.
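As a rough illustration of the difference between the two losses, the NumPy sketch below computes mean squared error once on raw pixels and once on feature maps. The feature extractor here is a toy: a single layer of fixed random 3x3 filters with a ReLU stands in for the pretrained VGG19 features, and the names `toy_features`, `perceptual_loss` and `pixel_loss` are hypothetical.

```python
import numpy as np

def toy_features(img, filters):
    """Stand-in for a fixed feature extractor (the post uses VGG19).
    Applies k x k filters with valid padding, then a ReLU."""
    h, w, _ = img.shape
    k = filters.shape[0]
    out = np.zeros((h - k + 1, w - k + 1, filters.shape[-1]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = img[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, filters, axes=([0, 1, 2], [0, 1, 2]))
    return np.maximum(out, 0.0)

def pixel_loss(generated, target):
    """Plain pixel-by-pixel mean squared error."""
    return float(np.mean((generated - target) ** 2))

def perceptual_loss(generated, target, filters):
    """Mean squared error between feature maps rather than raw pixels."""
    f_gen = toy_features(generated, filters)
    f_tgt = toy_features(target, filters)
    return float(np.mean((f_gen - f_tgt) ** 2))

rng = np.random.default_rng(0)
filters = rng.normal(size=(3, 3, 3, 8))   # fixed: never updated during training
target = rng.random((32, 32, 3))
generated = np.clip(target + rng.normal(scale=0.1, size=target.shape), 0, 1)

print(pixel_loss(generated, target))
print(perceptual_loss(generated, target, filters))
```

The key design point is that the feature extractor stays frozen: gradients flow through it back to the generator, but only the generator's weights change.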

Stage One reconstructions provide high levels of generated detail for Stage Two to then refine

Stage Two reconstructions more accurately predict features of each celebrity, such as head shape, hair color, and facial attributes

On top of the perceptual loss, the encoder and decoder also train with separate adversarial losses: an adversarial autoencoder (AAE) loss for the encoder and a generative adversarial network (GAN) loss for the decoder. The GAN adds information to the face and is a replacement for the PixelRNN stage of the Google model, while the AAE provides stability and nicer image generations, a neat bit of extra functionality for EnhanceGAN.
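One way to combine these terms is sketched below in NumPy. It assumes the standard binary cross-entropy formulation of the adversarial loss with a non-saturating generator objective; the weighting `adv_weight` and all function names are hypothetical, since the post does not state how the terms are balanced.

```python
import numpy as np

def bce(pred, label):
    """Binary cross-entropy between discriminator outputs in (0, 1) and a label."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return float(np.mean(-(label * np.log(pred) + (1 - label) * np.log(1 - pred))))

def discriminator_loss(d_real, d_fake):
    """The discriminator tries to score real samples as 1 and generated ones as 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def adversarial_loss(d_fake):
    """The generator side tries to make the discriminator score its output as 1."""
    return bce(d_fake, 1.0)

def decoder_loss(perceptual, d_fake_image, adv_weight=1e-3):
    """GAN term: the discriminator looks at decoded images. adv_weight is assumed."""
    return perceptual + adv_weight * adversarial_loss(d_fake_image)

def encoder_loss(perceptual, d_fake_code, adv_weight=1e-3):
    """AAE term: the discriminator looks at latent codes. adv_weight is assumed."""
    return perceptual + adv_weight * adversarial_loss(d_fake_code)

# Discriminator scores for a batch of four generated images (made-up numbers).
d_fake_image = np.array([0.2, 0.4, 0.1, 0.3])
print(decoder_loss(perceptual=0.05, d_fake_image=d_fake_image))
```

The two adversarial terms differ only in what the discriminator sees: decoded images for the GAN, latent codes compared against a prior for the AAE.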

Stage Three

The final stage is a refinement layer consisting of six residual blocks, taking in the sigmoid-activated output of the first two stages. There is no upscaling or downscaling, and every block is followed by a ReLU activation except the last. There is no adversarial loss over either decoder in this stage; it is simply an AAE whose total error backpropagates over all three layers.
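A minimal NumPy sketch of such a refinement stack follows. The inside of each block (two 3x3 convolutions with a ReLU between them) and the three-channel width are assumptions borrowed from the usual ResNet layout, not details given above; the function names are my own.

```python
import numpy as np

def conv3x3_same(x, w):
    """3x3 convolution with zero padding, so height and width are unchanged.
    x has shape (H, W, C_in); w has shape (3, 3, C_in, C_out)."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for di in range(3):
        for dj in range(3):
            out += xp[di:di + h, dj:dj + wd, :] @ w[di, dj]
    return out

def residual_block(x, w1, w2, final=False):
    """out = x + F(x), followed by a ReLU unless this is the last block."""
    y = conv3x3_same(np.maximum(conv3x3_same(x, w1), 0.0), w2)
    out = x + y
    return out if final else np.maximum(out, 0.0)

def refine(x, weights):
    """Apply the blocks in order; only the last one skips the trailing ReLU."""
    for idx, (w1, w2) in enumerate(weights):
        x = residual_block(x, w1, w2, final=(idx == len(weights) - 1))
    return x

rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))  # stands in for the sigmoid-activated stage two output
weights = [(rng.normal(scale=0.05, size=(3, 3, 3, 3)),
            rng.normal(scale=0.05, size=(3, 3, 3, 3))) for _ in range(6)]
print(refine(x, weights).shape)
```

Because each block adds a correction to its input rather than replacing it, the stack starts out close to an identity mapping and only has to learn the refinements.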

Although in a previous VAE+GAN version of my model this final layer did not seem to improve results, in EnhanceGAN it makes an obvious and important difference in the quality and realism of the output images.

Stage Three reconstructions reduce the noise in Stage Two reconstructions and provide better color and lighting on the subjects

Conclusions

EnhanceGAN is able to capture an entire celebrity profile, including hair, clothing and occasionally backgrounds, though it has varying degrees of success with side views and faces obscured by objects, limbs or hairstyles, presumably due to the sparsity of training examples. EnhanceGAN’s deconvolutional generations do not seem to achieve the same level of detail as autoregressive and pixel recurrent architectures like PixelRNN (there remains a fair amount of noise, which makes them easily distinguishable from the ground truth images), but they are easier to understand and more efficient to train, generate and scale. Ultimately, this is a relatively simple experiment running on a single AWS instance, but I think it shows the potential for deconvolutional and adversarial nets to efficiently “generate” probable high-resolution data based on noisy or minimal input data.

It will also be very interesting to continue to explore this kind of “deep belief” architecture for autoencoders, where separate models each enhance and refine the output of the previous one, and are only optionally trained together at the end. This allows each model to learn functions unique to a different stage of image development, and would seem a natural fit for generative or creative machine learning, as the human brain and body presumably use a staging process for any kind of generative task. A similar idea is used in StackGAN to produce convincing pictures of birds and flowers based on text embeddings alone.