The training of this super resolution model uses a loss function based on the VGG model’s activations. Unlike the critic in a GAN, this loss function remains fixed throughout training.

The feature map has 256 channels of 28 by 28 activations, which are used to detect features such as fur, an eyeball, wings or a type of material, among many other kinds of feature. For the base loss, the activations at the same layer for the target (original) image and the generated image are compared using either mean squared error (L2) or mean absolute error (L1); these are the feature losses. This error function uses the L1 error.

This allows the loss function to know which features are present in the target/ground truth image and to evaluate how well the features of the model’s prediction match them, rather than only comparing pixel differences.
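Below is a minimal sketch of such a feature loss in PyTorch, assuming torchvision’s pretrained VGG-16 with batch norm; the layer indices are illustrative, the gram matrix term mentioned later is omitted for brevity, and a simple pixel-level L1 term stands in for the base pixel loss.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16_bn

class FeatureLoss(torch.nn.Module):
    def __init__(self, layer_ids=(12, 22, 32)):
        super().__init__()
        vgg = vgg16_bn(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)  # unlike a GAN critic, the loss network never updates
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, pred, target):
        loss = F.l1_loss(pred, target)  # base pixel loss
        for fp, ft in zip(self._features(pred), self._features(target)):
            loss = loss + F.l1_loss(fp, ft)  # compare activations at the same layer
        return loss
```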

Training details

The training process begins with the model described above: a U-Net based on the ResNet-34 architecture pretrained on ImageNet, trained with a loss function based on the VGG-16 architecture (also pretrained on ImageNet) combined with pixel loss and a gram matrix.
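As a sketch, such a learner could be assembled with fastai v1 roughly as follows, assuming a hypothetical `data` DataBunch of low/high resolution image pairs and the `FeatureLoss` sketched earlier; the keyword arguments are illustrative rather than the exact configuration used here.

```python
from fastai.vision import *  # fastai v1

learn = unet_learner(data, models.resnet34,
                     loss_func=FeatureLoss(),
                     blur=True, norm_type=NormType.Weight)
```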

Training data

With super resolution it is fortunate that in most applications an almost infinite amount of data can be created as a training set. If a set of high resolution images is acquired, these can be encoded/resized into smaller images, giving a training set of low resolution and high resolution image pairs. The prediction from our model can then be evaluated against the high resolution image.

The low resolution image is initially a copy of the target/ground truth image at half the dimensions. It is then upscaled using bilinear interpolation to match the dimensions of the target image before being input into the U-Net based model.
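A minimal sketch of creating one such pair with PIL, assuming a hypothetical ground_truth.jpg; the choice of bicubic resampling for the downscale is an assumption, while the bilinear upscale follows the text:

```python
from PIL import Image

hr = Image.open("ground_truth.jpg").convert("RGB")  # hypothetical file
w, h = hr.size
lr = hr.resize((w // 2, h // 2), Image.BICUBIC)  # half-dimension copy (resampling assumed)
lr_up = lr.resize((w, h), Image.BILINEAR)        # bilinear upscale back to target size
# (lr_up, hr) now form one input/target training pair for the U-Net
```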

The degradations applied in this method of creating the training data are exactly what the model learns to reverse.

The training data can be further augmented by the following (a minimal sketch of the random quality and noise degradations appears after this list):

Randomly reducing the quality of the image within bounds

Taking random crops

Flipping the image horizontally

Adjusting the lighting of the image

Adding perspective warping

Randomly adding noise

Randomly punching small holes into the image

Randomly adding overlaid text or symbols
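A minimal sketch of the quality reduction and noise steps with PIL and NumPy; the bounds and distributions are illustrative assumptions:

```python
import io, random
import numpy as np
from PIL import Image

def degrade(img: Image.Image) -> Image.Image:
    # Randomly reduce the JPEG quality within bounds
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=random.randint(10, 70))
    buf.seek(0)
    img = Image.open(buf).convert("RGB")
    # Randomly add Gaussian noise
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, random.uniform(0.0, 10.0), arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```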

The images below are examples of data augmentation; all were generated from the same source image:

Example of data augmentation

Randomising the quality reduction and noise for each image improved the resulting model, allowing it to learn to correct all of these different forms of image degradation and to generalise better.

Feature and quality improvement

The U-Net based model enhances the details and features in the upscaled image, generating an improved image through a function containing approximately 40 million parameters.

Training the head and the backbone of the model

Three methods in particular help the training process here: progressive resizing, freezing and then unfreezing the gradient descent updates of the weights in the backbone, and discriminative learning rates.

The model’s architecture is split into two parts, the backbone and the head.

The backbone is the left-hand section of the U-Net: the encoder/down-sampling part of the network, based on ResNet-34. The head is the right-hand section of the U-Net: the decoder/up-sampling part of the network.

The backbone has pretrained weights from ResNet-34 trained on ImageNet; this is the transfer learning.

The head’s weights need training, as these layers are randomly initialised, before the network can produce the desired output.

At the very start the output from the network is essentially random changes of pixels, apart from the contribution of the pixel shuffle sub-convolutions with ICNR initialisation used as the first step in each upscale in the decoder/up-sampling path of the network.
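A minimal sketch of sub-pixel upscaling with ICNR initialisation in PyTorch; the channel counts are illustrative. ICNR fills the convolution weights so that each group of sub-pixels starts with identical values, making the layer begin as a nearest-neighbour-style upsampler rather than producing checkerboard noise.

```python
import torch
import torch.nn as nn

def icnr_(weight, scale=2):
    # Initialise a (out_ch, in_ch, k, k) conv weight so that, after
    # PixelShuffle, each group of scale^2 sub-pixels shares one value.
    out_ch, in_ch, k1, k2 = weight.shape
    sub = torch.empty(out_ch // scale ** 2, in_ch, k1, k2)
    nn.init.kaiming_normal_(sub)
    weight.data.copy_(sub.repeat_interleave(scale ** 2, dim=0))

conv = nn.Conv2d(64, 64 * 4, kernel_size=3, padding=1)  # 4 = scale^2
icnr_(conv.weight)
upscale = nn.Sequential(conv, nn.PixelShuffle(2))  # doubles height and width
```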

Training the head on top of the backbone then allows the model to learn to do something different with the pretrained knowledge held in the backbone.

Freeze the backbone, train the head

The weights in the backbone of the network are frozen so that only the weights in the head are initially being trained.
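In fastai v1 this is a single call on the learner (a sketch, assuming the `learn` object built earlier):

```python
learn.freeze()  # gradient updates now apply only to the head
```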

A learning rate finder is run for 100 iterations and plots a graph of loss against learning rate. A point around the steepest downward slope towards the minimum loss is selected as the maximum learning rate. Alternatively, a rate ten times smaller than the rate at the lowest point can be tried, to see if it performs any better.
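A sketch of running the learning rate finder in fastai v1 (its default is 100 iterations):

```python
learn.lr_find()
learn.recorder.plot()  # loss against learning rate; pick near the steepest descent
```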

Learning rate against loss, optimal slope with backbone frozen

The fit one cycle policy is used to vary the learning rate and momentum. It is described in detail in Leslie Smith’s paper https://arxiv.org/pdf/1803.09820.pdf and in Sylvain Gugger’s post https://sgugger.github.io/the-1cycle-policy.html
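A sketch of one such training cycle in fastai v1; the epoch count and maximum learning rate are illustrative:

```python
learn.fit_one_cycle(10, max_lr=1e-2)  # LR ramps up then anneals; momentum varies inversely
```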

Progressive resizing

It’s faster to train on large numbers of smaller images initially and then scale up the network and the training images. Upscaling and improving a 64px by 64px image to 128px by 128px is a much easier task than performing that operation on a larger image, and much quicker across a large dataset. This is called progressive resizing; it also helps the model generalise better, as it sees many more different images and is less likely to overfit.

This progressive resizing approach is based on excellent research from Nvidia on progressive GANs: https://research.nvidia.com/sites/default/files/pubs/2017-10_Progressive-Growing-of/karras2018iclr-paper.pdf . It was also the approach Fastai used to beat the tech giants at training on ImageNet: https://www.fast.ai/2018/08/10/fastai-diu-imagenet/

The process is to train with small images in larger batches; then, once the loss has decreased to an acceptable level, a new model is created that accepts larger images, transferring the learning from the model trained on smaller images.

As the training image size increases, the batch size has to be decreased to avoid running out of memory, since each batch contains larger images with four times as many pixels each time the dimensions are doubled.
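A sketch of the progressive resizing loop for the frozen stages in fastai v1, assuming a hypothetical get_data(bs, size) helper that rebuilds the low/high resolution DataBunch at each target size; the sizes, batch sizes and learning rates follow the steps below:

```python
for size, bs, lr in [(64, 64, 1e-2), (128, 16, 2e-2)]:
    learn.data = get_data(bs, size)  # larger images, smaller batches
    learn.freeze()                   # backbone weights stay frozen
    learn.fit_one_cycle(10, max_lr=lr)
# the final 256px stage uses discriminative learning rates (see step 3)
```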

Note that the defects in the input images were added randomly to improve the restorative properties of the model and to help it generalise better.

Examples from the validation set (kept separate from the training set) are shown here at some of the progressive sizes:

At each image size, one cycle of 10 epochs of training is carried out, with the backbone weights frozen.

The image size is then doubled and the model is updated to handle the additional grid sizes the larger images produce on their path through the network. It’s important to note that the number of weights does not change.
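A small PyTorch check that convolutional weights are size-agnostic: the output grid grows with the input while the parameter count stays fixed.

```python
import torch

conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
n_params = sum(p.numel() for p in conv.parameters())
for size in (64, 128, 256):
    out = conv(torch.randn(1, 3, size, size))
    print(size, tuple(out.shape), n_params)  # same n_params at every size
```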

Step 1: upscale from 32 pixels by 32 pixels to 64 pixels by 64 pixels. A learning rate of 1e-2 was used.

Super resolution to 64px by 64px on a 32px by 32px image from the validation set. Left: low resolution input; middle: the super resolution model’s prediction; right: target/ground truth.

Step 2: upscale from 64 pixels by 64 pixels to 128 pixels by 128 pixels. A learning rate of 2e-2 was used.

Super resolution to 128px by 128px on a 64px by 64px image from the validation set. Left: low resolution input; middle: the super resolution model’s prediction; right: target/ground truth.

Step 3: upscale from 128 pixels by 128 pixels to 256 pixels by 256 pixels. Discriminative learning rates between 3e-3 and 1e-3 were used.
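A sketch of this final stage in fastai v1, assuming the backbone is unfrozen at this point (as the freeze/unfreeze method above suggests); slice spreads the rates across the layer groups, with the earlier (backbone) groups getting the lower rate:

```python
learn.unfreeze()
learn.fit_one_cycle(10, max_lr=slice(1e-3, 3e-3))  # 1e-3 for early layers up to 3e-3 for the head
```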