Image-to-image translation is a well-known problem that has been widely researched in deep learning.

For those who are unfamiliar with image-to-image translation, it is an approach to translating an image from one domain into another, e.g. Day ➝ Night, Black-and-White Image ➝ Color Image, Sketch ➝ Image, etc.

One particular problem in image-to-image translation is Style Transfer, wherein the style of one image (style-image) is transferred to another (content-image).

However, unlike other image-to-image translation problems that use a per-pixel loss as the objective function, style transfer is difficult to measure with such a loss.

So, how do we decide the loss function for the Style Transfer problem?

A Neural Algorithm of Artistic Style proposed using a Convolutional Neural Network (VGG-19) trained on object recognition to calculate the objective loss for Style Transfer.

The main idea of this paper is that the style and content of an image can be represented separately in Convolutional Neural Networks. This allows us to combine the style representation of one image (style-image) and content representation of another image (content-image) to generate a new style-transferred image.

Style and Content reconstructions using the VGG-19 network. (arXiv:1508.06576)

What do content and style representations really mean?

Content Representation

Higher layers in the network capture the high-level content in terms of objects and their arrangement in the input image but do not constrain the exact pixel values of the reconstruction. In contrast, reconstructions from the lower layers simply reproduce the exact pixel values of the original image.

Content representation of the higher layer (arXiv:1508.06576v2)

We know that in a CNN trained for object recognition, each layer of the network learns a representation of the image, and these representations become more specific as we go deeper into the network. For example, the initial layers learn to detect edges and contours, while the higher layers learn to detect objects. This means that image content is better represented in the higher layers of the CNN, while the lower layers mostly preserve the exact pixel values. We use this content representation for calculating the Content Loss.
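To make this concrete, here is a minimal sketch of how intermediate VGG-19 features can be grabbed with torchvision. The layer index used below is purely illustrative (a layer in the fourth convolutional block), not necessarily the exact layer from the paper.

```python
import torch
import torchvision.models as models

# Load a VGG-19 pre-trained on ImageNet and freeze it; it is only used
# as a fixed feature extractor for computing the losses.
vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def extract_layer(x, layer_idx=21):
    # Run the image through the convolutional stack and return the output
    # of one chosen layer (index 21 is illustrative, from the conv4 block).
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i == layer_idx:
            return x
```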

Content Loss

In style transfer, we need the content representations of the content-image and the generated-image to be the same. Let’s suppose that the output of the jth layer of the CNN is given by ϕⱼ(x). The Content Loss is simply the (normalized, squared) Euclidean distance between the content representations of the content-image and the generated-image from a particular layer and is calculated as

Content Loss for the jth layer of the VGG network.

Note: Cⱼ, Hⱼ, Wⱼ represent the channels, height, and width, respectively, of the output of the jth layer.
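As a rough sketch (assuming the feature maps for both images have already been extracted from the chosen layer, e.g. with the helper above), the Content Loss can be computed as:

```python
import torch.nn.functional as F

def content_loss(phi_gen, phi_content):
    # phi_gen, phi_content: feature maps of shape (B, C_j, H_j, W_j) from the
    # j-th layer, for the generated image and the content image respectively.
    _, c, h, w = phi_content.shape
    # Squared Euclidean distance, normalized by the size of the feature map.
    return F.mse_loss(phi_gen, phi_content, reduction='sum') / (c * h * w)
```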

Style Representation

To obtain a representation of the style of an input image, we use a feature space originally designed to capture texture information.

Starry night style representation from one layer. (arXiv:1508.06576v2)

Every layer of the Convolutional Neural Network produces a feature map as its output. For a CNN trained on object recognition, each channel in the feature map represents some aspect of the image, e.g. edges, circles, spirals, etc. There exist correlations between the different channels of these feature maps. Taking these correlations into account from multiple layers, we obtain a multi-scale representation of the input image that captures its texture.

Style Loss

Using this technique, we can obtain the style representation of any image. Now, to perform a proper style transfer, we need the style representation of the generated image and the style representation of the reference style-image to be the same. So the distance between these two style representations can be used as a loss which we need to minimize.

But how do we calculate the correlations and the distance between correlations?

Gram matrices can be used to calculate the correlations between the different channels of a feature map. Let’s suppose that the output of a layer of the CNN is given by ϕ(x). Then the Gram matrix for the feature map can be calculated as

Gram matrix of the jth layer of the CNN.
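In code, one common way to compute the Gram matrix for a batch of feature maps (a sketch, with the normalization by feature-map size taken as in the perceptual-loss paper) looks like this:

```python
import torch

def gram_matrix(phi):
    # phi: feature map of shape (B, C, H, W) from one layer of the CNN.
    b, c, h, w = phi.shape
    features = phi.view(b, c, h * w)                      # flatten the spatial dimensions
    gram = torch.bmm(features, features.transpose(1, 2))  # (B, C, C) channel correlations
    return gram / (c * h * w)                             # normalize by feature-map size
```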

Now that we know how to calculate the correlations between features, we can move on to the Style Loss. This loss is simply the distance between the Gram matrices of the style-image (yₛ) and the generated-image (ŷ). Note that the paper uses the squared Frobenius norm of the difference between the Gram matrices, which is just the squared Euclidean distance once the matrices are flattened into vectors. Since we are using multiple layers, we sum the distance over all the layers.

Style Loss for the jth layer of the VGG network.
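Using the gram_matrix helper above, a minimal style-loss sketch that sums the squared Frobenius norm over the chosen layers might look like:

```python
def style_loss(gen_feats, style_feats):
    # gen_feats / style_feats: lists of feature maps, one per chosen VGG layer,
    # for the generated image and the style image respectively.
    loss = 0.0
    for phi_gen, phi_style in zip(gen_feats, style_feats):
        g_gen, g_style = gram_matrix(phi_gen), gram_matrix(phi_style)
        loss = loss + torch.sum((g_gen - g_style) ** 2)   # squared Frobenius norm
    return loss
```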

Content Loss and Style Loss together are termed the Perceptual Loss. To be exact, the Perceptual Loss is the weighted sum of the Content Loss and the Style Loss.

The weighted sum of Content Loss and Style Loss
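Putting the two together, the Perceptual Loss is just a weighted sum. The weights below are hypothetical placeholders; the actual balance is a hyperparameter you tune (content is usually weighted much lower than style).

```python
CONTENT_WEIGHT = 1.0      # hypothetical values; tune per style image
STYLE_WEIGHT = 1e5

def perceptual_loss(gen_content, target_content, gen_style, target_style):
    # gen_content / target_content: single content-layer feature maps.
    # gen_style / target_style: lists of style-layer feature maps.
    return (CONTENT_WEIGHT * content_loss(gen_content, target_content)
            + STYLE_WEIGHT * style_loss(gen_style, target_style))
```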

Perceptual Optimization

Using the losses we discussed earlier, we now need to set up an algorithm to generate the style-transferred image. We use perceptual optimization for this task.

Perceptual optimization takes as input a white-noise image, the style-image, and the content-image. Our aim is to update the white-noise image in such a way that it matches the style representation of the style-image and the content representation of the content-image. To do this, we calculate the Perceptual Loss, i.e. the Style Loss between the white noise and the style-image (for matching the style representation) and the Content Loss between the white noise and the content-image (for matching the content representation). Once we have the loss, we backpropagate to calculate the gradients and update the white-noise image. We repeat this until the perceptual loss converges to a minimum.

Note: Instead of white noise, we can also initialize with the content-image.

A visualization for perceptual optimization.
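Here is a rough sketch of the optimization loop, reusing the loss functions above. extract_features is a hypothetical helper that runs the frozen VGG and returns the content-layer features and the list of style-layer features for an image, and content_image / style_image are assumed to be preprocessed image tensors.

```python
import torch

generated = torch.randn_like(content_image, requires_grad=True)   # start from white noise
optimizer = torch.optim.LBFGS([generated])

with torch.no_grad():
    target_content, _ = extract_features(content_image)   # hypothetical helper, see lead-in
    _, target_style = extract_features(style_image)

def closure():
    optimizer.zero_grad()
    gen_content, gen_style = extract_features(generated)
    loss = perceptual_loss(gen_content, target_content, gen_style, target_style)
    loss.backward()
    return loss

for _ in range(50):               # each LBFGS step evaluates the closure several times
    optimizer.step(closure)       # stop once the perceptual loss plateaus
```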

PyTorch has a very good tutorial on perceptual optimization for neural style transfer. Do check it out!

The drawback of this algorithm is that we need to perform perceptual optimization from scratch every time for a new image. This is not efficient.

So, how can we make the process faster?

Transformer Network

Instead of updating the white noise, we can train a network on the perceptual loss. This idea was introduced in the paper Perceptual Losses for Real-Time Style Transfer and Super-Resolution, which used an image transformation network (referred to here as the transformer network) and trained it on the perceptual loss calculated using a pre-trained VGG-16 network.

Training Transformer Network using Perceptual Loss (arXiv:1603.08155)

With this approach, we need to train the network only once. Once we have trained the transformer network, we can use it for style transfer. This takes considerably less time compared to the previous approach.
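A minimal training sketch might look like the following, assuming transformer is the feed-forward transformer network, dataloader yields batches of content images (e.g. MS-COCO), and extract_features / perceptual_loss are the helpers sketched earlier:

```python
optimizer = torch.optim.Adam(transformer.parameters(), lr=1e-3)

with torch.no_grad():
    _, target_style = extract_features(style_image)    # style targets are fixed for one style

for content_batch in dataloader:
    generated = transformer(content_batch)             # one forward pass, no optimization loop
    with torch.no_grad():
        target_content, _ = extract_features(content_batch)
    gen_content, gen_style = extract_features(generated)
    loss = perceptual_loss(gen_content, target_content, gen_style, target_style)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

At inference time, stylizing a new image is then just a single forward pass through the trained transformer network.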

However, the network can be trained for only one style-image. We need to train a new network from scratch for a different style-image, and training a completely new network is a time-consuming task.

Is it possible for a single network to learn all styles?

Yes, the answer is Conditional Instance Normalization.

Multiple Style Transfer Network

The paper, A Learned Representation For Artistic Style mentions that many styles probably share some degree of computation and that this sharing is thrown away by training N networks from scratch.

In order to train a single model on multiple styles, we need to have a conditional network.

But where should we put our condition?

The normalization layer can be used to integrate our condition. Before we go into integrating the condition, let’s review what the normalization layer does.

Normalization

The normalization layer takes the output features of the previous convolutional layer, calculates their mean (μ) and standard deviation (σ), and standardizes those features. The standardized features are then scaled and translated using the learnable weights γ and β.

Normalization
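For style transfer, the normalization in question is instance normalization, where μ and σ are computed per sample and per channel over the spatial dimensions. A bare-bones sketch:

```python
def instance_norm(x, gamma, beta, eps=1e-5):
    # x: feature map of shape (B, C, H, W); gamma, beta: learnable vectors of shape (C,).
    mu = x.mean(dim=(2, 3), keepdim=True)       # per-sample, per-channel mean
    sigma = x.std(dim=(2, 3), keepdim=True)     # per-sample, per-channel std
    x_hat = (x - mu) / (sigma + eps)            # standardize the features
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)   # scale and shift
```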

Conditional Instance Normalization

Now that we know how normalization works, we can move forward. It was found that we can use different learnable weights for different styles. Having a separate set of learnable weights for each style makes it possible to condition the network: by using different γ and β for each style, we are able to learn each style individually.

Conditional Instance Normalization. The subscript “s” is the condition.
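A sketch of conditional instance normalization as a PyTorch module: there is one (γ, β) pair per style, and the style index s selects which pair to apply after standardization.

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm(nn.Module):
    def __init__(self, num_channels, num_styles):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)   # standardization only
        self.gamma = nn.Parameter(torch.ones(num_styles, num_channels))
        self.beta = nn.Parameter(torch.zeros(num_styles, num_channels))

    def forward(self, x, style_idx):
        x_hat = self.norm(x)                                 # standardize the features
        gamma = self.gamma[style_idx].view(1, -1, 1, 1)      # pick this style's scale
        beta = self.beta[style_idx].view(1, -1, 1, 1)        # and shift
        return gamma * x_hat + beta
```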

Since normalization only scales and translates the features, training an N-style transfer model requires fewer parameters than training N separate networks from scratch.

The perceptual quality of the results is similar to that of single-style transfer networks.

Style transfer result for different styles.

Apart from being able to perform multiple style transfers, the network also performs well on video input and provides results in real time.

Learning about neural style transfer helped me understand what each layer in a Convolutional Neural Network does. Moreover, the techniques used for calculating the loss gave me a good sense of how texture can be represented in neural networks. I hope that this post gives you some intuition for how style transfer networks work and explains the reasoning behind using the Perceptual Loss function.

If you are interested and want to see how style transfer works, I have an implementation on GitHub. Please check it out! neural-style-transfer.

Please feel free to share your thoughts regarding the content, as this is my first ever post, and help me improve it. Thank you.

References