If you are not familiar with style transfer, you can read this blog post, which summarizes it.

As a participant in Jeremy Howard's part 2 course, "Cutting Edge Deep Learning for Coders", I wanted to build a style transfer neural network that performs on par with Prisma (both in terms of the time to generate an image and the quality of the image). The project involved many interesting challenges (image quality, inference time).

I first conducted many experiments to see how the quality of the generated image could be improved. The following two blog posts contain the experiments in extensive detail:

This blog post summarizes what I learned from those experiments and adds some more practical tricks for making style transfer work.

Style transfer in a practical setting

The above two blog posts used the image optimization based method for style transfer, i.e. the image pixels were learned by minimizing the loss. In a practical setting, say you are building an app like Prisma, this technique is not useful because it is very slow. So instead a neural network based method is used, wherein a neural network is trained to generate a stylized image from a content image. These were the times taken for a 1000x1000 image on an AWS p2.xlarge instance using both methods:

Optimization based method (NVIDIA K80 GPU): 172.21 secs
Neural network based method (NVIDIA K80 GPU): 1.53 secs
Neural network based method (Intel Xeon E5-2686 v4 CPU): 14.32 secs

Clearly, the neural network based method is the winner here. It is just an approximation of the optimization based method, but it is much faster.

For a detailed understanding of how the neural network based method works, you can read these papers:

The optimization based method is described in:

Improving the quality of the generated image

One thing that makes deep learning notoriously hard is the search for optimal hyperparameters. Style transfer involves several of them: which content layers to use for calculating the content loss, which style layers to use for calculating the style loss, the content weight (the constant multiplying the content loss), the style weight (the constant multiplying the style loss), and the tv weight (the constant multiplying the total variation loss). Thankfully, finding optimal hyperparameters is much easier in style transfer.

The simple trick is:

Instead of using the neural network based method for training and finding optimal hyperparameters, you can simply use the optimization based method on your test images to create their stylized versions and cross-validate.

Following are some of the things I learned from my experiments about how to use the image optimization based method to find stylization hyperparameters.

1. Making output images smoother

Total variation loss is the sum of the absolute differences between neighboring pixel values in the input image. It measures how much noise is in the image.

As shown in experiment 10 in this blog post, adding total variation loss to the training loss removes the rough texture of the image and the resulting image looks much smoother.
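For reference, here is a minimal sketch of total variation loss in TensorFlow, assuming a batch of images shaped [batch, height, width, channels] (tf.image.total_variation is the built-in equivalent):

```python
import tensorflow as tf

def total_variation_loss(images):
    # images: float tensor of shape [batch, height, width, channels].
    # Sum of absolute differences between each pixel and its neighbors
    # below and to the right.
    vertical = tf.abs(images[:, 1:, :, :] - images[:, :-1, :, :])
    horizontal = tf.abs(images[:, :, 1:, :] - images[:, :, :-1, :])
    return tf.reduce_sum(vertical) + tf.reduce_sum(horizontal)
```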

2. Choosing style layers

In most published work (including the seminal paper), a few layers are chosen from the network for calculating the style loss and are given equal weights. As I show in experiment #1 in this blog post, this doesn't make sense, because the contributions of the style loss from different layers can differ by orders of magnitude. So what I did instead was to first generate an output using each of these layers, one at a time, and then create a combined image by giving a weight to each layer. The weights can again be found via cross-validation.
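As a sketch, the per-layer weighting might look like this in TensorFlow (the Gram-matrix normalization and the dictionary layout are my assumptions; the weights come from cross-validation):

```python
import tensorflow as tf

def gram_matrix(features):
    # features: [batch, height, width, channels] activations from one layer.
    gram = tf.einsum('bhwc,bhwd->bcd', features, features)
    size = tf.cast(tf.reduce_prod(tf.shape(features)[1:]), tf.float32)
    return gram / size  # one common normalization; conventions vary

def style_loss(generated_feats, style_grams, layer_weights):
    # generated_feats: dict of layer name -> activations of the generated image
    # style_grams: dict of layer name -> precomputed Gram matrix of the style image
    # layer_weights: dict of layer name -> weight found via cross-validation
    loss = 0.0
    for name, weight in layer_weights.items():
        diff = gram_matrix(generated_feats[name]) - style_grams[name]
        loss += weight * tf.reduce_mean(tf.square(diff))
    return loss
```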

3. Choosing content layers

Unlike the case with style layers, choosing a combination of content layers doesn't make much of a difference, because optimizing the content loss from one content layer optimizes it for all subsequent layers as well. Which layer to use can again be found via cross-validation (as shown in experiment #2 of this blog post).
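The content loss itself is just a feature-map distance at the chosen layer; a minimal sketch (the layer name is a hypothetical placeholder):

```python
import tensorflow as tf

def content_loss(generated_feats, content_feats, layer='conv3_3'):
    # Mean squared error between the feature maps of the generated image
    # and the content image at one content layer.
    diff = generated_feats[layer] - content_feats[layer]
    return tf.reduce_mean(tf.square(diff))
```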

4. Modifying the loss network to improve convergence

VGG-16/VGG-19 networks (pretrained on ImageNet) are generally used to calculate the content loss and style loss. As I show in experiment 8 of this blog post, replacing max pooling with average pooling significantly decreases the time taken to converge.

The other change I tried, replacing SAME padding with VALID padding, did not make much of a difference.
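A minimal Keras sketch of the pooling swap, rebuilding VGG-16 with average pooling while reusing the pretrained conv weights (my reconstruction, not the exact code from the experiment):

```python
import tensorflow as tf

def vgg16_avg_pool():
    # Load VGG-16 pretrained on ImageNet, without the classifier head.
    vgg = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
    inp = tf.keras.Input(shape=(None, None, 3))
    x = inp
    for layer in vgg.layers[1:]:  # skip the original input layer
        if isinstance(layer, tf.keras.layers.MaxPooling2D):
            # Swap max pooling for average pooling; conv weights are untouched.
            x = tf.keras.layers.AveragePooling2D(pool_size=2, strides=2)(x)
        else:
            x = layer(x)
    return tf.keras.Model(inp, x)
```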

5. Size and cropped region of style image

As I show in experiment 4 of this blog post, the size of the style image matters. If you use a style image of a smaller size, the network cannot extract enough style information and the generated images are of poor quality. From my experiments, I would say the style image should be at least 512 pixels on each side.

As shown in experiment 11 of this blog post, the cropped region of the style image also matters; it should be chosen as the region containing the brushstrokes we want in our stylized image.
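Putting both points together, preparing the style image might look like this (a sketch using Pillow; the 512-pixel floor comes from the experiments above, and the crop box is whatever region you pick by eye):

```python
from PIL import Image

def prepare_style_image(path, min_side=512, crop_box=None):
    # Up-scale so the shorter side is at least `min_side`, preserving the
    # aspect ratio, then optionally crop to the region whose brushstrokes
    # we want transferred.
    img = Image.open(path).convert('RGB')
    scale = min_side / min(img.size)
    if scale > 1.0:
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.BILINEAR)
    if crop_box is not None:  # (left, upper, right, lower) in pixels
        img = img.crop(crop_box)
    return img
```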

6. Choosing content, style weight and tv weight

These weights can only be found via cross-validation. But here are a couple of rules that I followed and that worked well.

When you start training, the style loss first decreases rapidly to some small value; then the content loss starts decreasing, while the style loss either decreases much more slowly or fluctuates. This behavior of the content and style losses can be seen in experiment #6 of this blog post. The content weight should be such that the content loss is significantly greater than the style loss at this point; otherwise the network will not learn any of the content.

The tv weight should be such that the tv loss is about half the content loss in the later iterations of training. If the tv loss is very high, colors from neighboring objects in the image get mingled and the image is of poor quality.
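Putting the rules together, the training loss is just a weighted sum; the numbers below are hypothetical starting points, to be replaced via cross-validation:

```python
# Hypothetical placeholders; find the real values via cross-validation.
CONTENT_WEIGHT = 1.0   # keep content loss well above style loss once style loss flattens
STYLE_WEIGHT = 1e-2
TV_WEIGHT = 1e-4       # aim for tv loss around half the content loss in later iterations

def total_training_loss(content_l, style_l, tv_l):
    return CONTENT_WEIGHT * content_l + STYLE_WEIGHT * style_l + TV_WEIGHT * tv_l
```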

7. Choosing loss network

Experiments #3, #4, #5 and #6 of this blog post detail how different kinds of images can be generated by using different Inception networks. The images generated using Inception networks look more like crayon paintings.

Experiment #5 of this blog post compares the images generated using VGG-16 and VGG-19. My guess is that images generated using VGG-16 can be produced with VGG-19 too, just by adjusting the content and style weights, because the pastiches generated from the two networks were not that different. Images generated using VGG networks look more like oil paintings.

So if you are targeting an oil painting texture, use VGG-16 or VGG-19. If you are targeting a crayon texture, use one of the Inception networks (chosen after cross-validation).

Improving training and inference times

Say you are trying 10 different architectures for style transfer to find out which works best. It becomes very important to speed up training so you can cross-validate your models quickly.

If you are competing with Prisma, you need to ensure that the inference time of your network is comparable to theirs.

Here are the tricks I used to improve the training and inference times.

1. Down-sample the training image

The simplest thing to do is to down-sample the image, say from 512x512 to 256x256, and then test how good the generated images look. This decreases the training time by a factor of 4 and can safely be used for finding the optimal network architecture, because the quality of images generated at 512 is quite similar to the quality at 256.

2. Use depth-wise separable convolution in place of convolution

This is the most standard trick for improving inference time: convolution layers are replaced by depth-wise separable convolutions, which decreases the computation significantly. Below are the CPU and GPU inference times for images with convolution and with depth-wise separable convolution on an AWS p2.xlarge instance.

inference times using convolution (in secs)

inference times using separable convolution (in secs)

As can be observed, inference times roughly get halved on CPU and decrease by 5-6 times on GPU. There is no visible difference in the quality of the generated images.
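In Keras, the swap is a one-line change per layer; the channel counts below are illustrative:

```python
import tensorflow as tf

# Standard 3x3 convolution, 128 -> 128 channels:
# about 3*3*128*128 = 147,456 multiply-adds per output position.
conv = tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu')

# Depth-wise separable version: one 3x3 filter per input channel, then a
# 1x1 point-wise convolution: about 3*3*128 + 128*128 = 17,536 per position.
sep_conv = tf.keras.layers.SeparableConv2D(128, 3, padding='same', activation='relu')
```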

3. Down-sample the image, and add an up-sampling layer followed by a convolution layer at the end

I tried many non-standard architectural changes to decrease the inference time. One was to reduce the number of layers in the network. Another was to factorize the convolution operation (converting a filter_size*filter_size*in_depth*out_depth convolution into a filter_size*filter_size*in_depth*k convolution followed by a 1*1*k*out_depth convolution, where k << out_depth). Another was to reduce the filter_size of the computationally expensive convolution layers. None of these produced beautiful stylized images, and they didn't help much with time either.

The trick that worked was to halve the height and width of the image by bi-linear interpolation just before feeding it into the neural network, and to add an up-sampling layer followed by a convolution layer at the end. This way, you are training your neural network to learn both style transfer and super-resolution.
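A sketch of the wiring in Keras (transform_body stands in for whatever stylization network you already use; the 9x9 output convolution and the sigmoid are my assumptions, not the exact architecture from the experiments):

```python
import tensorflow as tf

def with_resize_trick(transform_body):
    # transform_body: the existing image-transformation network (hypothetical).
    inp = tf.keras.Input(shape=(None, None, 3))
    # Halve height and width by bilinear interpolation before the expensive layers.
    x = tf.keras.layers.Lambda(
        lambda t: tf.image.resize(t, tf.shape(t)[1:3] // 2, method='bilinear'))(inp)
    x = transform_body(x)
    # Learned up-sampling back to full resolution: the network also learns
    # a form of super-resolution.
    x = tf.keras.layers.UpSampling2D(size=2, interpolation='bilinear')(x)
    out = tf.keras.layers.Conv2D(3, 9, padding='same', activation='sigmoid')(x)
    return tf.keras.Model(inp, out)
```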

Why does it help? The computation in all intermediate layers is reduced to a quarter of the previous method's. The extra computation comes from the one additional layer and the bilinear interpolation, but these don't add much, so the net time is much lower. These are the CPU and GPU inference times for images on an AWS p2.xlarge instance.

inference times using baseline architecture (in secs)

inference times using the resize trick (in secs)

As we can see, this roughly halves the CPU inference time, and it also decreases the GPU inference time for most images.

Combining this with separable convolution trick gives:

inference times using separable convolution and resize trick (in secs)

As we can see, using both tricks together decreases the CPU inference times of all images by 3 times and the GPU inference times by 8-10 times. I also ran the same neural network on a phone: it took 6.5 seconds for a 1000x1000 image. For comparison, Prisma takes 9 seconds for the same image.

And what’s even more cool? These two tricks apply to any image generation method, not just style transfer.

Also, the quality of images generated is the same: