Have you ever woken up in the middle of the night and wondered whether Gradient Descent, Adam or Limited-memory Broyden–Fletcher–Goldfarb–Shanno will optimize your style transfer neural network faster and better?

Me neither, and I didn’t know what half of these words meant until last week. But as I’m participating in part 2 of the excellent Practical Deep Learning For Coders, and nudged by the man behind the course, Jeremy Howard, I figured I’d explore which optimizer is best for this task.

NB: Some knowledge of Convolutional Neural Networks is assumed. I highly recommend part 1 of the course, which is available online for free. It’s the best way to get your feet wet in machine learning (ML).

What is Style transfer and how does it work?

Let’s start with some of the basics, partly because I was a little unclear on these myself prior to writing this. If you are already familiar with style transfer, feel free to skim or skip this section.

Q: Um, what is style transfer?

A: It’s what apps like Prisma and Lucid are doing. Basically, it extracts the style of an image (usually a famous painting) and applies it to the contents of another image (usually supplied by the user).

The style of a painting is: the way the painter used brush strokes; how those strokes form objects; the texture of the objects; the color palette used.

The content of an image is the set of objects present in it (person, face, dog, eyes, etc.) and their relationships in space.

Here is an example of style transfer:

Landscape (content) + Scream (style)

Q: But how do I separate the style and content of an image?

A: Using convolutional neural networks (CNNs). Since AlexNet successfully applied CNNs to object recognition (figuring out what object is in an image) and dominated the most popular computer vision competition in 2012, CNNs have been the most popular and effective method for object recognition. They recognize objects by learning layers of filters that build on previous layers. The first layers learn to recognize simple patterns, for example an edge or a corner. Intermediate layers might recognize more complex patterns like an eye or a car tire, and so on. Jason Yosinski shows CNNs in action in this fun video.

It turns out that the filters in the first layers in CNNs correspond to the style of the painter — brush strokes, textures, etc. Filters in later layers happen to locate and recognize major objects in the image, such as a dog, a building or a mountain.

By passing a Picasso painting through a CNN and noticing how strongly the filters in the first layers (the style layers) are activated, we can obtain a representation of the style Picasso used in it. We do the same for the content image, but this time with the filters in the last layers (the content layers).

Q: Ok, then how do I combine the style and content?

A: Now it gets interesting. We can compute the difference between the styles of two images (style loss) as the difference between the activations of the style filters for each image. Same thing for the difference between content of two images (content loss): it’s the difference between the activations of the filters in the content layers of each image.

Let’s say we want to combine the style of a Picasso painting with a picture of me. The combination image starts off as random noise. As the combination image goes through the CNN, it excites some filters in the style layers and some in the content layers. By summing the style loss between the combination image and the Picasso painting, and the content loss between the combination image and my picture, we get the total loss.
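To make this concrete, here is a minimal numpy sketch. The activation tensors are made up for illustration, standing in for real CNN outputs; in practice the style comparison is usually done on Gram matrices of the activations rather than on the raw activations, but the overall loss structure is the same:

```python
import numpy as np

def mse(a, b):
    # Mean squared difference between two activation tensors.
    return np.mean((a - b) ** 2)

# Hypothetical activations, shape (height, width, filters) per layer.
rng = np.random.default_rng(0)
style_act_painting = rng.normal(size=(32, 32, 64))   # style layer, the painting
content_act_photo  = rng.normal(size=(8, 8, 256))    # content layer, the photo
style_act_combo    = rng.normal(size=(32, 32, 64))   # same layers, for the
content_act_combo  = rng.normal(size=(8, 8, 256))    # combination image

style_loss   = mse(style_act_combo, style_act_painting)
content_loss = mse(content_act_combo, content_act_photo)
total_loss   = style_loss + content_loss
```

Real implementations also weight the two terms (and sum over several style layers), so you can trade off how "painterly" versus how recognizable the result is.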

Content, Style and the initial combination image

If we could change the combination image so as to minimize the total loss, we would have an image that is as close as possible to both the Picasso painting and that picture of me. We can do this with an optimization algorithm.

Q: An optimization algorithm?

A: It’s a way to minimize (or maximize) a function. Since we have a total loss function that is dependent on the combination image, the optimization algorithm will tell us how to change the combination image to make the loss a bit smaller.

Q: What optimization algorithms are there?

The ones I’ve encountered so far fall into two camps: first-order and second-order methods.

First-order methods minimize or maximize the function (in our case, the loss function) using its gradient. The most widely used first-order method is Gradient Descent and its variants, as illustrated here and explained in Excel(!).
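The update rule itself is tiny: nudge the parameters a small step against the gradient. Here is a toy sketch where a simple quadratic stands in for the style-transfer loss and a 3-element vector stands in for the image (both made up for illustration):

```python
import numpy as np

# Toy stand-in for the total loss: squared distance to a "target image".
target = np.array([3.0, -1.0, 2.0])

def loss(x):
    return np.sum((x - target) ** 2)

def grad(x):
    # Analytic gradient of the loss above; a real framework
    # would compute this for the CNN loss by backpropagation.
    return 2.0 * (x - target)

x = np.zeros(3)          # the "combination image" starts as zeros/noise
lr = 0.1                 # learning rate
for _ in range(100):
    x -= lr * grad(x)    # the gradient descent update

print(loss(x))           # close to 0: x has converged towards target
```

Variants like momentum, RMSProp and Adam all keep this basic shape and only change how the step is scaled from the raw gradient.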

Second-order methods use the second derivative (the Hessian) to minimize or maximize the function. Since the Hessian is costly to compute, the second-order method in question, L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno), uses an approximation of it.
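You rarely implement L-BFGS yourself; scipy ships an implementation, `scipy.optimize.fmin_l_bfgs_b`, which expects a function returning both the loss value and its gradient. Here it is on a toy quadratic standing in for the style-transfer loss (the target vector is made up for illustration):

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

target = np.array([3.0, -1.0, 2.0])

def loss_and_grad(x):
    # L-BFGS wants both the loss value and its gradient.
    diff = x - target
    return np.sum(diff ** 2), 2.0 * diff

x0 = np.zeros(3)  # start from "noise", as the combination image does
x_opt, final_loss, info = fmin_l_bfgs_b(loss_and_grad, x0)
```

Because it builds up a curvature estimate from past gradients, L-BFGS often needs far fewer iterations than plain Gradient Descent, at the cost of more work (and memory) per iteration.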

Which is the best optimization algorithm?

It depends on the task, but I mostly use Adam. Let’s see which one is fastest in style transfer.

Setup: The learning rate in the following experiments is set to 10, which might seem high but works out fine because we are dealing with color intensities between 0 and 255. The remaining hyperparameters of the optimizers are left at their default values. The tests were performed on a single K80 GPU on an Amazon P2 instance.

Experiment 1: 100 iterations, 300 x 300 pixels

We’ll start off with Picasso and a picture of a beautiful girl. They are both sized 300 by 300 pixels.

We’ll run each optimizer for 100 steps. That’s not enough to produce a good combination image, but it will let us see which optimizer reduces the loss fastest.