In two parts, the paper describes an algorithm for rendering a photo in the style of a given painting:

1. Run an image through a DCNN trained for image classification. Stop at one of the convolutional layers and extract the activations of every filter in that layer. Now run an image of noise through the net and check its activations at that layer. Make small changes to the noisy input image until the activations match, and you will eventually construct a similar image. They call this “content reconstruction”, and depending on what layer you do it at you get varying accuracy.
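A minimal numpy sketch of the idea. Everything here is a toy stand-in: a fixed random linear map plays the role of the convolutional layer, and plain gradient descent adjusts the noise image until its activations match the target's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one conv layer: a fixed random linear map.
# (The paper uses a layer of a DCNN trained for image classification.)
W = rng.normal(size=(64, 256))

content = rng.normal(size=256)     # the "photo", as a flat vector
target_acts = W @ content          # activations we want to reproduce

x = rng.normal(size=256)           # start from a noise image
lr = 1e-3
for _ in range(2000):
    diff = W @ x - target_acts
    x -= lr * (2 * W.T @ diff)     # gradient of ||W x - target||^2

# After optimization the noise image produces near-identical activations.
print(np.linalg.norm(W @ x - target_acts))
```

The real loss is backpropagated through the network to the pixels, but the loop is the same: freeze the weights, optimize the input.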

Content reconstruction, excerpt from figure 1.

2. Instead of trying to match the activations exactly, try to match the correlations of the activations. They call this “style reconstruction”, and depending on the layer you reconstruct you get varying levels of abstraction. The correlation feature they use is called a Gram matrix: the product of the vectorized feature activation matrix with its own transpose. If this sounds confusing, see the footnotes.
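Concretely, with random numbers standing in for a layer's activations, the Gram matrix is two lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations from one layer: N filters, each an H x W spatial map.
N, H, W = 8, 4, 4
features = rng.normal(size=(N, H, W))

# Vectorize each filter's spatial map, then multiply by the transpose.
F = features.reshape(N, H * W)    # shape (N, H*W)
G = F @ F.T                       # Gram matrix, shape (N, N)
# G[i, j] is the inner product (correlation) of filters i and j,
# summed over all spatial positions -- location is thrown away.
```

Discarding the spatial arrangement while keeping which filters fire together is what makes the reconstruction capture style rather than content.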

Style reconstruction, excerpt from figure 1.

Finally, instead of optimizing for just one of these things, they optimize for both simultaneously: the style of one image, and the content of another image.
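The combined objective is just a weighted sum of the two losses above. A sketch, reusing the same toy linear "layer" (the reshape into 8 filters and the alpha/beta values are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256))          # toy stand-in for a conv layer

def acts(img):
    return W @ img

def gram(a):
    f = a.reshape(8, 8)                 # treat as 8 filters of 8 values
    return f @ f.T

content_img = rng.normal(size=256)
style_img = rng.normal(size=256)

target_acts = acts(content_img)         # content target
target_gram = gram(acts(style_img))     # style target

def total_loss(x, alpha=1.0, beta=1e2):
    a = acts(x)
    content_loss = np.sum((a - target_acts) ** 2)
    style_loss = np.sum((gram(a) - target_gram) ** 2)
    return alpha * content_loss + beta * style_loss
```

Optimizing the input image against `total_loss` pulls it toward the content of one image and the filter correlations of the other; the alpha:beta ratio decides which wins.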

Here is an attempt to recreate the results from the paper using Kai’s implementation:

In the style of “Composition VII” by Kandinsky

In the style of “The Scream” by Munch

In the style of “Seated Nude” by Picasso

In the style of “The Shipwreck of the Minotaur” by Turner

In the style of “The Starry Night” by van Gogh

The results are not quite the same, which may be explained by a few differences between Kai’s implementation and the original paper:

Using SGD, while the original paper does not specify what optimization technique is used. In an earlier texture synthesis paper the same authors use L-BFGS.

Initializing with the content image rather than noise.

Using the Inception network instead of VGG-19.

To balance the content reconstruction against the style reconstruction, the paper uses a weighting of 1:10e1 or 1:10e2, while Kai uses 1:5e9, a huge and unexplained difference. Even at a slightly lower ratio, around 1:10e8, it converges mainly on the content reconstruction and only vaguely matches the palette of the style image:

Tübingen in the style of Kandinsky, in an attempt to recreate figure 3.

As I was writing this, Kai added total variation smoothing. This certainly helps with the high frequency noise, but the fact that the original paper does not mention any similar regularization makes me wonder if they achieve this another way.
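Total variation penalizes differences between neighboring pixels, so adding it to the loss discourages exactly the kind of high frequency noise visible above. A minimal sketch of the squared-difference variant (I haven't checked which form Kai's implementation uses):

```python
import numpy as np

def tv_loss(img):
    # Sum of squared differences between vertically and horizontally
    # adjacent pixels; zero for a perfectly flat image.
    dh = img[1:, :] - img[:-1, :]
    dw = img[:, 1:] - img[:, :-1]
    return np.sum(dh ** 2) + np.sum(dw ** 2)

flat = np.ones((8, 8))
noisy = np.random.default_rng(0).normal(size=(8, 8))
print(tv_loss(flat), tv_loss(noisy))   # flat image scores zero
```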

Comparison of Kai’s implementation without smoothing (left) and with.

As a final comparison, consider the images Andrej Karpathy posted from his own implementation.

Gandalf in the style of Picasso. Left image produced by Andrej Karpathy.

The same large-scale, high-level features are missing here, just like in the style reconstruction of “Seated Nude” above.