Stabilizing neural style-transfer for video

Using noise-resilience for temporal stability in style transfer

by Jeffrey Rainy and Archy de Berker

Style transfer — using one image to stylize another — is one of the applications of Deep Learning that made a big impact in 2017. In this post we discuss the challenges of taking style transfer from still images to real-time video. In the companion piece, we give an overview of Element AI’s video style transfer system, Mur.AI.

The original paper, A Neural Algorithm of Artistic Style (Gatys, Ecker, and Bethge; 2015), presented a technique for learning a style and applying it to other images. Briefly, they use gradient descent from white noise to synthesize an image that matches the content of the target image and the style of the source image. The content and style representations they seek to match are derived from features of the VGG16 CNN, developed by Oxford's Visual Geometry Group. For more background on style transfer, see our piece on Mur.AI.

The now iconic examples from Figure 2 of Gatys et al (2015).

However, when applied frame-by-frame to movies, the resulting stylized animations are of low quality. Subjectively, they suffer from extensive “popping”: inconsistent stylization from frame to frame. The stylized features (lines, strokes, colours) are present in one frame but gone in the next:

Style transfer for video

One solution to the problems with the original method is suggested in a subsequent paper, by Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox titled Artistic style transfer for videos (2016). They present a method of obtaining stable video by penalizing departures from the optical flow of the input video. Style features remain present from frame to frame, following the movement of elements in the original video. However, the method is computationally far too heavy for real-time style-transfer, taking minutes per frame.

We decided to start with a faster option, from Johnson, Alahi, and Fei-Fei’s 2016 paper Perceptual Losses for Real-Time Style Transfer and Super-Resolution. They train a second ConvNet to approximate the time-consuming pixel-level gradient descent performed by Gatys et al. The result is much quicker to run but, naively applied to videos, produces the same “popping” problems discussed above.

Our approach

Our implementation combines the innovations of Johnson et al and Ruder et al to produce a fast style-transfer algorithm that significantly reduces popping whilst working in real-time. The stabilization is done entirely at training time, allowing for smooth style transfer of videos in real-time. The stability difference is readily visible:

The central idea employed here is that temporal instability and popping result from the style changing radically when the input changes very little. In fact, the changes in pixel values from frame-to-frame are largely noise.

We can therefore impose a specific loss at training time: by manually adding a small amount of noise to our images during training and minimizing the difference between the stylized versions of our original and noisy images, we can train a network for stable style-transfer.

Modifications to the training

We started from the Chainer Fast Neural Style implementation, which we’ll refer to as CFNS from now on.

Our modifications to the training code are located in a fork of CFNS.

We call a model stable if adding noise to some pixels in a source image results in a similar stylization. The gist of our improvement to the training is to add a loss function that captures how unstable our model is. The vanilla CFNS loss function is computed as follows:
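The repository’s actual training code is written in Chainer; as a rough NumPy sketch (with hypothetical function names, not the repository’s real code), the three vanilla terms (a feature/content loss, a style loss on Gram matrices, and total-variation regularization) look roughly like this:

```python
import numpy as np

def gram_matrix(features):
    # features: (channels, height*width) activations from one VGG layer
    return features @ features.T / features.size

def cfns_loss(content_feats, style_grams, stylized_feats, stylized_grams,
              stylized_img, lambda_f=1.0, lambda_s=5.0, lambda_tv=1e-5):
    """Sketch of the vanilla three-term CFNS loss (illustrative weights)."""
    # Content: match VGG features of the stylized and original image
    feature_loss = np.mean((stylized_feats - content_feats) ** 2)
    # Style: match Gram matrices across the chosen VGG layers
    style_loss = sum(np.mean((g_hat - g) ** 2)
                     for g_hat, g in zip(stylized_grams, style_grams))
    # Total variation: encourage spatial smoothness in the output image
    tv_loss = (np.abs(np.diff(stylized_img, axis=0)).sum()
               + np.abs(np.diff(stylized_img, axis=1)).sum())
    return lambda_f * feature_loss + lambda_s * style_loss + lambda_tv * tv_loss
```

The relative weights are tuning parameters; the values above are placeholders, not the ones used in CFNS.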

We’ve added a fourth loss component at line 177:
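Conceptually, the extra term penalizes any difference between the stylization of a clean image and the stylization of the same image with a little noise added. A minimal NumPy illustration (hypothetical names; the fork’s actual code operates on Chainer variables):

```python
import numpy as np

def noise_loss(stylized_y, stylized_noisy_y, lambda_noise):
    # Mean squared difference between the stylizations of the clean
    # and the noise-perturbed source image, weighted by lambda_noise.
    return lambda_noise * np.mean((stylized_noisy_y - stylized_y) ** 2)
```

Driving this term toward zero forces the network to produce near-identical outputs for near-identical inputs, which is exactly the stability property we want.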

Where lambda_noise is a tuning parameter and noisy_y is the stylization of the source image, y, with noise added at line 146:

The noise image we add is zero everywhere except on noise_count pixels, where it is uniformly distributed in [−noise_range, noise_range], providing two more hyperparameters.
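A sparse noise image of this kind can be generated as follows. This is an illustrative NumPy sketch under the description above, not the fork’s actual code:

```python
import numpy as np

def make_sparse_noise(shape, noise_count, noise_range, rng=None):
    """Noise image: zero everywhere except noise_count random pixels,
    each drawn uniformly from [-noise_range, noise_range]."""
    rng = np.random.default_rng() if rng is None else rng
    noise = np.zeros(shape, dtype=np.float32)
    h, w = shape[:2]
    # Pick noise_count distinct pixel locations
    idx = rng.choice(h * w, size=noise_count, replace=False)
    rows, cols = np.unravel_index(idx, (h, w))
    noise[rows, cols] = rng.uniform(-noise_range, noise_range,
                                    size=(noise_count,) + shape[2:])
    return noise
```

During training, the perturbed input is simply y + make_sparse_noise(y.shape, noise_count, noise_range); both noise_count and noise_range are then tuned alongside lambda_noise.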

Noise hyperparameters