The SnowGAN

Season's Greetings With Pix2Pix

First and foremost, a thousand pardons for the cheesy title of this blog post. It’s only five days until Christmas, and I needed to come up with an apt theme. Once I thought of the name, it stuck and I struggled to (or didn’t want to) come up with a new one. I’m awful, I know.

TL;DR

If you can’t be bothered to read this entire post, firstly I don’t blame you — it’s Christmas after all, what the hell are you doing reading this post — and secondly, you can see the output of the network’s learned representation of The Snowman animation right here:

You can also see the GAN trying to predict frames of yours truly in the style of the animation below!

Introduction

For this post, I’ll describe some of my experiments on using neural networks in conjunction with the 1982 short animation, The Snowman. I’ll describe the method, provide the code necessary to produce similar results and leave some musings on the neural aesthetic.

As a side note, if you are already a pro GAN user, there may not be a lot for you to get out of this blog post technically, but perhaps you’ll enjoy the results; I am approaching this as an artist, not as a scientist or engineer!

Previous Works

Representing and stylising a sequence of images within (or using) a neural network is not a new idea, and there have been multiple artworks, papers and blogs, well before this Medium post, that tackle the question of re-synthesising video in an aesthetic manner.

The first port of call would be Terence Broad. Like me, Terence studied at Goldsmiths, and is working at the intersection of artistic practice and machine learning engineering. He is rightly well known for his work on autoencoding Blade Runner, which I had the luck to see when I was in South Korea at Art Centre Nabi. You can read about the technical details in his dissertation or get an overview on his Medium blog post.

Arthur Juliani, a frequent blogger who has written some really great pieces, wrote a blog post on using Pix2Pix to remaster classic films. He focused on colourising classic films and also filling in unseen spaces by extending the aspect ratio. Although he overfit a lot of the models, as creatives we don’t really mind, as we are more interested in the aesthetic output of each respective piece than in the capacity to generalise to new and unseen examples. At least, most of the time — don’t take that last statement as an absolute truth! You can read about Arthur’s work here — it’s very interesting and well written.

Memo Akten is a highly regarded artist who creates loads of useful tools for artists who want to work with code — my favourite being this. He overfit a GAN on images from the Hubble space telescope and asked it to predict images from a live webcam input. Plus he has impeccable musical taste.

One of my favourite artists, although sometimes I get the impression he is more of a machine learning curator, is Gene Kogan. In addition to being a very nice and approachable guy, he is very good about sharing the code that he makes and he regularly creates quality tutorials for artistic machine learning, so I highly recommend you give him a Google if you haven’t come across him before.

He brought style transfer into the real world with his installation, Cubist Mirror. It’s a really nice example of how we can take what is a technically complex process of collating data and running training scripts and make the machine learning side very accessible to people. That topic is a whole other blog post for another time, but see the Cubist Mirror video below.

Parag Mital runs an awesome Kadenze course which I strongly recommend you check out if you want to learn about TensorFlow. He’s a super nice guy and is doing loads of interesting work with deep learning. Back in 2013 he did some work with Mick Grierson and Tim Smith in visual corpus-based synthesis. Whilst this is a different approach to the neural one of this blog post, it produced some very compelling results and is worth checking out — I have included a link below:

What Is Pix2Pix?

This blog post’s premise is simple: I have one image, and I want it to look like another image — so how do I do that? By employing neural networks, we can learn a mapping between these images, which can then be extended to sequences of images, which is of course a video.

There are many approaches one could take, but we’ll be looking at a special type of neural network described in the paper, Image-to-Image Translation with Conditional Adversarial Networks, which can be found here. However, for your leisure, I’ll try to summarise the main points of the Pix2Pix model.

The model the authors use is something called a conditional generative adversarial neural network. This is admittedly a bit of a mouthful, so I’ll break it down. As we know, a neural network is a way of representing a mapping from data inputs to data outputs. These inputs and outputs vary from problem to problem, but the key takeaway is that one can train a neural network to represent something as another thing, without explicitly coding the exact instructions (or algorithm) on how to do so.

There are many types of neural network, and an architecture that is particularly popular and immensely successful in the image (and other) domains is the convolutional variant. Processing and understanding images is a tough task; all the computer sees is a mass of pixel values, and from these intensity values alone the algorithm must infer highly complex structures and concepts. The number of values is image width * image height * number of colour channels (usually red, green and blue), and if you shift an image one pixel to the right and compare it pixel-wise to the original, there may be a massive difference between the two! A convolutional neural network (CNN) is good at processing images because it has some prior knowledge baked into its design: convolutions. Convolutions have been used in image processing for a long time and are simply ways of turning one image into another.
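
To make that concrete, here is a small sketch of a hand-written 3x3 edge-detection kernel being applied to an image with OpenCV; the file path is just a placeholder, and this is exactly the kind of filter a CNN would instead learn from data.

```python
import cv2
import numpy as np

# A classic hand-crafted edge-detection kernel. A CNN learns many kernels
# like this from data rather than having them designed by hand.
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=np.float32)

# 'frame_0001.png' is a placeholder for any frame extracted from the video.
image = cv2.imread('frame_0001.png', cv2.IMREAD_GRAYSCALE)
filtered = cv2.filter2D(image, -1, kernel)  # convolve the kernel over the image
cv2.imwrite('frame_0001_edges.png', filtered)
```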

We can see an example of a convolutional kernel above and how it transforms an image. Now imagine that we have learnable kernels, and it becomes a little clearer why CNNs are better suited to processing images. CNNs learn by minimising an objective function, such as the Euclidean distance between an input image and the target output image. This can be problematic, however, because using Euclidean distance means we have inadvertently asked the CNN to average over all plausible outputs, and we therefore may end up with blurry images. Designing objective functions takes a lot of expert intuition, and it would be preferable if the objective function for our arbitrary goals could be learned automatically.

Using a generative adversarial network (GAN) allows a high level goal of making the output images indistinguishable from reality to be automatically learned as the objective function. GANs employ two networks that play against each other in a game theoretic manner, training one another in an arms race of sorts. The discriminator is one of the two networks comprising a GAN; it must determine which examples are real and which are fake, the fakes being created by the other network, the generator. A blurred image is easy to spot as a fake, and so the generator learns not to make blurry images! As one network grows stronger, the other must adapt and improve. Therein lies the difficulty of training GANs; they are notoriously unstable during training.

Pix2Pix employs a conditional GAN, which differs from a traditional GAN in one key way: the discriminator is fed both the input image and either the real or synthesised output image, and it must then decide whether that output image is real or fake.

There are a few modifications to how the typical generator and discriminator networks are implemented, mostly in skip connections between layers, in an arrangement described as a “U-Net”. This allows, for example, the low level convolutional information to reach the high level deconvolutional layers of the generator’s decoder. There is also an L1 (absolute) loss between the generated and target images added to the adversarial objective, which helps drive the learning towards results that are perceptually closer to the target images.
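
As a rough, hedged sketch of that combined objective (not the exact code from the paper or any particular implementation), the generator’s loss looks something like an adversarial term plus a heavily weighted L1 term:

```python
import tensorflow as tf

def generator_loss(discrim_fake_output, generated_image, target_image, lambda_l1=100.0):
    # Adversarial term: reward the generator for making the discriminator
    # believe its output is real (discrim_fake_output is a probability).
    gan_term = tf.reduce_mean(-tf.math.log(discrim_fake_output + 1e-12))
    # L1 term: keep the generated frame close to the target frame pixel-wise.
    l1_term = tf.reduce_mean(tf.abs(target_image - generated_image))
    # The paper weights the L1 term heavily (lambda = 100).
    return gan_term + lambda_l1 * l1_term
```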

A particularly nice TensorFlow implementation of Pix2Pix can be found here and unlike a lot of the code that I tend to find on the internet, as Steve Jobs might have put it, “it just works”.

Some Considerations

Often forgotten in important deep learning papers are the computers that were required to produce the results. On social media, or wherever you digest the cool new deep learning tech, we’ll often see stupendous results from generative machine learning models. These models often come from Google or a similar organisation that has access to an abundance of computational resources. As hackers, artists and practitioners, it is often impossible to reproduce the amazing results without hundreds of parallel GPUs. Some examples would be WaveNet, SampleRNN and lots of reinforcement learning problems.

Because of this, it is important to highlight experiments one can feasibly do at home, that won’t break the bank through either the rental or acquisition of new hardware. All of the work for this project was done over a few days on my laptop’s old GPU, which I am very impressed by, as this is a very complex representation learning problem.

Getting Data

A key component of the neural network revolution is the abundance of data; humans have become increasingly adept at creating sources of data in the information age we now live in. Mobile phones and software emit large quantities of data in different formats and structures, and it is up to us as creative technologists and AI practitioners to take advantage of this.

Taking advantage of free online services such as YouTube, Twitter or any website you can think of, is a great way to start scraping data to use for your projects. A nice utility I like is youtube-dl, which makes it easy to download videos.
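
Something like the following will download a video as an mp4; the URL here is just a placeholder.

```bash
youtube-dl -f mp4 -o snowman.mp4 "https://www.youtube.com/watch?v=VIDEO_ID"
```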

Once we have a video, we need to get the frames from it. We can do that using the great utility ffmpeg. We can control the frame dimensions using the scale argument — here we have a video 640 pixels wide and 360 pixels high.
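
Assuming the video was saved as snowman.mp4 and we want the frames in a folder called frames (both names are placeholders), the command looks something like this:

```bash
mkdir -p frames
ffmpeg -i snowman.mp4 -vf scale=640:360 frames/frame_%04d.png
```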

Running this will create many images from your video at the location that you specified at the end of the ffmpeg command.

Colourisation

The first thing I tried was automatic colourisation of the snowman frames. I let the model overfit the data, and the results were essentially the same as the source material. Whilst this is an achievement in itself, the subjective aesthetic quality of the neural network’s output is not particularly remarkable.

Below is the output of the neural network. I fed it each black and white frame and asked it for an image back. This is the colourised video that I received in return.

Canny Edge Detection

There are a multitude of ways we can transform our video frames, and in particular reduce the information present in the input signal. It is by doing this that we force our networks to make mistakes, which may be aesthetically interesting.

If we also would like to feed in new, entirely unseen line drawings to be styled as illustrations from The Snowman, then using an edge detector is an ideal transformation to make on our video frames. In the original paper the authors referenced Christopher Hesse, who made the popular edge2cat. Memo Akten, who I mentioned previously, also made use of edge detection in his work, which you can see below.

We can do similar things using OpenCV, which has a great implementation of Canny Edge Detection. Using this, we can turn The Snowman from a colourised animation into a simple line drawing. We can then use Pix2Pix to learn the mapping between the two. First, let’s look at the results of the Canny Edge Detection.

To get these frames, we can write a simple Python script to generate the images for us, using a “parameter-less” Canny Edge Detection based on the code from this blog post. All you need to do is set the images_directory variable to the folder of video frames you just made with ffmpeg, and set the target_directory to the path where you would like the resulting line-edged images. Note that the images are inverted at the end to give black lines on a white background. A minimal sketch of such a script is below.
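
This is a rough reconstruction rather than the exact script from the original post; the directory names and the median-based thresholds are my assumptions.

```python
import os

import cv2
import numpy as np

images_directory = 'frames/'        # frames produced by the ffmpeg command above
target_directory = 'frames_edges/'  # where the line drawings will be written

os.makedirs(target_directory, exist_ok=True)

for filename in sorted(os.listdir(images_directory)):
    image = cv2.imread(os.path.join(images_directory, filename))
    if image is None:
        continue  # skip anything that isn't an image
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # "Parameter-less" Canny: derive the two thresholds from the median intensity.
    median = np.median(grey)
    lower = int(max(0, 0.66 * median))
    upper = int(min(255, 1.33 * median))
    edges = cv2.Canny(grey, lower, upper)
    # Invert so we get black lines on a white background.
    edges = cv2.bitwise_not(edges)
    cv2.imwrite(os.path.join(target_directory, filename), edges)
```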

Once we have generated the images as line drawings, we can create a dataset using the utility functions that come with the TensorFlow implementation of the Pix2Pix algorithm.
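
If memory serves, that repository’s tools/process.py script builds the paired images, roughly like this; the directory names are mine, and it’s worth checking the repo’s README for which folder goes on the A side and which on the B side.

```bash
python tools/process.py \
  --input_dir frames_edges \
  --b_dir frames \
  --operation combine \
  --output_dir snowman/combined
```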

We now have a dataset of combined images that we can use so we may begin training! To do this, follow the instructions on the GitHub page for the TensorFlow model.
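
As a rough idea of what that looks like (the flags below are from my recollection of that repository’s README, and the directory names are placeholders):

```bash
python pix2pix.py \
  --mode train \
  --output_dir snowman_train \
  --max_epochs 200 \
  --input_dir snowman/combined \
  --which_direction AtoB
```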

Stitching It All Together

To help visualise the model, I have horizontally stacked the content using the code below. This means that the resultant frame will comprise three images: the left being the input line drawing, the middle being the image that Pix2Pix generated, and the right being the original frame from the animation. You’ll need to set the start and end frame variables to whatever the ffmpeg command produced.
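
Since the original gist isn’t embedded here, this is a minimal sketch of that stacking script; the folder names, the frame numbering pattern and the frame range are assumptions you’ll want to adjust.

```python
import os

import cv2
import numpy as np

# Set these to whatever range of frames the ffmpeg command produced.
start_frame, end_frame = 1, 2000

input_dir = 'frames_edges/'       # Canny line drawings (network input)
output_dir = 'frames_generated/'  # Pix2Pix predictions
target_dir = 'frames/'            # original animation frames
stacked_dir = 'frames_stacked/'

os.makedirs(stacked_dir, exist_ok=True)

for i in range(start_frame, end_frame + 1):
    name = 'frame_%04d.png' % i
    left = cv2.imread(os.path.join(input_dir, name))
    middle = cv2.imread(os.path.join(output_dir, name))
    right = cv2.imread(os.path.join(target_dir, name))
    if left is None or middle is None or right is None:
        continue
    # Resize the input and prediction to match the original frame, then stack.
    h, w = right.shape[:2]
    left = cv2.resize(left, (w, h))
    middle = cv2.resize(middle, (w, h))
    stacked = np.hstack([left, middle, right])
    cv2.imwrite(os.path.join(stacked_dir, name), stacked)
```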

Now that we have stacked the three images (input, output, target) into one for every frame of video, we can use ffmpeg to stitch it all back together into a new video. The -r flag sets the fps of the video, the -s flag sets the width and height, and the others are to do with codecs and quality — it’s probably easier to leave them be unless you need something else!
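
A command along these lines does the job; the frame rate, the size (1920x360, because three 640x360 images sit side by side) and the file names are assumptions to adjust for your own frames.

```bash
ffmpeg -r 25 -f image2 -s 1920x360 -i frames_stacked/frame_%04d.png \
  -vcodec libx264 -crf 17 -pix_fmt yuv420p snowman_stacked.mp4
```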

We now have the model’s output as a video, ready to view! Other useful ffmpeg commands for this project were stripping audio from a video and adding it back again, which I’ll leave below as the first and second commands respectively!
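
The file names here are placeholders; the second command takes the video from the stitched file and the audio track from the original.

```bash
# 1. Strip the audio track, keeping the video untouched.
ffmpeg -i snowman.mp4 -c:v copy -an snowman_silent.mp4

# 2. Add the audio from the original film back onto the stitched video.
ffmpeg -i snowman_stacked.mp4 -i snowman.mp4 -c copy -map 0:v:0 -map 1:a:0 -shortest snowman_final.mp4
```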

Results

After training for just three epochs, the results are already very interesting. Again, the left-most image is the Canny edge version of the right-most image, which is the original animation frame. The middle image is the image predicted by the conditional GAN after it sees the left-most image.

The network has successfully managed to paint frequent objects it sees throughout the animation, such as the boy’s ginger hair and the snow. Interestingly, it makes mistakes on many of the surfaces such as the walls, floor and snow, where it overlays a static snowflake pattern. Possibly this snow pattern appeared because the discriminator learned that it was an important artefact, and the generator needed to put it into its synthesised images.

If we train further, to nine epochs, the results look like this. I have added in the music for this video.

There is one final thing we can now do with our overfit model. We can apply the Canny edge detection process highlighted before to a novel video source (like me sat on the sofa) and run it through the model to see what it predicts.

My favourite thing about this video is that for most of the frames I have the ginger hair of the boy from the film! It is clear we are creating completely new and interesting moving images that would be difficult to make by hand or with traditional technologies.

The Neural Aesthetic

Gene Kogan has a lecture called The Neural Aesthetic and although he usually delivers it in person you can see a version of it here.

In these neural aesthetic lectures he shows the work he has been doing and curating that explores the aesthetic side of machine learning. This approach of amassing artificial intelligence techniques that reflect on creative processes helps to contextualise the work in this blog post; this post is merely copying, explaining and demonstrating what others have done before, and is ultimately a small cog in a much larger machine.

Aesthetics explores the nature of art, beauty and taste, with creations in various types of media. We are in an exciting time of applying machine learning to media and exploring how we can synthesise compelling, new artistic material using connectionist neural processes. Of course, using technology in artistic processes is no new thing; there was a time when mineral pigments blown on to cave walls were the state of the art.

A previous technology being used for creative purposes.

Today, at the forefront and cutting edge of science and art, there is a growing quantity (and quality) of interesting ways to use computational learning processes as an artistic medium, appearing at the frantic rate we have become accustomed to in the lightning-fast field of machine learning. We, as collective nerds and creatives, can use these to create abstract and novel ways of generating new forms of aesthetic material.

It is up to us as members of the creative artificial intelligence community to devise new ways to apply these neural models and come up with new architectures and datasets, to further the scope of what we do with computational processes today.

Thanks for reading! If you want to continue the discussion, leave feedback or just have an argument — leave me a comment here or on Twitter and I’ll be sure to get back to you. All the best and season’s greetings!