Motivation

A few weeks ago I attended one of Gene Kogan’s workshops on pix2pix and deep generative models at Spektrum Berlin. There he presented a few artistic projects that have made use of generative models. One of the projects he showed was his own, in which he used a face tracker to create a generative model that is able to mimic Trump:

This kind of demo was really refreshing for me as I’m usually not exposed to these kinds of projects at my job. After the workshop I decided to create my own project, similar to what Gene did with the face tracker but with a different person.

Generating the Training Data

The first thing I had to do was to create the dataset. For this, I used Dlib’s pose estimator, which can detect 68 landmarks (mouth, eyebrows, eyes etc.) on a face, along with OpenCV to process the video file:
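
Here is a minimal sketch of that pipeline. The landmark model `shape_predictor_68_face_landmarks.dat` is Dlib’s standard pre-trained predictor; the video file name is just a placeholder:

```python
import cv2
import dlib

# Dlib's standard face detector plus its 68-point landmark predictor.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

cap = cv2.VideoCapture("merkel_speech.mp4")  # placeholder path

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Stage 1: detect faces; stage 2: estimate the 68 landmarks per face.
    for face in detector(gray, 1):
        landmarks = predictor(gray, face)
        for i in range(68):
            p = landmarks.part(i)
            cv2.circle(frame, (p.x, p.y), 2, (0, 255, 0), -1)

    cv2.imshow("landmarks", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```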

Detecting the facial landmarks is a two-stage process: first, a face detector finds the faces, and then the pose estimator is applied to each detected face.

The pose estimator is an implementation of the paper: One Millisecond Face Alignment with an Ensemble of Regression Trees by Vahid Kazemi and Josephine Sullivan, CVPR 2014

One problem I had was that in my first implementation the face landmark detector was extremely laggy (very low frames per second, fps). It turned out that the input frame was simply too big: reducing the frame size by a factor of four improved the fps a lot. In a blog article, Satya Mallick also recommends skipping frames, but I didn’t do this as the fps was now decent enough. However, it’s something I could try later to improve performance even further.
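
The fix is a small change to the sketch above: run detection on a downscaled copy and scale the landmark coordinates back up for drawing. Again a sketch, with the factor of four from my experiments:

```python
import cv2
import dlib

DOWNSCALE = 4  # detection runs on a quarter-size frame

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
cap = cv2.VideoCapture("merkel_speech.mp4")

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    small = cv2.resize(frame, None, fx=1.0 / DOWNSCALE, fy=1.0 / DOWNSCALE)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)

    for face in detector(gray, 1):
        landmarks = predictor(gray, face)
        # Scale the landmark coordinates back to the full-resolution frame.
        for i in range(68):
            p = landmarks.part(i)
            cv2.circle(frame, (p.x * DOWNSCALE, p.y * DOWNSCALE), 2, (0, 255, 0), -1)
```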

I looked up several potential videos on YouTube that I could use to create the data, ranging from interviews to speeches by prominent people. In the end, I decided to go with Angela Merkel’s (the German chancellor’s) 2017 New Year’s speech. This video was especially well suited as the camera position was mostly static, so I could get a lot of images with the same position of her face and the same background.

One sample of the training data.
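
Putting the pieces together, the data-generation loop saves two images per frame: the raw frame (the target) and a black canvas with only the landmarks drawn on it (the input). A sketch, where the folder layout and the dot-based rendering are my choices for illustration; the pairs can then be combined side by side into the format pix2pix expects:

```python
import os
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
cap = cv2.VideoCapture("merkel_speech.mp4")

os.makedirs("original", exist_ok=True)
os.makedirs("landmarks", exist_ok=True)

count = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) != 1:
        continue  # only keep frames with exactly one clean detection

    landmarks = predictor(gray, faces[0])

    # Input image: the 68 landmarks drawn on a black canvas of the same size.
    canvas = np.zeros_like(frame)
    for i in range(68):
        p = landmarks.part(i)
        cv2.circle(canvas, (p.x, p.y), 2, (255, 255, 255), -1)

    cv2.imwrite("landmarks/%05d.png" % count, canvas)
    cv2.imwrite("original/%05d.png" % count, frame)
    count += 1
```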

Training the Model

Luckily, at the workshop Gene also pointed out some existing codebases for generative models like pix2pix, so I didn’t need to do much research. To train the model, I used Christopher Hesse’s amazing pix2pix TensorFlow (TF) implementation, which is really well documented. Christopher also gives a nice introduction to pix2pix on his own blog. If you haven’t looked at it, you should! Spoiler: it also makes use of Hello Kitty! 🐱

Notes:

The original implementation of pix2pix is actually in Torch, but I like TensorFlow more.

Also, if you don’t know what pix2pix or generative models in general are, you can think of PCA or Autoencoders, whose main goal is reconstruction. The main difference is that the “reconstruction” in generative models involves some randomness when generating the output data. For example, the Variational Autoencoder is an easy-to-understand generative model if you already know the basics of Autoencoders (see the toy sketch after these notes).

Gene is also working on a tutorial for pix2pix. I think it’s not finished yet, but on his page you can find a lot of other showcases, e.g. Neural City by Jasper van Loenen, which is also pretty cool.
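
To make the randomness point from the second note concrete, here is a toy numpy sketch of the reparameterization step in a Variational Autoencoder. The numbers are made up; in a real VAE an encoder network would produce `mu` and `log_var`:

```python
import numpy as np

# Hypothetical encoder outputs: a mean and log-variance per latent dimension,
# instead of the single fixed code a plain Autoencoder would produce.
mu = np.array([0.5, -1.2])
log_var = np.array([-2.0, -1.0])

eps = np.random.randn(2)              # fresh noise on every pass
z = mu + np.exp(0.5 * log_var) * eps  # reparameterization trick

# A plain Autoencoder would decode deterministically from mu itself;
# here each decode sees a slightly different z, so the outputs vary.
print(z)
```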

Another cool application I found is by Arthur Juliani, who used pix2pix to remaster classic films in TensorFlow. Here is a short remastered scene from the movie Rear Window, taken from his article:

Top: input video. Middle: pix2pix output. Bottom: original remastered version.

So after I cloned Christopher’s repo and processed the data with his helper scripts, the main challenge was the actual training itself, as training the model can take anywhere from 1 to 8 hours depending on the GPU and on settings like the number of epochs and images. Training on CPU was ruled out right away, as it would have taken far longer. As I don’t have a machine with GPUs at home (I know, it’s time to invest in one^^), I had to rely on cloud services.

Pix2Pix model graph in TensorFlow.

Usually, my first choice is AWS and its G2 instances, but this time I used FloydHub instead. And I must confess, it’s pretty cool. I had read about them on Hacker News and wanted to try them out. The good thing is that when you register an account, you currently get 100 hours of free GPU time, which is pretty awesome. They also have a very good CLI tool, which I always prefer over a GUI (but they have one of those as well). Training the model, which boils down to a single command line, was also very easy. My only criticism so far is that you cannot ssh into the container after the training, unlike on AWS. Sometimes you just need to change one thing without re-uploading all your files, and this is especially annoying if your files are big. On the other hand, you save money, but yeah, there are always pros and cons.

Then I finally ran the actual training on the 400 images that I had generated: 320 were used for training and the rest for validation (a small split sketch follows below). Moreover, I trained the model with different numbers of epochs (5*, 200 and 400). Training time varied with the settings, from 6 minutes to more than 7 hours.

*That was more for testing purposes. At low epoch counts, the generated output of Angela Merkel was quite pixelated and blurred, but you could already see the contours of her expressions quite well.
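
For completeness, a sketch of the 320/80 split, assuming the combined side-by-side pairs live in a hypothetical `combined` folder:

```python
import random
import shutil
from pathlib import Path

random.seed(0)
images = sorted(Path("combined").glob("*.png"))  # the side-by-side pairs
random.shuffle(images)

split = int(0.8 * len(images))  # 320 train / 80 val for 400 images
for name, subset in (("train", images[:split]), ("val", images[split:])):
    Path(name).mkdir(exist_ok=True)
    for img in subset:
        shutil.copy(str(img), Path(name) / img.name)
```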

Here are some results for one of the experiments that I ran (200 epochs). As you can see, the learning process for both the discriminator and the generator was quite noisy:

Plot for the discriminator and generator loss.

If we look at the output summary for different steps, we can see that this noise was probably caused by the flag in the background:

Output summary for different steps.

Increasing the number of epochs helped to reduce the pixelation a little, but the problem with the flag persisted.

Detected facial landmarks and the generated output for epoch 400.

Interestingly, I noticed that depending on how I turned my head, the position of the flag also changed. I guess that to improve my results, I probably need to train the model with more data.

Serving the Model

After training the model, it was time to build the OpenCV application on top of it. The biggest challenge was actually how to integrate the model into the app, as Christopher’s implementation is not really suited for this. His implementation is highly optimized for training; for example, he uses queues to read in the data. This is really good for training, but I didn’t see the necessity when serving the model. Moreover, as you may know, when saving models in TF, a lot of files are created, like the checkpoint, the weights and the metadata of the graph itself. But in production we don’t need any of those metadata files, as we just want our model and its weights nicely packaged in one file (if you want to know how to do this, Morgan Giraud wrote a nice tutorial about it). Therefore, I had to do quite a bit of reverse engineering to make the implementation more usable for the app.
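
For reference, the usual TF 1.x way to collapse a checkpoint into a single file is to freeze the graph. A sketch, where the checkpoint directory and the output node name are assumptions about how the pix2pix graph was exported:

```python
import tensorflow as tf

# Hypothetical names: adjust to the actual checkpoint directory and to the
# name of the generator's output tensor in the exported graph.
checkpoint = tf.train.latest_checkpoint("pix2pix-model")
output_node = "generate_output/output"

with tf.Session() as sess:
    saver = tf.train.import_meta_graph(checkpoint + ".meta")
    saver.restore(sess, checkpoint)

    # Bake the variable values into the graph as constants, so a single
    # .pb file is enough to serve the model.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, [output_node])

with tf.gfile.GFile("frozen_model.pb", "wb") as f:
    f.write(frozen.SerializeToString())
```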