Technical Background

In the past 12 months, interest in, and development of, artificial neural networks for generating text, images and sound has exploded. In particular, methods for generating images have advanced remarkably in recent months.

Fig.1 Images of bedrooms generated with DCGAN [Radford et al. 2015]

In November 2015, Radford et al. blew away the machine learning community with an approach that uses a deep neural network to generate realistic images of bedrooms and faces. The approach relies on an adversarial training method in which a generator network generates samples from random noise, and a discriminator network tries to determine which images are generated and which are real. Over time the generator becomes very good at producing realistic images that can fool the discriminator. The adversarial method was first proposed by Goodfellow et al. in 2014, but until Radford et al.’s paper, it hadn’t been possible to generate coherent and realistic natural images using neural nets. The important breakthrough that made this possible was the use of a convolutional architecture for the generation of images. Before this it had been assumed that convolutional neural nets could not be used effectively for the generation of images, as pooling layers lose spatial information between layers. Radford et al. did away with pooling layers entirely and simply used strided convolutions (fractionally-strided, or ‘backwards’, convolutions in the generator). (If you are not familiar with what a convolutional neural network is, I made an online visualisation of one.)
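The adversarial game described above can be made concrete with a minimal sketch of the two losses. This is not Radford et al.'s code: the function names are hypothetical, the discriminator outputs are assumed to be probabilities that an image is real, and the generator loss is the commonly used non-saturating variant rather than anything stated in this article.

```python
import math

EPS = 1e-12  # avoids log(0); an illustrative choice


def discriminator_loss(d_real, d_fake):
    """Cross-entropy loss: real images should score 1, generated images 0."""
    real_term = -sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating loss: the generator wants its images to score 1."""
    return -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs for two real and two generated images.
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident < fooled)
```

The discriminator's loss is low when it separates real from generated images, while the generator's loss falls as the discriminator starts scoring generated images as real. Training alternates between reducing one and then the other.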

Fig.2 Comparison of VAE to VAE with learned similarity metric and GAN [Larsen et al. 2015]

I had been investigating generative models prior to Radford et al.’s paper, but when it was published it was obvious that this was the approach to follow. However, generative adversarial networks cannot reconstruct images; they only generate samples from random noise. So I started investigating ways to train a variational autoencoder (which can reconstruct images) with the discriminator network that is used in the adversarial approach, or even some kind of network to assess how similar a reconstructed sample is to the real sample. But before I even had a chance to do that, Larsen et al. [2015] published a paper that combined both of those approaches in a very elegant way: by comparing the difference in response of the real and reconstructed samples in the higher layers of a discriminator network, they are able to produce a learned similarity metric that is far superior to a pixel-wise reconstruction error comparison (which otherwise leads to a blurred reconstruction; see Fig.2).

Overview of the variational autoencoder model combined with a discriminator network.

Larsen et al.’s model consists of three separate networks: an encoder, a decoder and a discriminator. The encoder encodes a data sample x into a latent representation z. The decoder then attempts to reconstruct the data sample from the latent representation. The discriminator processes the original and reconstructed data samples, assessing whether they are real or fake, and the responses in the higher layers of this network are compared to assess how similar the reconstruction is to the original sample.
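As a rough sketch of the learned similarity metric: instead of comparing pixels directly, the original and the reconstruction are both passed through the discriminator, and the squared error is taken between the resulting activations. The `toy_features` function below is a hypothetical stand-in for a higher discriminator layer (it only measures differences between neighbouring pixels) and is not taken from Larsen et al.'s model.

```python
def mean_squared_error(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def pixelwise_error(x, x_rec):
    """Reconstruction error measured directly in pixel space."""
    return mean_squared_error(x, x_rec)


def learned_similarity_error(features, x, x_rec):
    """Reconstruction error measured in a network's feature space."""
    return mean_squared_error(features(x), features(x_rec))


def toy_features(x):
    """Hypothetical stand-in for a discriminator layer: neighbour differences."""
    return [b - a for a, b in zip(x, x[1:])]


x = [0.0, 1.0, 0.0, 1.0]            # a 'sharp' 1-D image
blurred = [0.5, 0.5, 0.5, 0.5]      # a blurred reconstruction
shifted = [0.25, 1.25, 0.25, 1.25]  # a sharp reconstruction, slightly too bright

for name, rec in [("blurred", blurred), ("shifted", shifted)]:
    print(name, pixelwise_error(x, rec),
          learned_similarity_error(toy_features, x, rec))
```

In this toy example the feature-space error penalises the blurred reconstruction far more heavily than the sharp but slightly brighter one, whereas the pixel-wise error does not separate them as clearly. That is the kind of behaviour the learned metric is meant to encourage.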

I implemented the model in TensorFlow, with the intention of extending it with an LSTM in order to do video prediction. Unfortunately, due to time constraints I was not able to pursue this. It did, however, lead me to build this model to generate large non-square images. Both of the previous models described modelled images at a resolution of 64x64 with a batch size of 64. I scaled the network up to model images at a resolution of 256x144 with a batch size of 12 (the largest I could fit on my GPU, an NVIDIA GTX 960). The latent representation has 200 variables, meaning the model encodes a 256x144 image with 3 colour channels (110,592 variables) into a 200-number representation before reconstructing the image. The network was trained on a dataset of all of the frames of Blade Runner, cropped and scaled to 256x144. The network was trained for 6 epochs, taking about 2 weeks on my GPU.
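The sizes quoted above can be checked with a few lines of arithmetic. The variable names are only illustrative.

```python
width, height, channels = 256, 144, 3
latent_size = 200

input_size = width * height * channels        # values in one input frame
compression_ratio = input_size / latent_size  # input values per latent variable

print(input_size)         # 110592
print(compression_ratio)  # 552.96
```

So each latent variable has to account for roughly 550 input values.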

Artistic Motivation

Reconstruction of the second Voight-Kampff test

Ridley Scott’s Blade Runner (1982) is the film adaptation of the classic science fiction novel Do Androids Dream of Electric Sheep? by Philip K. Dick (1968). In the film, Rick Deckard (Harrison Ford) is a bounty hunter who makes a living hunting down and killing replicants: androids that are so well engineered that they are physically indistinguishable from human beings. Deckard has to issue Voight-Kampff tests in order to distinguish androids from humans, asking increasingly difficult moral questions and inspecting the subject’s pupils, with the intention of eliciting an empathic response in humans, but not androids.

One of the overarching themes of the story is that the task of determining what is and isn’t human is becoming increasingly difficult as technology advances. The new ‘Nexus-6’ androids developed by the Tyrell Corporation start to develop their own emotional responses over time, and the new prototype Rachael has had memory implants, leading her to think that she is human. The method of determining what is human and what is not is most certainly borrowed from the methodological skepticism of the great French philosopher René Descartes. Even the name Deckard is strikingly similar to Descartes. Deckard goes through the film trying to determine who is and who isn’t human, with the unspoken suggestion that Deckard himself doubts whether he is human.

I won’t go into all of the philosophical issues explored in Blade Runner (there are two good articles that explore this), but what I will say is this: advances in deep learning systems are coming about as they become increasingly embodied within their environments, whereas a virtual system that perceives images but is not embodied within the environment that the images represent is, at least allegorically, a model that shares a lot with Cartesian dualism, where mind and body are separated.

An artificial neural network, however, is a relatively simple mathematical model (in comparison to the brain), and anthropomorphising these systems too readily can be problematic. Despite this, the rapid advances in deep learning mean that how models are structured within their environments, and how that relates to theories of mind, must be considered for their technical, philosophical and artistic consequences.