That was one idea for a title. Another was “I want to do image generation and I don’t know what I’m doing. Help?”. Either way, this is my narrative of the past few weeks of trying to learn a bit of image generation with neural networks by means of reading, hacking, slacking (time for Sopranos), drinking coffee, just a quick reddit… ok back to work! yes, training has finished! Dammit, it didn’t work… And so on.

I knew nothing about image generation, and very little about machine learning, but I had been wanting to get into it for months. I also wanted to dig into TensorFlow (I’d only used Torch previously, and very little of it). All I needed was a project to get started, and a few weeks ago I finally decided what it would be: I was going to generate Martian-looking terrain (a.k.a. height maps). Should be a quick and easy little project for a complete beginner like myself, right? Right?

The first thing you need for a machine learning project is data. NASA can be generous from time to time, and when it comes to Mars data, they certainly are. They publish terabytes upon terabytes of elevation maps, color maps, I-don’t-know-what-they-do maps, and much more. All at a resolution of 1 pixel per meter, and all for free. So, really really high resolution (and cheap). Here’s what they look like if you plot them with matplotlib:

The real deal. Sampled patches of 64x64, each pixel is 20x20 pixels in the original data.

With that, we have the data. Good start for a machine learning project. Next step is building and training the actual model, but before that could happen I needed to install TensorFlow.

Step 1: How do I generate images?

After a few hours of the usual ceremonial swearing and offerings to the von Neumann gods, I finally got TensorFlow working. A few basic examples, such as XOR and MNIST, and I felt ready to get started with actually generating images. Generating images… Yeah, here’s where I realized exactly how little clue I had about how to actually generate images, so I turned to our modern-day Oracle of Delphi (Google) and asked what to do. Apparently an algorithm called DRAW was the way to go? Alright, let’s do it.

Yeah, right. A few sentences into the paper I realized I didn’t even know what all the things they were talking about were. RNNs I knew, alright. But auto-encoders?

Turns out, after a quick dive into the Wikipedia article on auto-encoders, it’s a pretty neat little thing. Essentially, it’s a way to encode a set of information into a smaller set of information. Say you have 256 numbers, like [4, 10, 9, 34, …]. The auto-encoder can compress that down to, say, 8 numbers [3, 6, 1, …]. Or 5 numbers. Or whatever you want. The kicker: you don’t need to specify how to do the encoding. That’s what the network figures out for you.

And the way they do that is pretty cool too. You basically give the network something as input, and then ask it to output the same thing. The catch is that in the middle, it needs to reduce the whole input to an encoding, so it can’t cheat and just pipe input to output. It needs to figure out how to take the input and encode it (this is what we, surprisingly enough, call the encoder), and then it needs to take that encoding and turn it back into the same thing again (the decoder).

To generate new images with this, you train it with the source images as both input and output. Once training is done (here’s a good point for a coffee break), you disconnect the decoder from the encoder, give it some random values as the encoding, and (at least in theory) you should see a generated image come out. Hopefully one that kind of looks like whatever the network was trained on. (Even more hopefully, it’s not exactly one of the images you trained it with, because then you might have overcooked your network, a.k.a. overfitting.)
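To make that concrete, here’s a minimal sketch of the idea in NumPy (a toy linear auto-encoder, not the actual network from this post): the encoder and decoder are each a single weight matrix, trained by plain gradient descent to reconstruct the input, and afterwards the decoder alone is fed random codes to “generate”.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 flattened 4x4 "patches" of 16 values each.
X = rng.normal(size=(200, 16))

# Encoder and decoder are each a single weight matrix,
# squeezing 16 numbers down to a 4-number encoding and back.
W_enc = rng.normal(scale=0.3, size=(16, 4))
W_dec = rng.normal(scale=0.3, size=(4, 16))

def recon_loss():
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

loss_before = recon_loss()

lr = 0.05
for step in range(2000):
    Z = X @ W_enc            # encode: 16 numbers -> 4
    X_hat = Z @ W_dec        # decode: 4 numbers -> 16
    err = X_hat - X          # how far off the reconstruction is
    # Gradient descent on the mean squared reconstruction error.
    W_dec -= lr * (Z.T @ err) / len(X)
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)

loss_after = recon_loss()    # should be noticeably lower than loss_before

# "Generation": throw away the encoder and feed random codes to the decoder.
random_codes = rng.normal(size=(5, 4))
generated = random_codes @ W_dec   # five brand-new 16-value "patches"
```

On random noise this obviously won’t produce landscapes, but it shows the full loop: train to reconstruct, then detach the decoder and sample.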

So I figured, let’s try building that first of all. I hooked up a super simple encoder and decoder, both with 0 (zero) hidden layers (UPDATE: together they count as having 1 hidden layer, see comment in pic below), trained it, pulled out the decoder, gave it a bunch of random numbers and… it actually worked! Here’s a set of generated images from my auto-encoder:

740 epochs, 256 batch size, 16x16 input & output, 32-variable encoding, 0 hidden layers (UPDATE: it’s actually 1 hidden layer, as the encoding counts as a layer. Thanks BeatLeJuce for pointing it out). Takes about a minute to train.

Not too bad, right? Well, that understates the excitement I felt at this point. I was expecting it to generate hopeless noise, and here it was outputting something that somewhat resembled landscapes. Magic!

Now after this early success, I figured I’d just have to scale up the image size, have two coffees instead of one, and I’d be done with it.

Not so.

And here’s where my memory turns fuzzy on what exactly I did and didn’t do to this poor network. I know I tried a lot of different stuff. I tried adding more layers. I tried changing the size of the layers. I tried tweaking parameters. Changing optimizers. Switching the activation function from Sigmoid to Tanh to ReLU and then back to Sigmoid again. I went into a haze of convolutional layers (more on that later). I even tried making the network somewhat recurrent by always feeding half of the last generated image back in, so that it would be able to generate larger images while still working in patches of 16x16. But I just couldn’t get it to scale up very well.

On top of this, I knew I had only implemented a small part of the DRAW algorithm. I’m a little fuzzy on the details, but essentially DRAW doesn’t just run an auto-encoder once, it runs it several times, and each time it feeds part of the output of the decoder back into the encoder (this is the RNN part). It’s also not generating the whole image in one go; instead the decoder builds it up over time. And finally, they even added what they call attention to the encoder/decoder. That’s right, this baby decides what to look at on its own. (I can’t wait until someone hooks this up to a motorized webcam.) All of this is super cool, but it also seemed hard to get right, and above all I wasn’t sure it would work great with Martian data, as it’s kind of different from, say, the houses and faces it’s normally used on. Time to see if there’s anything else out there I could use.

Step 2: DCGAN

After some Googling I came across the DCGAN paper. It looked promising: a simpler algorithm than DRAW, but still good results. I also found an implementation by Taehoon Kim on GitHub, and it didn’t look too complicated.

So what’s a DCGAN you may ask? A lot of it is actually in the name: Deep Convolutional Generative Adversarial Network. The GAN part is similar to the auto-encoder, in that it’s also two networks, but in this case they’re set up to “fight” each other. One of them is tasked with generating images, and the other one is trying to figure out if images it’s fed are fake or real. The first network (generator) gets a little better at tricking the second one (discriminator), so the discriminator needs to get a bit better at telling fake from real. And so on until eventually you end up with a generator that is so good the discriminator can’t tell generated images apart from real ones, at which point it’s outputting some really convincing generated images. That’s the theory anyway.
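In code, the “fight” boils down to two opposing loss functions. Here’s a hedged NumPy sketch of just that part (the discriminator here is a fixed random scorer standing in for a real trained network, purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)

# Stand-in "discriminator": any function mapping an image to a realness
# probability. Here it's just a fixed random linear scorer.
w = rng.normal(size=64)

def discriminator(img):
    return sigmoid(img @ w)   # probability that the image is real

real = rng.normal(size=64)    # a "real" training image (flattened 8x8)
fake = rng.normal(size=64)    # a "generated" image

d_real = discriminator(real)
d_fake = discriminator(fake)

# The discriminator wants real -> 1 and fake -> 0:
d_loss = -np.log(d_real) - np.log(1.0 - d_fake)
# The generator wants the discriminator fooled (fake -> 1):
g_loss = -np.log(d_fake)
```

Training alternates between nudging the discriminator’s weights to lower `d_loss` and nudging the generator’s weights to lower `g_loss`; that tug-of-war is the whole adversarial idea.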

The Deep Convolutional part of the name refers to how these two networks are set up internally: They’re built as deep convolutional neural networks, which means that they have several layers of what’s called convolution (on the discriminator side) and deconvolution (on the generator side). Let’s quickly jump into what a convolutional layer is.

Normally, layers in neural networks are what we call fully connected: all inputs are connected to all outputs. With many types of data that makes sense; you want all parts of the input to be able to contribute to all parts of the output. But images are what we call spatially locally correlated. That’s a fancy way of saying: if there’s a car in an image, it doesn’t matter where it is, top right corner or bottom left corner; it’s still a car. So we exploit that. Instead of connecting all input values (pixels) to all output values, we have a much smaller neural network that we slide across the picture, letting it recognize stuff locally and outputting the result for each position. This means a convolutional layer can have far fewer parameters than a fully connected layer would need. And fewer parameters is what we want.
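As a rough sketch of what “sliding a small network across the picture” means, here’s a plain-NumPy convolution of a 4x4 image with a 3x3 averaging kernel. Note the parameter savings: 9 shared weights, where a fully connected layer from 16 pixels to 4 outputs would need 64.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # The same tiny set of weights is applied at every location.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)   # a toy 4x4 "image"
kernel = np.ones((3, 3)) / 9.0          # 9 shared weights (a blur filter)
out = conv2d(image, kernel)             # -> a 2x2 map of local responses
```

A real convolutional layer stacks many such kernels (filters), each learning to respond to a different local pattern.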

Ok, back to DCGAN. Inside the discriminator you have several convolutional layers, and in the generator you do the same but with what’s called deconvolutional layers (essentially the inverse of convolution). It might sound complicated, but in terms of actual code, you can create a convolutional (or deconvolutional) layer with one line of code. Easy.
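For intuition, a deconvolutional (transposed convolution) layer can be pictured as each input value “stamping” a scaled copy of the kernel onto a larger output grid. This toy NumPy version upsamples a 2x2 grid to 5x5 with stride 2; in a real framework this is indeed a one-liner, e.g. something like Keras’s Conv2DTranspose.

```python
import numpy as np

def deconv2d(codes, kernel, stride=2):
    """Transposed convolution: each input value stamps the kernel onto
    a larger output grid (roughly the inverse of a strided convolution)."""
    kh, kw = kernel.shape
    ih, iw = codes.shape
    out = np.zeros(((ih - 1) * stride + kh, (iw - 1) * stride + kw))
    for y in range(ih):
        for x in range(iw):
            # Overlapping stamps simply add up.
            out[y * stride:y * stride + kh,
                x * stride:x * stride + kw] += codes[y, x] * kernel
    return out

codes = np.array([[1.0, 2.0], [3.0, 4.0]])   # a tiny 2x2 "encoding"
kernel = np.ones((3, 3))                     # learned weights, in practice
up = deconv2d(codes, kernel)                 # -> 5x5, upsampled
```

Stacking a few of these is how the generator grows a small random code into a full-sized image.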

That’s a whole lot of theory, but in practice it turned out to be fairly straightforward to implement. In about an evening I had something that could generate images that were half dark and half bright (copy-pasting heavily from Mr. Kim’s code):

Result of CPU cycles spent on a very silly task.

Pretty useless in itself, but it showed that what I had actually worked. Sweet, I’d just need to connect this with my data set and get it cranking.

No sooner said than done: I spent another few nights trying lots of combinations of strides, filter sizes, filter counts, shared variables, different activations, learning rates, and so on.

And finally, I got to something that actually looked pretty close to the real deal:

DCGAN, 25000 epochs, 16 mini-batch size, 64x64 images, 4 conv layers with stride 2, ranging from 256 to 32 filters

To really get a sense for the performance I also rendered a bunch of samples in 3d:

I made a website that serves as part gallery, part Turing test, where you can see a lot more of the 3d images. Go here to check it out! (And see if you can beat the discriminator at figuring out whether a landscape is real or fake.)

And this is where I think I’ll leave it. There are lots and lots of improvements that could be made. For instance, the whole thing has no sense of real-world scale; everything is normalized to 0–1, which means a rock in the real world turns into a spike in the model. There’s no way to stitch patches together, or to generate anything bigger than 64x64 right now. Also, I’m training this on a (you’ll laugh) MacBook Air, which means I’m not even using GPU acceleration, since it isn’t available for Mac yet. I haven’t trained anything for longer than a few hours. And obviously there are endless parameters to tweak and other algorithms to try. But I think what I have now is enough to say: it’s probably possible to generate very realistic-looking landscapes using neural networks and machine learning.

Let’s wrap this up

Well, that’s it. I know it’s a little… all over the place, but I guess that’s an accurate reflection of my process. A couple of thoughts before we part ways:

This stuff is cool as sh*t. I can’t believe someone (like me) with no real training or background can just pick it up and build stuff. It’s also very hackable; you can try this and then that and kind of get a feel for what does what.

Visualize, visualize, visualize. In the end, most of my code was there to help me understand what was happening inside. It seems to me that visualization and shorter feedback loops are the two things that will push this field forward the most in the next few years.

ReLU is (apparently) our new lord and savior. The Adam optimizer also seems to kick ass. I know what a ReLU does (now), but I have no idea what the Adam thing does. But hey, it worked.

It’s happening NOW. DCGAN and DRAW were both published last year. They’re probably already old in the right circles (I heard something about PixelRNN and PixelCNN? Is that what’s up now? No idea).

Alright that’s all. Hope you enjoyed it, and maybe learned something!