An autoencoder consists of two parts: an encoder, which learns to convert the input (X) into a lower-dimensional representation (Z), and a decoder, which learns to convert that lower-dimensional representation back to the original dimensions (Y).
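The encoder/decoder split above can be sketched in a few lines. This is a minimal, untrained illustration, not a real model: the weight matrices are random stand-ins for what training would learn, and the dimensions (a 12288-value input, a 64-value latent code) are assumptions chosen to match the images discussed later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: a 12288-value input compressed to 64 latent values.
input_dim, latent_dim = 12288, 64

# Random stand-ins for the weight matrices a trained model would learn.
W_enc = rng.normal(size=(input_dim, latent_dim)) * 0.01
W_dec = rng.normal(size=(latent_dim, input_dim)) * 0.01

def encode(x):
    # X -> Z: project the input down to the lower-dimensional representation.
    return np.tanh(x @ W_enc)

def decode(z):
    # Z -> Y: project the latent representation back up to the original size.
    return z @ W_dec

x = rng.random(input_dim)   # a fake flattened image
z = encode(x)               # the compressed code
y = decode(z)               # the reconstruction

print(z.shape, y.shape)     # (64,) (12288,)
```

Training would adjust `W_enc` and `W_dec` so that `y` gets as close to `x` as possible; here the reconstruction is meaningless noise, which is exactly the point the next paragraphs make.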

If your autoencoder is really good, its output will be exactly the same as the input. That would be a perfect autoencoder.

In reality though, this is rarely achieved.

So why do we autoencode anyway?

An obvious reason is compression, right? You’re reducing the dimensionality, so you can just pass the lower-dimensional data around. But think about it, and you’ll realize that even though we’ve achieved compression, we still have to pass around the decoder for anyone to understand what we’ve encoded in the first place.

Think of it like passing around notes in Mandarin in London.

Say you decide to encode the word ‘tree’. Sure, you’ve reduced what would be 4 characters in English to a single character in Mandarin. But to understand what you mean, the other person also needs to know Mandarin. Now you begin to see that to faithfully pass around 木 (mù) or its cousins, with any hope of Londoners understanding it, you also need to send along a Mandarin handbook for English speakers.

Clearly, this is not as efficient as you thought it would be. Okay, so if not for compression, what else do we use it for?

Well, we mentioned that autoencoders are not perfect, and the decoder often does not reproduce the exact input that was encoded. What if we use that imperfection instead, to produce output that is merely similar to the input?

This means we can use it to generate even more things that are similar to the input. Like produce more images of flowers from a bunch of images of flowers.

A variational autoencoder (VAE) is a special type of autoencoder designed specifically for this. A VAE introduces variation into our encodings, so it can generate a variety of outputs that resemble the input.
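The key idea behind that variation can be sketched with the standard VAE sampling step (the "reparameterization trick"). This is a hedged illustration with made-up numbers: in a real VAE, `mu` and `log_var` would come from a trained encoder rather than a random generator, and the latent size of 64 is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 64  # assumed latent size

# Instead of a single point Z, a VAE's encoder predicts a mean and a
# log-variance for each latent dimension. These are fake values here.
mu = rng.normal(size=latent_dim)
log_var = rng.normal(size=latent_dim)

def sample_latent(mu, log_var):
    # z = mu + sigma * epsilon, with epsilon drawn from a standard Gaussian.
    # Each call gives a slightly different z, hence the "variation".
    epsilon = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * epsilon

z1 = sample_latent(mu, log_var)
z2 = sample_latent(mu, log_var)
# z1 and z2 differ, so decoding them yields similar-but-varied outputs.
```

Because every draw from the latent distribution is a little different, decoding different samples gives different Pokemon-like images from the same learned distribution.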

Okay, time to embark on our Pokemon journey!

A closer look at our Pokemon

The Pokemon we’ll be working with are the Nintendo DS Style bitmap images. You can grab a copy here.

The resolution of each image is 64 x 64 pixels. I thought this would be a good leap from the friendly MNIST dataset that everybody likes to play with, and this time it would be in color, for a change.

Each pixel is described by three RGB values, so each image is described by 64 x 64 x 3 values. This gives us a total of 12288 values per image.
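The arithmetic above is easy to check by flattening one image. The array of zeros here is a placeholder standing in for an actual sprite you would load from the dataset.

```python
import numpy as np

# Placeholder for one 64x64 RGB sprite (real code would load an image file).
image = np.zeros((64, 64, 3), dtype=np.uint8)

# Flattened, each image becomes a single vector of 64 * 64 * 3 values.
flat = image.reshape(-1)
print(flat.shape)  # (12288,)
```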

Also, notice that each Pokemon is unique and distinct from the rest; there’s a lot of variation. Just look at the first-generation Pokemon.