At Expedition Technology (EXP) we develop a broad set of deep learning solutions for our customers. Each deep learning development cycle typically starts with

Understanding the problem space

Getting acquainted with the research landscape

Tweaking an existing algorithm or developing entirely new architectures

Training on an army of GPUs

This is the standard process, but with a constraint: it requires very large diverse data sets to get good results. As many of our customer’s problems grow more sophisticated, absence of that constraint is becoming an ever rarer occurence. In these cases where data is scarce, there is a necessary additional step – amplifying the data that you have.

For help with this, we have been turning to Generative Adversarial Networks (GANs). Despite their wide-ranging success, deep generative methods are hindered by well-known drawbacks such as unstable minima and mode collapse. We have recently made progress regarding the latter and would like to share our methods with the rest of the deep learning community. In this post we will introduce GANs, describe mode collapse, and then explain how we’ve attempted to mitigate this problem while adding justifications and results to support our claims.

GANs

Generative Adversarial Networks [1] (GANs) are an incredible technology. Although classification and segmentation are necessary problems, they don’t have the catchy, easy-to-appreciate results GANs do. After all, you can’t become a great artist just by learning to distinguish Van Gogh from Monet. You have to actually pick up a paintbrush and try your hand at it. Similarly, if we strive to make intelligent systems, they must be able to not only discriminate, but to generate believable outputs. That’s where we cross the border from a passive to an active agent.

GANs operate by combining two networks – one that creates output, and one that provides feedback. The ‘generator’, as it’s called, is provided a random input and tries to return a correspondingly random output. The ‘discriminator’ then compares this generated sample to real world ones and gives a zero to one score of how believable it is. It’s really just a competition: the generator is trying to fool an ever-improving discriminator. If you let them duke it out a few million times, you end up with a discriminator that learns the real world from the fake world, as well as a generator that does a pretty good job at making realistic looking samples.

This is a powerful tool, as it theoretically allows for creating unlimited additional data. If the generated samples are within the set of all possible inputs, then we can turn 100 data points into 1000 by letting the generator hallucinate 900 new but plausible examples.

Mode collapse

There’s a problem, though. Let’s look at the following situation [2] as a GAN tries to make pictures of cars:

After bumbling around for a bit, the generator learns to draw convincing Honda Civics The discriminator picks up on this and starts labeling most Honda Civics as generated In response to this, the generator tweaks its algorithm a bit and begins making a similar but separate class – Honda Accords Now the discriminator has to adjust, so it starts calling Honda Accords fake While the discriminator is distracted by Accords, the opportunity presents itself to start making convincing Civics again, which the generator happily reverts to Repeat steps 2-5

This infinite loop of similar outputs is termed mode collapse, and it is one of the things restricting GANs from being widely used as a data amplification tool. The consequence of mode collapse is that we cannot create an unlimited supply of unique samples, since our generator only flicks back and forth between a couple very similar outputs. This minimally satisfies the job of fooling the discriminator but is ultimately unhelpful if we are trying to stretch the effectiveness of our currently available data.

How to avoid mode collapse

To reconcile this, we decided to add a constraint: the generator outputs must be random, but in such a way that any such random output is believable. An intuitive way to enforce this is to find some compressed space Χ that is densely packed with examples, such that any point within that space corresponds to a true data sample. If we can also find a bijection f: Χ→Y from X, our densely packed space, to Y, our space of real examples, then we can randomly sample Χ, and convert those points to plausible outputs.

Luckily for us, autoencoders are great at finding exactly such a space and such a function. The basic idea is that an autoencoder takes input, processes it to a lower dimensionality vector, then reconstructs the input from that vector. The bottleneck in the middle, then, contains the relevant information about the input with fewer variables, providing us a compressed space, referred to as the latent space. The decoder, given a point in that space, recreates the input that was encoded, which provides us with our bijection f. This relies on two assumptions that we will provide evidence for in the next section.

What does this all mean? If we set up an autoencoder to densely encode inputs to a latent space, then any randomly sampled point in that latent space should give a realistic, equally random output upon decoding. Somewhat surprisingly, with a small enough dimensionality of the latent space, this actually works.

To employ this effectively, we make a small GAN that finds a sub-basis of this latent space, and then take random samples from this sub-basis. In practice, this means that we train a GAN to generate a batch of vectors, enforce that they are orthogonal using their dot product, and then take random linear combinations of these vectors. The discriminator then decides whether these linear combinations are convincing latent space encodings. Those that fool the discriminator get decoded into realistic samples. Due to the sampling being random and the decoder being a bijection, our results are random elements that are indiscernible from the true data. See the figure below for some examples of non-cherrypicked eights generated by the network.

The reason for having the GAN find a sub-basis is that it is difficult to find a perfect dimensionality of the latent space. This means that not every one of the axes is guaranteed to be utilized evenly. Therefore, it is more sensible to choose a dimensionality that allows the autoencoder some leniency, and to then let the generator learn the necessary basis of ‘highest plausibility’.

This approach is reminiscent of variational autoencoders (VAEs) [4], which also encode the data samples for the purposes of generation. VAEs, however, sample the latent space differently, electing instead to add random std. normal vectors to the encodings. In a VAE, the normal vectors are based on a mean and standard deviation that are also created by the encoder. In our approach, the encoder simply defines the latent space, which is then sampled by a wholly separate GAN.

Reasoning for why this works

There are two critical assumptions that substantiate our approach:

The latent space is densely packed The decoder approaches a bijection

We provide two points of evidence to show that the latent space is densely packed. The first is a thought experiment. Given inputs that have 10 independent variables, and an encoded vector of length 5, we should expect that an autoencoder learns to utilize every degree of freedom to its fullest extent. If, instead, it only uses three axes of the five provided to it, the autoencoder will be further from representing the ten independent variables of the input space, implying that an easy lower minimum is available on the error landscape. This presents the caveat that our encodings need to be smaller in dimensionality than the number of independent variables in the input space. Such a requirement ensures that the optimal encoder takes advantage of every axis provided to it. Simply said, if you don’t give the encoder adequate dimensionality to represent the information, it must learn to take advantage of everything it has.

The second point is empirical, as seen by traveling through a latent space. It turns out, if we encode two handwritten MNIST digits to a latent space, the points between their encodings also represent plausible outputs, as seen in the figure [3] below. This implies that, given two known points in latent space, any point randomly between them is likely to also represent believable outputs. Our approach treats the latent representations differently by making a unique space for each digit, rather than a single latent space for all of them. In either case, the result should still hold.

Towards the second assumption, it is not true that the decoder is a true bijection, in part due to the discrete nature of the dataset. However, we can make a case that the decoder of a functional autoencoder will approach a bijection, as long as the encodings map to a densely packed space. We do this by showing that the encoder approaches a bijection from true inputs to a unique point in the latent space. The decoder then, as the inverse of the encoder, must learn the inverse bijection.

Before explaining the reasoning for the decoder being a bijection, we want to touch on why this is necessary. A bijection is a function f: X →Y that is both ‘onto’ and ‘one-to-one’. This means that any possible value O ∈ {Outputs} has exactly one corresponding input I for which f(I) = O. If both the encoder and the decoder are bijections, then any point randomly sampled in the latent space must have a unique, correspondingly random point in the true data space.

We can claim that the encoder is ‘onto’ as a consequence of our reasoning for the latent space being densely filled. In order to fill that dimensionality, the encoder must attempt to map the inputs into different locations within the latent space. As such, if the whole constrained-dimensionality latent space is filled, then the encoder is onto. We can also show that a working autoencoder’s encoder is ‘one-to-one’ by contradiction. If it were not one-to-one, then two different inputs could map to the same latent representation. Due to the assumption that the autoencoder is functional, this point in the latent space would be decoded back out to the two different inputs. This is not possible by the definition of a function. As such, an optimal encoder approaches a bijection, therefore the decoder must also do the same.

These assumptions come together for the logic of our generative approach. Autoencoders can find a latent space in which every point maps to plausible outputs, and simultaneously approximate the bijection between this latent space and the output space. Therefore, randomly sampling the dense latent space corresponds to randomly sampling the set of realistic data samples. The quality of decoded samples is then a direct result of how ‘bijective’ the encoding and decoding operations are.

Results

The ultimate goal is to amplify our existing data by generating new samples that are indiscernible from the original set. To this end, we set up an experiment where we trained a basic MNIST classifier on the full train set, on a tenth of the train set, and on a tenth of the train set along with generated samples. The GAN in this case was also trained on the same tenth.

We trained the GAN on each digit independently and created 5000 new samples for each. Upon training the classifier with GAN input, we split each batch as either 25, 50 or 75 percent composed of generated digits. The rest of each batch was taken from the tenth of the train set.

We found that the network trained on a tenth of the dataset plus generated samples is more accurate on the test set than the network trained without generated samples. Specifically, we see a decrease in the error rate of up to 17% after training on our amplified dataset.

Train set All train data Tenth of train data Tenth of train data and generated 75/25 Tenth of train data and generated 50/50 Tenth of train data and generated 25/75 Test set accuracy 96.85% 94% 94.3% 95% 92.6%

References: