What we’d like to find out about GANs that we don’t know yet.

By some metrics, research on Generative Adversarial Networks (GANs) has progressed substantially in the past 2 years. Practical improvements to image synthesis models are being made almost too quickly to keep up with:

Odena et al. , 2016 Miyato et al. , 2017 Zhang et al. , 2018 Brock et al. , 2018

However, by other metrics, less has happened. For instance, there is still widespread disagreement about how GANs should be evaluated. Given that current image synthesis benchmarks seem somewhat saturated, we think now is a good time to reflect on research goals for this sub-field.

Lists of open problems have helped other fields with this . This article suggests open research problems that we’d be excited for other researchers to work on. We also believe that writing this article has clarified our thinking about GANs, and we would encourage other researchers to write similar articles about their own sub-fields. We assume a fair amount of background (or willingness to look things up) because we reference too many results to explain all those results in detail.

What are the Trade-Offs Between GANs and other Generative Models?

In addition to GANs, two other types of generative model are currently popular: Flow Models and Autoregressive Models This statement shouldn’t be taken too literally. Those are useful terms for describing fuzzy clusters in ‘model-space’, but there are models that aren’t easy to describe as belonging to just one of those clusters. I’ve also left out VAEs entirely; they’re arguably no longer considered state-of-the-art at any tasks of record. . Roughly speaking, Flow Models apply a stack of invertible transformations to a sample from a prior so that exact log-likelihoods of observations can be computed. On the other hand, Autoregressive Models factorize the distribution over observations into conditional distributions and process one component of the observation at a time (for images, they may process one pixel at a time.) Recent research suggests that these models have different performance characteristics and trade-offs. We think that accurately characterizing these trade-offs and deciding whether they are intrinsic to the model families is an interesting open question.

For concreteness, let’s temporarily focus on the difference in computational cost between GANs and Flow Models. At first glance, Flow Models seem like they might make GANs unnecessary. Flow Models allow for exact log-likelihood computation and exact inference, so if training Flow Models and GANs had the same computational cost, GANs might not be useful. A lot of effort is spent on training GANs, so it seems like we should care about whether Flow Models make GANs obsolete Even in this case, there might still be other reasons to use adversarial training in contexts like image-to-image translation. It also might still make sense to combine adversarial training with maximum-likelihood training. .

However, there seems to be a substantial gap between the computational cost of training GANs and Flow Models. To estimate the magnitude of this gap, we can consider two models trained on datasets of human faces. The GLOW model is trained to generate 256x256 celebrity faces using 40 GPUs for 2 weeks and about 200 million parameters. In contrast, progressive GANs are trained on a similar face dataset with 8 GPUs for 4 days, using about 46 million parameters, to generate 1024x1024 images. Roughly speaking, the Flow Model took 17 times more GPU days and 4 times more parameters to generate images with 16 times fewer pixels. This comparison isn’t perfect, For instance, it’s possible that the progressive growing technique could be applied to Flow Models as well. but it gives you a sense of things.

Why are the Flow Models less efficient? We see two possible reasons: First, maximum likelihood training might be computationally harder to do than adversarial training. In particular, if any element of your training set is assigned zero probability by your generative model, you will be penalized infinitely harshly! A GAN generator, on the other hand, is only penalized indirectly for assigning zero probability to training set elements, and this penalty is less harsh. Second, normalizing flows might be an inefficient way to represent certain functions. Section 6.1 of does some small experiments on expressivity, but at present we’re not aware of any in-depth analysis of this question.

We’ve discussed the trade-off between GANs and Flow Models, but what about Autoregressive Models? It turns out that Autoregressive Models can be expressed as Flow Models (because they are both reversible ) that are not parallelizable. Parallelizable is somewhat imprecise in this context. We mean that sampling from Flow Models must in general be done sequentially, one observation at a time. There may be ways around this limitation though . It also turns out that Autoregressive Models are more time and parameter efficient than Flow Models. Thus, GANs are parallel and efficient but not reversible, Flow Models are reversible and parallel but not efficient, and Autoregressive models are reversible and efficient, but not parallel.

Parallel Efficient Reversible GANs Yes Yes No Flow Models Yes No Yes Autoregressive Models No Yes Yes

This brings us to our first open problem:

Problem 1 What are the fundamental trade-offs between GANs and other generative models? In particular, can we make some sort of CAP Theorem type statement about reversibility, parallelism, and parameter/time efficiency?

One way to approach this problem could be to study more models that are a hybrid of multiple model families. This has been considered for hybrid GAN/Flow Models , but we think that this approach is still underexplored.

We’re also not sure about whether maximum likelihood training is necessarily harder than GAN training. It’s true that placing zero mass on a training data point is not explicitly prohibited under the GAN training loss, but it’s also true that a sufficiently powerful discriminator will be able to do better than chance if the generator does this. It does seem like GANs are learning distributions of low support in practice though.

Ultimately, we suspect that Flow Models are fundamentally less expressive per-parameter than arbitrary decoder functions, and we suspect that this is provable under certain assumptions.

What Sorts of Distributions Can GANs Model?

Most GAN research focuses on image synthesis. In particular, people train GANs on a handful of standard (in the Deep Learning community) image datasets: MNIST , CIFAR-10 , STL-10 , CelebA , and Imagenet . There is some folklore about which of these datasets is ‘easiest’ to model. In particular, MNIST and CelebA are considered easier than Imagenet, CIFAR-10, or STL-10 due to being ‘extremely regular’ . Others have noted that ‘a high number of classes is what makes ImageNet synthesis difficult for GANs’ . These observations are supported by the empirical fact that the state-of-the-art image synthesis model on CelebA generates images that seem substantially more convincing than the state-of-the-art image synthesis model on Imagenet .

However, we’ve had to come to these conclusions through the laborious and noisy process of trying to train GANs on ever larger and more complicated datasets. In particular, we’ve mostly studied how GANs perform on the datasets that happened to be laying around for object recognition.

As with any science, we would like to have a simple theory that explains our experimental observations. Ideally, we could look at a dataset, perform some computations without ever actually training a generative model, and then say something like ‘this dataset will be easy for a GAN to model, but not a VAE’. There has been some progress on this topic , but we feel that more can be done. We can now state the problem:

Problem 2 Given a distribution, what can we say about how hard it will be for a GAN to model that distribution?

We might ask the following related questions as well: What do we mean by ‘model the distribution’? Are we satisfied with a low-support representation, or do we want a true density model? Are there distributions that a GAN can never learn to model? Are there distributions that are learnable for a GAN in principle, but are not efficiently learnable, for some reasonable model of resource-consumption? Are the answers to these questions actually any different for GANs than they are for other generative models?

We propose two strategies for answering these questions:

Synthetic Datasets - We can study synthetic datasets to probe what traits affect learnability. For example, in the authors create a dataset of synthetic triangles. We feel that this angle is under-explored. Synthetic datasets can even be parameterized by quantities of interest, such as connectedness or smoothness, allowing for systematic study. Such a dataset could also be useful for studying other types of generative models.

- We can study synthetic datasets to probe what traits affect learnability. For example, in the authors create a dataset of synthetic triangles. We feel that this angle is under-explored. Synthetic datasets can even be parameterized by quantities of interest, such as connectedness or smoothness, allowing for systematic study. Such a dataset could also be useful for studying other types of generative models. Modify Existing Theoretical Results - We can take existing theoretical results and try to modify the assumptions to account for different properties of the dataset. For instance, we could take results about GANs that apply given unimodal data distributions and see what happens to them when the data distribution becomes multi-modal.

How Can we Scale GANs Beyond Image Synthesis?

Aside from applications like image-to-image translation and domain-adaptation most GAN successes have been in image synthesis. Attempts to use GANs beyond images have focused on three domains:

Text - The discrete nature of text makes it difficult to apply GANs. This is because GANs rely on backpropagating a signal from the discriminator through the generated content into the generator. There are two approaches to addressing this difficulty. The first is to have the GAN act only on continuous representations of the discrete data, as in . The second is use an actual discrete model and attempt to train the GAN using gradient estimation as in . Other, more sophisticated treatments exist , but as far as we can tell, none of them produce results that are competitive (in terms of perplexity) with likelihood-based language models.

- The discrete nature of text makes it difficult to apply GANs. This is because GANs rely on backpropagating a signal from the discriminator through the generated content into the generator. There are two approaches to addressing this difficulty. The first is to have the GAN act only on continuous representations of the discrete data, as in . The second is use an actual discrete model and attempt to train the GAN using gradient estimation as in . Other, more sophisticated treatments exist , but as far as we can tell, none of them produce results that are competitive (in terms of perplexity) with likelihood-based language models. Structured Data - What about other non-euclidean structured data, like graphs? The study of this type of data is called geometric deep learning . GANs have had limited success here, but so have other deep learning techniques, so it’s hard to tell how much the GAN aspect matters. We’re aware of one attempt to use GANs in this space , which has the generator produce (and the discriminator ‘critique’) random walks that are meant to resemble those sampled from a source graph.

- What about other non-euclidean structured data, like graphs? The study of this type of data is called geometric deep learning . GANs have had limited success here, but so have other deep learning techniques, so it’s hard to tell how much the GAN aspect matters. We’re aware of one attempt to use GANs in this space , which has the generator produce (and the discriminator ‘critique’) random walks that are meant to resemble those sampled from a source graph. Audio - Audio is the domain in which GANs are closest to achieving the success they’ve enjoyed with images. The first serious attempt at applying GANs to unsupervised audio synthesis is , in which the authors make a variety of special allowances for the fact that they are operating on audio. More recent work suggests GANs can even outperform autoregressive models on some perceptual metrics .

Despite these attempts, images are clearly the easiest domain for GANs. This leads us to the statement of the problem:

Problem 3 How can GANs be made to perform well on non-image data? Does scaling GANs to other domains require new training techniques, or does it simply require better implicit priors for each domain?

We expect GANs to eventually achieve image-synthesis-level success on other continuous data, but that it will require better implicit priors. Finding these priors will require thinking hard about what makes sense and is computationally feasible in a given domain.

For structured data or data that is not continuous, we’re less sure. One approach might be to make both the generator and discriminator be agents trained with reinforcement learning. Making this approach work could require large-scale computational resources . Finally, this problem may just require fundamental research progress.

What can we Say About the Global Convergence of GAN Training?

Training GANs is different from training other neural networks because we simultaneously optimize the generator and discriminator for opposing objectives. Under certain assumptions These assumptions are very strict. The referenced paper assumes (roughly speaking) that the equilibrium we are looking for exists and that we are already very close to it. , this simultaneous optimization is locally asymptotically stable .

Unfortunately, it’s hard to prove interesting things about the fully general case. This is because the discriminator/generator’s loss is a non-convex function of its parameters. But all neural networks have this problem! We’d like some way to focus on just the problems created by simultaneous optimization. This brings us to our question:

Problem 4 When can we prove that GANs are globally convergent? Which neural network convergence results can be applied to GANs?

There has been nontrivial progress on this question. Broadly speaking, there are 3 existing techniques, all of which have generated promising results but none of which have been studied to completion:

Simplifying Assumptions - The first strategy is to make simplifying assumptions about the generator and discriminator. For example, the simplified LGQ GAN — linear generator, Gaussian data, and quadratic discriminator — can be shown to be globally convergent, if optimized with a special technique and some additional assumptions. Among other things, it’s assumed that we can first learn the means of the Gaussian and then learn the variances. As another example, show under different simplifying assumptions that certain types of GANs perform a mixture of moment matching and maximum likelihood estimation. It seems promising to gradually relax those assumptions to see what happens. For example, we could move away from unimodal distributions . This is a natural relaxation to study because ‘mode collapse’ is a standard GAN pathology.

- The first strategy is to make simplifying assumptions about the generator and discriminator. For example, the simplified LGQ GAN — linear generator, Gaussian data, and quadratic discriminator — can be shown to be globally convergent, if optimized with a special technique and some additional assumptions. As another example, show under different simplifying assumptions that certain types of GANs perform a mixture of moment matching and maximum likelihood estimation. It seems promising to gradually relax those assumptions to see what happens. For example, we could move away from unimodal distributions . This is a natural relaxation to study because ‘mode collapse’ is a standard GAN pathology. Use Techniques from Normal Neural Networks - The second strategy is to apply techniques for analyzing normal neural networks (which are also non-convex) to answer questions about convergence of GANs. For instance, it’s argued in that the non-convexity of deep neural networks isn’t a problem, A fact that practitioners already kind of suspected. because low-quality local minima of the loss function become exponentially rare as the network gets larger. Can this analysis be ‘lifted into GAN space’? In fact, it seems like a generally useful heuristic to take analyses of deep neural networks used as classifiers and see if they apply to GANs.

- The second strategy is to apply techniques for analyzing normal neural networks (which are also non-convex) to answer questions about convergence of GANs. For instance, it’s argued in that the non-convexity of deep neural networks isn’t a problem, because low-quality local minima of the loss function become exponentially rare as the network gets larger. Can this analysis be ‘lifted into GAN space’? In fact, it seems like a generally useful heuristic to take analyses of deep neural networks used as classifiers and see if they apply to GANs. Game Theory - The final strategy is to model GAN training using notions from game theory . These techniques yield training procedures that provably converge to some kind of approximate Nash equilibrium, but do so using unreasonably large resource constraints. The ‘obvious’ next step in this case is to try and reduce those resource constraints.

How Should we Evaluate GANs and When Should we Use Them?

When it comes to evaluating GANs, there are many proposals but little consensus. Suggestions include:

Inception Score and FID - Both these scores use a pre-trained image classifier and both have known issues . A common criticism is that these scores measure ‘sample quality’ and don’t really capture ‘sample diversity’.

- Both these scores use a pre-trained image classifier and both have known issues . A common criticism is that these scores measure ‘sample quality’ and don’t really capture ‘sample diversity’. MS-SSIM - propose using MS-SSIM to separately evaluate diversity, but this technique has some issues and hasn’t really caught on.

- propose using MS-SSIM to separately evaluate diversity, but this technique has some issues and hasn’t really caught on. AIS - propose putting a Gaussian observation model on the outputs of a GAN and using annealed importance sampling to estimate the log likelihood under this model, but show that estimates computed this way are inaccurate in the case where the GAN generator is also a flow model The generator being a flow model allows for computation of exact log-likelihoods in this case. .

- propose putting a Gaussian observation model on the outputs of a GAN and using annealed importance sampling to estimate the log likelihood under this model, but show that estimates computed this way are inaccurate in the case where the GAN generator is also a flow model . Geometry Score - suggest computing geometric properties of the generated data manifold and comparing those properties to the real data.

- suggest computing geometric properties of the generated data manifold and comparing those properties to the real data. Precision and Recall - attempt to measure both the ‘precision’ and ‘recall’ of GANs.

- attempt to measure both the ‘precision’ and ‘recall’ of GANs. Skill Rating - have shown that trained GAN discriminators can contain useful information with which evaluation can be performed.

Those are just a small fraction of the proposed GAN evaluation schemes. Although the Inception Score and FID are relatively popular, GAN evaluation is clearly not a settled issue. Ultimately, we think that confusion about how to evaluate GANs stems from confusion about when to use GANs. Thus, we have bundled those two questions into one:

Problem 5 When should we use GANs instead of other generative models? How should we evaluate performance in those contexts?

What should we use GANs for? If you want an actual density model, GANs probably aren’t the best choice. There is now good experimental evidence that GANs learn a ‘low support’ representation of the target dataset , which means there may be substantial parts of the test set to which a GAN (implicitly) assigns zero likelihood.

Rather than worrying too much about this, Though trying to fix this issue is a valid research agenda as well. we think it makes sense to focus GAN research on tasks where this is fine or even helpful. GANs are likely to be well-suited to tasks with a perceptual flavor. Graphics applications like image synthesis, image translation, image infilling, and attribute manipulation all fall under this umbrella.

How should we evaluate GANs on these perceptual tasks? Ideally, we would just use a human judge, but this is expensive. A cheap proxy is to see if a classifier can distinguish between real and fake examples. This is called a classifier two-sample test (C2STs) . The main issue with C2STs is that if the Generator has even a minor defect that’s systematic across samples (e.g., ) this will dominate the evaluation.

Ideally, we’d have a holistic evaluation that isn’t dominated by a single factor. One approach might be to make a critic that is blind to the dominant defect. But once we do this, some other defect may dominate, requiring a new critic, and so on. If we do this iteratively, we could get a kind of ‘Gram-Schmidt procedure for critics’, creating an ordered list of the most important defects and critics that ignore them. Perhaps this can be done by performing PCA on the critic activations and progressively throwing out more and more of the higher variance components.

Finally, we could evaluate on humans despite the expense. This would allow us to measure the thing that we actually care about. This kind of approach can be made less expensive by predicting human answers and only interacting with a real human when the prediction is uncertain .

How does GAN Training Scale with Batch Size?

Large minibatches have helped to scale up image classification — can they also help us scale up GANs? Large minibatches may be especially important for effectively using highly parallel hardware accelerators .

At first glance, it seems like the answer should be yes — after all, the discriminator in most GANs is just an image classifier. Larger batches can accelerate training if it is bottlenecked on gradient noise. However, GANs have a separate bottleneck that classifiers don’t: the training procedure can diverge. Thus, we can state our problem:

Problem 6 How does GAN training scale with batch size? How big a role does gradient noise play in GAN training? Can GAN training be modified so that it scales better with batch size?

There’s some evidence that increasing minibatch size improves quantitative results and reduces training time . If this phenomenon is robust, it would suggest that gradient noise is a dominating factor. However, this hasn’t been systematically studied, so we believe this question remains open.

Can alternate training procedures make better use of large batches? Optimal Transport GANs theoretically have better convergence properties than normal GANs, but need a large batch size because they try to align batches of samples and training data. As a result, they seem like a promising candidate for scaling to very large batch sizes.

Finally, asynchronous SGD could be a good alternative for making use of new hardware. In this setting, the limiting factor tends to be that gradient updates are computed on ‘stale’ copies of the parameters. But GANs seem to actually benefit from training on past parameter snapshots , so we might ask if asynchronous SGD interacts in a special way with GAN training.

What is the Relationship Between GANs and Adversarial Examples?

It’s well known that image classifiers suffer from adversarial examples: human-imperceptible perturbations that cause classifiers to give the wrong output when added to images. It’s also now known that there are classification problems which can normally be efficiently learned, but are exponentially hard to learn robustly .

Since the GAN discriminator is an image classifier, one might worry about it suffering from adversarial examples. Despite the large bodies of literature on GANs and adversarial examples, there doesn’t seem to be much work on how they relate. There is work on using GANs to generate adversarial examples, but this is not quite the same thing. Thus, we can ask the question:

Problem 7 How does the adversarial robustness of the discriminator affect GAN training?

How can we begin to think about this problem? Consider a fixed discriminator D. An adversarial example for D would exist if there were a generator sample G(z) correctly classified as fake and a small perturbation p such that G(z) + p is classified as real. With a GAN, the concern would be that the gradient update for the generator would yield a new generator G’ where G’(z) = G(z) + p.

Is this concern realistic? shows that deliberate attacks on generative models can work, but we are more worried about something you might call an ‘accidental attack’. There are reasons to believe that these accidental attacks are less likely. First, the generator is only allowed to make one gradient update before the discriminator is updated again. In contrast, current adversarial attacks are typically run for tens of iterations . Second, the generator is optimized given a batch of samples from the prior, and this batch is different for every gradient step. Finally, the optimization takes place in the space of parameters of the generator rather than in pixel space. However, none of these arguments decisively rules out the generator creating adversarial examples. We think this is a fruitful topic for further exploration.