Written by Avital Oliver

on May 28, 2018

InfoGAN



We welcome Want to make DFL better?We welcome suggestions and improvements

Thank you to Kumar Krishna Agrawal, Yasaman Bahri, Peter Chen, Nic Ford, Roy Frostig, Xinyang Geng, Rein Houthooft, Ben Poole, Colin Raffel and Supasorn Suwajanakorn for contributing to this guide.

Concepts used in InfoGAN. Click to navigate.

Why

InfoGAN is an extension of GANs that learns to represent unlabeled data as codes, aka representation learning. Compare this to vanilla GANs that can only generate samples or to VAEs that learn to both generate code and samples. Representation learning is an important direction for unsupervised learning and GANs are a flexible and powerful interpretation. This makes InfoGAN an interesting stepping stone towards research in representation learning.

Reproduce in a Notebook





1 Information Theory

Motivation: Information theory formalizes the concept of the “amount of randomness” or “amount of information”. These concepts can be extended to relative quantities among random variables. This section leads to Mutual Information (MI), a concept core to InfoGAN. MI extends entropy to the amount of additional information you yield from observing a joint sample of two random variables as compared to the baseline of observing each variable separately.

Topics:

Entropy Differential Entropy Conditional Entropy Jensen’s Inequality KL divergence Mutual Information

Required Reading:

Optional Reading:

Questions:

From PRML: 1.31, 1.36, 1.37, 1.38, 1.39, 1.41. Solution PRML 1.31: Consider two variables \(x\) and \(y\) having joint distribution \(p(x,y)\). Show that the differential entropy of this pair of variables satisfies \(H(x,y) \le H(x) + H(y)\) with equality if, and only if, \(x\) and \(y\) are statistically independent. If \(p(x)\) and \(p(y)\) are independent then the joint distribution is given by:

\(p(x,y) = p(x)p(y)\) Based on the independent \(p(x)\) and \(p(y)\) the joint entropy can be derived from the conditional entropies \(H(x|y)\) and \(H(y|x)\):

\(H(x|y) = H(x)\)

\(H(y|x) = H(y) \to\)

\(H(x,y) = H(x) + H(y|x) = H(y) + H(x|y) \to\)

\(H(x,y) = H(x) + H(y)\) Therefore, there is no mutual information \(I(x,y)\) if \(p(x)\) and \(p(y)\) are independent:

\(H(x,y) = H(x) + H(y)\ \to\)

\(I(x,y) = H(x) + H(y) - H(x,y) = H(x,y) - H(x,y) = 0\) If \(p(x)\) and \(p(y)\) are dependent:

\(H(x,y) < H(x) + H(y)\ \to\)

\(I(x,y) = H(x) + H(y) - H(x,y) > 0\) The indepent and dependent case can be combined to a general form:

\(H(x,y) \le H(x) + H(y)\ \to\)

\(I(x,y) = H(x) + H(y) - H(x,y) \ge 0\) How is Mutual Information similar to correlation? How are they different? Are they directly related under some conditions? Solution Start here. In classification problems, minimizing cross-entropy loss is the same as minimizing the KL divergence of the predicted class distribution from the true class distribution. Why do we minimize the KL, rather than other measures, such as L2 distance? Solution In classification problem: One natural measure of “goodness” is the likelihood or marginal probability of observed values. By definition, it’s \(P(Y | X; params)\), which is \(\prod_i P(Y_i = y_i | X; params)\). This says that we want to maximize the probability of producing the “correct” \(y_i\) class only, and don’t really care to push down the probability of incorrect class like L2 loss would. E.g., suppose the true label \(y = [0, 1, 0]\) (one-hot of class label {1, 2, 3}), and the softmax of the final layer in NN is \(y’ = [0.2, 0.5, 0.3]\). One could use L2 between these two distributions, but if instead we minimize KL divergence \(KL(y || y’)\), which is equivalent to minimizing cross-entropy loss (the standard loss everyone uses to solve this problem), we would compute \(0 \cdot \log(0) + 1 \cdot \log (0.5) + 0 \cdot \log(0) = \log(0.5)\), which describes exactly the log likelihood of the label being class 2 for this particular training example. Here choosing to minimize KL means we’re maximizing the data likelihood. I think it could also be reasonable to use L2, but we would be maximizing the data likelihood + “unobserved anti-likelihood” :) (my made up word) meaning we want to kill off all those probabilities of predicting wrong labels as well. Another reason L2 is less prefered might be that L2 involves looping over all class labels whereas KL can look only at the correct class when computing the loss.





2 Generative Adversarial Networks (GAN)

Motivation: GANs are framework for constructing models that learn to sample from a probability distribution, given a finite sample from that distribution. More concretely, after training on a finite unlabeled dataset (say of images), a GAN can generate new images from the same “kind” that aren’t in the original training set.

Topics:

JS (Jensen-Shannon) divergence How are GANs trained? Various possible GAN objectives. Why are they needed? GAN training minimizes the JS divergence between the data-generating distribution and the distribution of samples from the generator part of the GAN

Required Reading:

Optional Reading:

Questions:

Prove that \(0 \leq JSD(P||Q) \leq 1\) bit for all P, Q. When are the bounds achieved? Solution Start here. What are the bounds for KL divergence? When are those bounds achieved? Solution Start here. The Kullback–Leibler divergence \(D_{KL}(P||Q)\) between \(P\) and \(Q\) is defined as:

\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2\left(\frac{P(x)}{Q(x)}\right)\) The lower bound is reached when \(P(x) = Q(x)\) because \(\left(\frac{P(x)}{Q(x)}\right) = 1\):

\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2(1) = \sum_{x}P(x) 0 = 0\) The upper bound is reached when \(Q(x)\) is disjoint from \(P(x)\), i.e., \(Q(x)\) is zero where \(P(x)\) is not zero, because then the log-ratio \(\log_2\left(\frac{P(x)}{Q(x)}\right)\) becomes \(\infty\):

\(x_i \in x\)

\(P(x_i) \log_2\left(\frac{P(x_i)}{Q(x_i)}\right) = P(x_i) \log_2\left(\frac{P(x_i)}{0}\right) = \infty \to\)

\(D_{KL}(P||Q) = \sum_{x}P(x) \log_2\left(\frac{P(x)}{Q(x)}\right) = \infty\) In the paper, why do they say “In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, \(log(1 − D(G(z)))\) saturates”? Solution Understanding the vanishing generator gradients point in the GAN paper Implement a Colab that trains a GAN for MNIST. Try both the saturating and non-saturating discriminator loss. Solution An implementation can be found here.





3 The Paper

Motivation: Let’s read the paper. Keep in mind that InfoGAN modifies the original GAN objective in this way:

Split the incoming noise vector z into two parts - noise and code. The goal is to learn meaningful codes for the dataset. In addition to the discriminator, it adds another prediction head to the network that tries to predict the code from the generated sample. The loss is a combination of the normal GAN loss and the prediction loss. This new loss term can be interpreted as a lower bound on the mutual information between the generated samples and the code.

Topics:

The InfoGAN objective Why can’t we directly optimize for the mutual information \(I(c; G(z,c))\) Variational Information Maximization Possible choices for classes of random variables for dimensions of the code c

Reproduce:

Reproduce in a Notebook

Required Reading:

Optional Reading:

Questions: