Adversarial Variational Bayes in PyTorch¶

In the previous post, we implemented a Variational Autoencoder and pointed out a few of its shortcomings. The overlap between classes was a key one, and the normality assumption is also somewhat constraining.

In this post, I implement the recent paper Adversarial Variational Bayes in PyTorch. This addresses some of the issues with VAEs, and also provides some interesting links to GANs, the other popular approach to generative modelling.

Before diving in, if you are not familiar with VAEs, I suggest you take some time to recap my previous post on the theory of VAEs.

The main modification proposed by the AVB paper is to change the encoder from a parameterized Gaussian to a fully implicit distribution. In a vanilla VAE, the encoder network takes a data point, x, as input and outputs the mean and variance of a normal distribution. We then use the reparameterization trick on standard normal samples to sample values of the latent variables.

The first modification is that instead of outputting a mean and variance, the encoder now returns a z value directly. The inputs are now data points, x, and standard Gaussian noise. So our encoder network learns how to incorporate the random noise to generate a sample from the approximate posterior directly.
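A minimal sketch of such an implicit encoder, assuming illustrative dimensions (`x_dim`, `noise_dim`, `z_dim` are my choices, not values from the paper): the network concatenates the data point with fresh Gaussian noise and maps the pair straight to a latent sample.

```python
import torch
import torch.nn as nn

class ImplicitEncoder(nn.Module):
    """Maps (x, noise) directly to a latent sample z.

    Unlike a vanilla VAE encoder, there is no (mu, sigma) head:
    the network itself learns to shape the noise into a posterior sample.
    Dimensions here are illustrative, not taken from the paper.
    """
    def __init__(self, x_dim=4, noise_dim=8, z_dim=2, hidden=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(x_dim + noise_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, z_dim),  # outputs z directly
        )

    def forward(self, x):
        # Fresh standard Gaussian noise for every call.
        eps = torch.randn(x.size(0), self.noise_dim, device=x.device)
        return self.net(torch.cat([x, eps], dim=1))

encoder = ImplicitEncoder()
z = encoder(torch.randn(16, 4))  # one posterior sample per data point
```

Because the noise enters a hidden layer rather than a final affine transform, the induced posterior is no longer restricted to be Gaussian.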

This is most easily shown by figure 2 in the original paper.

Implicit Likelihood Ratio¶

Once we have made the above change, we have a problem. The ELBO:

$\max_{\phi, \theta} \ E_{p_{\mathcal{D}}}[\mathcal{L}] = \max_{\phi, \theta} \ E_{p_{\mathcal{D}}}\big[E_{q_{\phi}(Z \mid X)}[\log p_{\theta}(X \mid Z) + \log p(Z) - \log q_{\phi}(Z \mid X)]\big]$

contains probabilities under the approximate posterior, q. Now that we have an implicit model for q, we can no longer evaluate the probability of a sample. To deal with this problem, we use the idea from Learning in Implicit Generative Models, covered in a previous post: use a discriminator to approximate the log ratio of q and the prior.

We introduce a second network, $T(x,z)$, which outputs a single value for each sample. If we label the samples from the posterior as class 1, and samples from the prior as class 0, we can then pass the output of T through a sigmoid and train it using binary cross entropy, in exactly the same way as in my previous post on Discriminators as likelihood ratios. In this case, the output of T (without the sigmoid) is the logit, which at the optimum of the discriminator equals the log ratio of the two distributions, $\log q_{\phi}(Z \mid X) - \log p(Z)$. We can therefore substitute $-T(X,Z)$ directly for the ratio term in the ELBO, $\log p(Z) - \log q_{\phi}(Z \mid X)$.
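The discriminator construction above can be sketched as follows. This is my own minimal version, assuming the same illustrative dimensions as before; the posterior samples here are stand-ins for the implicit encoder's output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """T(x, z): one logit per (x, z) pair.

    Class 1 = z drawn from the posterior q(z|x), class 0 = z from the prior p(z).
    Dimensions are illustrative, not from the paper.
    """
    def __init__(self, x_dim=4, z_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=1)).squeeze(-1)

T = Discriminator()
x = torch.randn(16, 4)
z_post = torch.randn(16, 2)   # stand-in for samples from q(z|x)
z_prior = torch.randn(16, 2)  # samples from the standard normal prior p(z)

# Binary cross entropy on the raw logits: posterior pairs labelled 1, prior pairs 0.
logits = torch.cat([T(x, z_post), T(x, z_prior)])
labels = torch.cat([torch.ones(16), torch.zeros(16)])
disc_loss = F.binary_cross_entropy_with_logits(logits, labels)
```

At the optimum of this classification problem, `T(x, z)` recovers the log density ratio between the two sources of (x, z) pairs.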

This gives us two loss functions that we optimise in an iterative process. The discriminator is trained by minimising the binary cross entropy, and the encoder and decoder are trained by maximising the ELBO, but with the discriminator's estimate in place of the analytical $\log p(Z) - \log q_{\phi}(Z \mid X)$.

$\max_{T} \ \mathcal{L}_{D} = E_{p_{\mathcal{D}}}\big[E_{q_{\phi}(z \mid x)}[\log \sigma(T(X, Z))] + E_{p(z)}[\log(1 - \sigma(T(X, Z)))]\big]$

$\max_{\phi, \theta} \ \mathcal{L}_{G} = E_{p_{\mathcal{D}}}\big[E_{q_{\phi}(z \mid x)}[\log p_{\theta}(X \mid Z) - T(X, Z)]\big]$
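The two objectives can be alternated in a training loop like the following sketch. Everything here is a deliberately tiny stand-in (single linear layers, made-up dimensions, a random batch in place of a data loader), just to show the shape of the iteration: a discriminator step on detached posterior samples, then an encoder/decoder step using T's output in place of the intractable ratio term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins for the three networks; dimensions are illustrative.
x_dim, z_dim, noise_dim = 4, 2, 4
enc = nn.Linear(x_dim + noise_dim, z_dim)   # implicit encoder q(z|x)
dec = nn.Linear(z_dim, x_dim)               # decoder p(x|z)
T = nn.Linear(x_dim + z_dim, 1)             # discriminator T(x, z)

opt_vae = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
opt_disc = torch.optim.Adam(T.parameters(), lr=1e-3)

x = torch.randn(32, x_dim)  # stand-in batch from the data distribution p_D

for step in range(5):
    # --- Discriminator step: BCE with posterior samples as class 1, prior as class 0.
    eps = torch.randn(x.size(0), noise_dim)
    z_post = enc(torch.cat([x, eps], dim=1)).detach()  # don't backprop into encoder here
    z_prior = torch.randn(x.size(0), z_dim)
    logits = torch.cat([T(torch.cat([x, z_post], dim=1)),
                        T(torch.cat([x, z_prior], dim=1))]).squeeze(-1)
    labels = torch.cat([torch.ones(x.size(0)), torch.zeros(x.size(0))])
    disc_loss = F.binary_cross_entropy_with_logits(logits, labels)
    opt_disc.zero_grad()
    disc_loss.backward()
    opt_disc.step()

    # --- Encoder/decoder step: maximise E_q[log p(x|z) - T(x, z)], i.e. minimise
    # reconstruction error plus T's estimate of the ratio term.
    eps = torch.randn(x.size(0), noise_dim)
    z = enc(torch.cat([x, eps], dim=1))
    recon = dec(z)
    # Gaussian likelihood up to a constant -> mean squared error.
    gen_loss = F.mse_loss(recon, x) + T(torch.cat([x, z], dim=1)).mean()
    opt_vae.zero_grad()
    gen_loss.backward()
    opt_vae.step()
```

Note the `.detach()` in the discriminator step: each player only updates its own parameters, mirroring the alternating optimisation used in GAN training.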

In practice¶

I continue with the toy example used in the VAE post.