The variational autoencoder

For variational autoencoders (VAEs) we assume there are underlying factors $z$ with an unknown distribution $p(z)$ which cause the data $x$. The factors generate the data with a distribution $p(x | z)$, which is also unknown to us. Our main goal is, given a data sample $x$, to find the factors which caused that sample, i.e. we want to know $p(z | x)$ (because given the factors we can solve many different tasks, such as classification).

Therefore we use a method called variational inference. We define a density $q_\theta (z | x)$ (parameterized by $\theta$) which we can evaluate, and which we want to be similar to $p(z | x)$, so that in the best case $q_\theta = p$ and, to get $z$, we simply have to evaluate our "surrogate" $q$ instead of $p$. Consequently we want to minimize the Kullback-Leibler divergence (KL divergence) between $q$ and $p$:

$\min_\theta D_{KL} ( q_\theta (z | x) || p ( z | x) )$.

But here comes the sore point: since we don't know $p(z|x)$, we cannot minimize the KL divergence directly. Thus we investigate the KL divergence further. It is defined as

$D_{KL} ( q_\theta (z | x) || p ( z | x) ) = \int_{z} q_\theta (z | x) \log \frac{q_\theta (z | x)}{p(z|x)} dz$,

which can be rewritten as

$D_{KL} ( q_\theta (z | x) || p ( z | x) ) = \int_z q_\theta (z | x) \log q_\theta (z | x) dz - \int_z q_\theta (z | x) \log p(z|x) dz$

$= \int_z q_\theta (z | x) \log q_\theta (z | x) dz - \int_z q_\theta (z | x) \log \frac{p(z,x)}{p(x)} dz$

$= \int_z q_\theta (z | x) \log q_\theta (z | x) dz - \int_z q_\theta (z | x) \log p(z,x) dz + \int_z q_\theta (z | x) \log p(x) dz$.

With

$\int_z q_\theta (z | x) \log p(x) dz = \log p(x) \int_z q_\theta (z | x) dz = \log p(x) \times 1 = \log p(x)$,

we can now express the KL divergence as

$D_{KL} ( q_\theta (z | x) || p ( z | x) ) = \int_z q_\theta (z | x) \log q_\theta (z | x) dz - \int_z q_\theta (z | x) \log p(z,x) dz + \log p(x)$

$= \mathbf{E}_{q_\theta (z | x)} [ \log q_\theta (z | x) ] - \mathbf{E}_{q_\theta (z | x)} [ \log p(z,x) ] + \log p(x)$.

Let us now define

$\mathbf{E}_{q_\theta (z | x)} [ \log q_\theta (z | x) ] - \mathbf{E}_{q_\theta (z | x)} [ \log p(z,x) ] = - \mathcal{L}$,

where $\mathcal{L}$ is also referred to as the variational lower bound. We can now see that

$D_{KL} ( q_\theta (z | x) || p ( z | x) ) = \log p(x) - \mathcal{L}$.

Remember that we want to minimize the KL divergence (and thus make $q_\theta (z | x)$ similar to $p ( z | x)$). Here $\log p(x)$ is the (log) probability of a data sample, which is fixed by our training data (distribution); we have no influence on it and can ignore it for the minimization. Now it becomes quite apparent that in order to minimize the KL divergence we simply have to maximize $\mathcal{L}$.

Let's have a closer look at $\mathcal{L}$:

$\mathcal{L} = - \mathbf{E}_{q_\theta (z | x)} [ \log q_\theta (z | x) ] + \mathbf{E}_{q_\theta (z | x)} [ \log p(z,x) ]$

$= - \mathbf{E}_{q_\theta (z | x)} [ \log q_\theta (z | x) - \log p(z,x) ]$

$= - \int_z q_\theta (z | x) \log \frac{q_\theta (z | x)}{p(x,z)} dz$

$= - \int_z q_\theta (z | x) \log \frac{q_\theta (z | x)}{p(x|z) p(z)} dz$

$= - \int_z q_\theta (z | x) \left( \log \frac{q_\theta (z | x)}{p(z)} - \log p(x|z) \right) dz$

$= - D_{KL}(q_\theta (z | x) || p(z)) + \mathbf{E}_{q_\theta (z | x)} [ \log p(x | z) ]$.

Now we can use a trick, often referred to as the reparameterization trick: we take a deterministic function $g_\phi(x, \epsilon) = z$, so that $z$ is computed deterministically given $x$ and (random) noise $\epsilon$. This allows us to approximate the second term in $\mathcal{L}$:

$\mathcal{L}^\mathcal{B} = - D_{KL}(q_\theta (z | x) || p(z)) + \frac{1}{L} \sum_{i=1}^L \log p(x | z^{(i)})$,

where each $z^{(i)}$ is computed using $g_\phi$ with randomly drawn noise $\epsilon^{(i)} \sim p(\epsilon)$ (for different noise distributions see [1]); in practice $L = 1$ already gives good results. We can now maximize $\mathcal{L}$ by minimizing $D_{KL}(q_\theta (z | x) || p(z))$ and maximizing $\log p(x | z)$.
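As a small illustration, here is a minimal sketch of the reparameterization trick for the Gaussian case that is used below; the values of $\mu(x)$ and $\sigma(x)$ are made-up placeholders, not the output of a real encoder:

```python
import numpy as np

# Placeholder encoder outputs mu(x) and sigma(x) for one data point x;
# in a VAE these would come from a neural network.
mu_x = np.array([0.3, -0.1])
sigma_x = np.array([0.8, 1.2])

# Draw the noise from a fixed distribution: eps ~ p(eps) = N(0, I).
eps = np.random.standard_normal(mu_x.shape)

# z = g_phi(x, eps) is a deterministic function of x (through mu_x, sigma_x) and eps.
z = mu_x + sigma_x * eps
```

The point is that all randomness is pushed into $\epsilon$: $z$ depends on the encoder only through the deterministic function $g_\phi$, which is what makes the Monte Carlo estimate of the reconstruction term differentiable with respect to the encoder parameters.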
So far so good; up until now we have made hardly any approximations or assumptions, so the formulas are valid in most cases. Now let's see how we can transfer this to an autoencoder.

First, let us use a neural network to compute $z = g_\phi(x, \epsilon)$. We choose $q_\theta (z | x)$ to be Gaussian, so our network will output a mean $\mu$ and a variance $\sigma^2$ (in practice often $\log \sigma^2$). Now $z \sim \mathcal{N}(\mu, \sigma^2)$ can be sampled via $z = g_\phi(x, \epsilon) = \mu + \sigma \epsilon$, where $\epsilon$ is randomly drawn from $\mathcal{N}(0, I)$. Thus $g$ is given as $g_\phi(x, \epsilon) = \mu + \sigma \epsilon$, where $\mu = \mu(x)$ and $\sigma = \sigma(x)$ are computed deterministically as the output of a neural net given $x$ as input, and $\epsilon \sim \mathcal{N}(0, I)$ is noise with zero mean and unit variance. Now we can see how this part resembles the encoder of an autoencoder.

How can we use this to evaluate $\log p(x | z)$? Let's make a further assumption (quite similar to the previous part): we assume $p(x | z)$ to also be a (multivariate) normal distribution, i.e. $x \sim \mathcal{N}(\mu(z), \sigma(z)^2)$, where $\mu$ and $\sigma$ are calculated given $z$. Let's further assume $\sigma$ to be independent of $z$, i.e. $\sigma(z) = \sigma$, so $x \sim \mathcal{N}(\mu(z), \sigma^2)$. Keep in mind that we want to maximize $\mathcal{L}$. Since $p(x | z)$ is (or rather, we chose it to be) a normal distribution with fixed $\sigma$, i.e.

$p(x | z) = \frac{1}{Z} e^{- \frac{|| x - \mu(z) ||^2}{2 \sigma^2}}$,

maximizing $p(x | z)$ is equivalent to minimizing $|| x - \mu(z) ||^2$. So, up to constants and scaling, we can express $\log p(x | z)$ by $- || x - \mu(z) ||^2$.

All that we are left with now is to compute $\mu(z)$. Let's simply use a neural network again, which takes $z$ as input and outputs $\mu(z)$, which we shall call $x' = \mu(z)$. This network is now the equivalent of the decoder in an autoencoder, and as error function we simply choose the mean squared error (MSE) $|| x - x' ||^2$. So, after making a few assumptions, we have replaced $\mathbf{E}_{q_\theta (z | x)} [ \log p(x | z) ]$ by an autoencoder (with noise $\epsilon$ added to the middle layer) and a squared error $|| x - x' ||^2$ which we want to minimize:

$\max \mathcal{L}^{AE} \iff \max \left( - D_{KL}(q_\theta (z | x) || p(z)) - || x - x' ||^2 \right)$

$\iff \min \left( D_{KL}(q_\theta (z | x) || p(z)) + || x - x' ||^2 \right)$.

Now let's consider $D_{KL}(q_\theta (z | x) || p(z))$. Let us make the "last" assumption, namely that $p(z)$ is a normal distribution as well, with zero mean and unit variance: $z \sim \mathcal{N}(0, I)$. This allows us to calculate the KL divergence analytically, since we previously assumed $q_\theta (z | x)$ to be a normal distribution as well (see [1] for the complete derivation):

$D_{KL}(q_\theta (z | x) || p(z)) = - \frac{1}{2} \sum_{j=1}^{J} ( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 )$.

This gives us our final minimization objective / error function:

$\min \; || x - x' ||^2 - \frac{1}{2} \sum_{j=1}^{J} ( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 )$.

The KL term can also be interpreted as a regularization term for the autoencoder, "pulling" the activations in the hidden layer towards zero mean and unit variance. This also bears some resemblance to sparse autoencoders.
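To make this concrete, here is a minimal PyTorch sketch of the objective; the layer sizes, the single hidden layer, and the helper name `vae_loss` are illustrative choices rather than anything prescribed by the derivation above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        # Encoder: x -> (mu(x), log sigma^2(x)), the parameters of q_theta(z|x)
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        # Decoder: z -> x' = mu(z)
        self.dec = nn.Linear(z_dim, h_dim)
        self.out = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        x_rec = self.out(torch.relu(self.dec(z)))
        return x_rec, mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    # Reconstruction term ||x - x'||^2 ...
    rec = F.mse_loss(x_rec, x, reduction='sum')
    # ... plus the analytic KL term
    # D_KL(q_theta(z|x) || N(0, I)) = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

With this, `loss = vae_loss(x, *model(x))` can be minimized directly with any optimizer; how the two terms are weighted, and whether they are summed or averaged over the batch, is left open here.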
Final thoughts and outlook

Here we have seen how to find a hidden/latent representation, i.e. the causal factors, for our data, and how we can use autoencoders to obtain it. But to press our nice "untouched" variational theory into an autoencoder, we had to use a number of restrictive assumptions. One very nice approach, recently published by Rosca et al. [2], tries to loosen many of these constraints: they essentially replace all of the assumed normal distributions by GANs, using a trick they call the "density ratio trick", which uses the output of the discriminator in a GAN as an estimate of a ratio of densities.

References:

[1] Auto-Encoding Variational Bayes, Kingma et al., 2013, https://arxiv.org/pdf/1312.6114.pdf

[2] Variational Approaches for Auto-Encoding Generative Adversarial Networks, Rosca et al., 2017, https://arxiv.org/pdf/1706.04987v1.pdf