Today I am going to discuss a recent paper which I read and presented to some of my friends. I found the idea of the paper so simple that I felt it could be understood by anybody with only a little knowledge of Generative Adversarial Networks (GANs), like me.

This is a recent paper by the GANfather, Ian Goodfellow, and a few other great researchers. The paper is titled Self-Attention Generative Adversarial Networks, or SAGAN for short.

Most of the good image results from GANs come by utilizing a monster named Convolutional Neural Networks (CNNs). Like all monsters, this one also has some weaknesses, which we will discuss further. Most of the good GAN-generated images we have seen cover a single class or very few classes. The model behind the image below was trained only on a dataset of celebrity faces.

Source: (YouTube) Progressive growing of GANs

The Problem

Convolutional GANs (let’s call them CGANs for short from here on) have difficulty learning the image distributions of diverse multi-class datasets like ImageNet. Researchers observed that these CGANs have more difficulty modeling some image classes than others when trained on multi-class datasets. CGANs could easily generate images with a simple geometry, like oceans or skies, but failed on images with a specific geometry, like dogs, horses and many more. A CGAN was able to generate the texture of a dog’s fur but unable to generate distinct legs.

Why is this problem arising?

This problem arises because convolution is a local operation whose receptive field depends on the spatial size of the kernel. In a convolution operation, it is not possible for an output at the top-left position to have any relation to the output at the bottom-right. In the image below, we can see that the output “-8” is computed from the top-left pixels of the image and has no relation to any other part of the image. Similarly, any part of the convolution output is related only to the small local region of the image from which it is computed.

You might ask: can’t we make the kernel’s spatial size bigger so that it captures more of the image? Yes, of course we can, but it would sacrifice the computational efficiency achieved by smaller filters and make the operation slow. Then you might ask: can’t we make a deep CGAN with smaller filters, so that the later layers have a large receptive field? Yes, we can, but it would take too many layers to build a large enough receptive field, and too many layers means too many parameters, which makes GAN training more unstable.
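To make that trade-off concrete, here is a tiny sketch (my own illustration, not from the paper) of how slowly the receptive field of stacked 3x3, stride-1 convolutions grows: each layer only adds 2 pixels.

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of a stack of stride-1 convolutions.

    Each layer grows the receptive field by (kernel - 1) pixels.
    """
    r = 1
    for _ in range(num_layers):
        r += kernel - 1
    return r

# Layers needed before one output pixel can "see" a whole 128x128 image:
layers = 0
while receptive_field(layers) < 128:
    layers += 1
print(layers)  # 64 layers of 3x3 convolutions
```

Sixty-four convolutional layers just to relate opposite corners of a modest image is exactly the parameter explosion the paragraph above warns about.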

Self-Attention GANs

The solution that keeps computational efficiency and achieves a large receptive field at the same time is self-attention. It strikes a balance between efficiency and long-range dependencies (i.e., large receptive fields) by utilizing the famous mechanism from NLP called attention.

What is this Attention?

It is one of the simplest things to understand. Let’s suppose there are three persons named query, key and value. Attention is when the query and key decide how much the value can speak to the outer world. In deep learning everything is a vector, so the three persons are actually three vectors. The query and key are multiplied in such a way that they create a vector of probabilities, which decides how much of the value to expose to the next layer. And yes, that’s what attention is in its entirety.

Let’s understand it using a diagram. The diagram is self-explanatory: Q (query) and K (key) undergo a matrix multiplication, the result passes through a softmax which converts it into a probability distribution, and that distribution finally gets multiplied by V (value).
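The whole mechanism fits in a few lines of NumPy. Here is a sketch with toy shapes of my own choosing (not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # each row of the weight matrix is a probability distribution that
    # says how much each query position listens to each key position
    weights = softmax(Q @ K.T, axis=-1)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 positions, 8-dim queries
K = rng.standard_normal((4, 8))   # 4 positions, 8-dim keys
V = rng.standard_normal((4, 16))  # 4 positions, 16-dim values
out = attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Every output position is now a weighted mix of *all* value positions, which is precisely the long-range dependency that convolution lacks.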

Then what is this self-attention thing? In self-attention, the query, the key and the value are all derived from the same input.

The Model

If you understood the part above, then understanding this is a piece of cake. This is the self-attention layer proposed in the paper. On the left of the image below, we get our feature maps from the previous convolutional layer. Let’s suppose they have dimensions (512 x 7 x 7), where 512 is the number of channels and 7 is the spatial dimension. We first pass the feature map through three 1x1 convolutions separately. We name the three filters f, g and h.

Source: SAGAN paper

What a 1x1 convolution does is reduce the number of channels in the image. A 1x1 filter actually has dimensions (#channels in previous layer x 1 x 1). The f and g convolutions have 64 such filters each, so their filter dimensions become (64 x 512 x 1 x 1); h has 512 of those filters. After the feature map is passed through them, we get three feature maps of dimensions (64 x 7 x 7), (64 x 7 x 7) and (512 x 7 x 7). Guess what these three are: our query, key and value.

In order to perform self-attention over the complete image, we flatten out the last two dimensions, so the shapes become (64 x 49), (64 x 49) and (512 x 49). Now we can perform self-attention over them. We transpose the query, matrix-multiply it by the key and take the softmax over all the rows, which gives an attention map of shape (49 x 49). Then we matrix-multiply the value with the attention map, and the output has shape (512 x 49).

One last thing the paper proposes is to multiply the final output by a learnable scale parameter and add back the input as a residual connection. Let’s say x is the input and o is the output of the attention step; we multiply o by a parameter y, so the final output becomes O = y*o + x. The paper advises initializing the scale parameter y to zero, so that at the beginning the layer behaves like plain old convolution. They initialized y with zero because they wanted the network to rely on the cues in the local neighborhood first, since that is easier, and then gradually learn to assign a value to y and use self-attention.
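Putting those shapes together, here is a minimal NumPy sketch of the layer (my own illustration with random weights, not the paper’s code). The 1x1 convolutions are written as plain matrix multiplies over the channel dimension, and the learnable scale parameter (y above, γ in the paper) is called gamma here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_layer(x, Wf, Wg, Wh, gamma):
    # x: (C, H, W) feature map; a 1x1 conv over channels is just a matmul
    C, H, W = x.shape
    flat = x.reshape(C, H * W)            # (512, 49)
    f = Wf @ flat                         # query: (64, 49)
    g = Wg @ flat                         # key:   (64, 49)
    h = Wh @ flat                         # value: (512, 49)
    attn = softmax(f.T @ g, axis=-1)      # (49, 49), each row sums to 1
    o = h @ attn.T                        # (512, 49)
    return (gamma * o + flat).reshape(C, H, W)  # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 7, 7))
Wf = rng.standard_normal((64, 512)) * 0.01   # f: 64 filters of size 512x1x1
Wg = rng.standard_normal((64, 512)) * 0.01   # g: 64 filters
Wh = rng.standard_normal((512, 512)) * 0.01  # h: 512 filters
y = self_attention_layer(x, Wf, Wg, Wh, gamma=0.0)
print(np.allclose(y, x))  # True: with gamma = 0 the layer is an identity
```

The last line demonstrates the initialization trick: with gamma set to zero, the layer passes the convolutional features through untouched, and attention only kicks in as training increases gamma.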

This layer helps the network capture fine details from even distant parts of the image. And remember: it does not replace convolution; rather, it is complementary to the convolution operation.

Source: SAGAN paper

The loss function used is just the hinge version of the adversarial loss. The paper does not give a detailed justification for this particular choice of loss.

Here z is the latent vector from which the image is generated, x is a real image from the data and y is its class label. The generator loss pushes the generator to create ever more realistic images by fooling the discriminator, while the discriminator tries to become better at classifying real and fake images.
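For the curious, the hinge losses can be sketched in a few lines of NumPy (the discriminator outputs below are toy numbers of my own choosing):

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    # discriminator: push real scores above +1 and fake scores below -1;
    # samples already beyond the margin contribute zero loss
    return np.mean(np.maximum(0.0, 1.0 - d_real)) + \
           np.mean(np.maximum(0.0, 1.0 + d_fake))

def g_hinge_loss(d_fake):
    # generator: simply raise the discriminator's score on fakes
    return -np.mean(d_fake)

d_real = np.array([1.5, 0.2])   # hypothetical discriminator outputs on reals
d_fake = np.array([-2.0, 0.5])  # hypothetical outputs on generated images
print(d_hinge_loss(d_real, d_fake))  # 0.4 + 0.75 = 1.15
print(g_hinge_loss(d_fake))          # 0.75
```

Note how the real sample scored 1.5 and the fake scored -2.0 add nothing to the discriminator loss: once a sample is past the margin, the hinge stops pushing on it.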

Few Details from the paper

a) They used this self-attention layer in both the generator and the discriminator.

b) They applied spectral normalization to the weights in both the generator and the discriminator, unlike previous work, which normalized only the discriminator’s weights. They set the spectral norm to 1 to constrain the Lipschitz constant of the weights; it is simply used for controlling the gradients. The idea of spectral normalization was first introduced by Miyato et al.
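A minimal sketch of spectral normalization via power iteration (NumPy, my own illustration; real implementations keep the estimate vector u between training steps instead of re-running the iteration from scratch):

```python
import numpy as np

def spectral_normalize(W, n_iters=30):
    # power iteration estimates the largest singular value of W
    rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v          # approximate top singular value
    return W / sigma           # largest singular value becomes ~1

W = np.random.default_rng(1).standard_normal((8, 4))
Wn = spectral_normalize(W)
print(np.linalg.svd(Wn, compute_uv=False)[0])  # ≈ 1.0
```

Dividing every weight matrix by its top singular value caps its Lipschitz constant at roughly 1, which is the gradient-controlling effect mentioned above.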

c) They used a two-timescale update rule (TTUR), which simply means using different learning rates for the discriminator and the generator.
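In code, TTUR amounts to nothing more than two learning rates. The values below are the ones reported in the paper; the plain-SGD helper is a hypothetical simplification, as the paper actually trains with Adam:

```python
# TTUR: the discriminator learns faster than the generator
lr_d, lr_g = 4e-4, 1e-4  # paper's values: 0.0004 for D, 0.0001 for G

def sgd_step(params, grads, lr):
    # hypothetical plain-SGD update, just to show where the two rates go
    return [p - lr * g for p, g in zip(params, grads)]

d_params = sgd_step([0.5], [1.0], lr_d)  # discriminator takes larger steps
g_params = sgd_step([0.5], [1.0], lr_g)
```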

d) The metrics used in the paper are the Inception Score (IS, higher is better) and the Fréchet Inception Distance (FID, lower is better).

Results

The paper shows through experiments how spectral normalization and TTUR helped the GAN converge better. A picture of the same is shown below.

Source: SAGAN Paper

We can see the evaluation metrics IS and FID in all three cases. Training is very unstable when the spectral norm is applied only to the discriminator’s weights. Even when we apply the spectral norm to both the generator and the discriminator, the scores deviate at around 200k iterations, but with TTUR this does not happen.

The best part about the paper is its results: it beats the previous state of the art by a large margin.

Source: SAGAN paper

And finally let’s see the generated images by the Self-Attention GANs.

Source: SAGAN Paper

Thank you so much for reading this far. If you liked the article, please leave some claps. I hope you found it useful. Please read the paper for further details; believe me, it’s an easy read.