In the second part of this series, we looked at methods to combat the non-differentiability issue in text generation GANs using Reinforcement Learning (RL). In case you’re wondering what this issue of non-differentiability is, I suggest you look at the first part of the series where I discuss this in detail.

However, as I mentioned at the end of the previous part, RL-based methods have several issues:

- High-variance gradient estimates, which make the training process unstable and slow.
- Convergence to highly sub-optimal local minima.
- An extremely large state-action space, which leads to large portions of the space remaining unexplored (related to the previous point).
- Generally poor sentence quality, resulting from a combination of the above factors.

With these issues in mind, researchers have been looking for efficient ways to train text GANs that don’t rely on RL. And while it’s difficult to say if they have achieved any significant improvement in terms of the quality of generated samples, it is instructive to look at these methods and gain an alternative perspective on how to approach this problem.

Let’s dive in.

Method 1: Using the Gumbel-softmax distribution

This method is based on the ideas proposed in “GANs for Sequences of Discrete Elements with the Gumbel-softmax Distribution”.

Let’s briefly revisit the “differentiability problem” that we discussed in the first part of the series.

During the decoding process of the generator, we need to sample a word from the multinomial probability distribution produced by the softmax function at every time-step. The issue is that we were "sampling" from this distribution by simply picking the word with the maximum probability, i.e., taking the argmax(). This is a non-differentiable operation.
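To see this concretely, here's a tiny PyTorch snippet (the tensor shape is arbitrary, just for illustration):

```python
import torch

h = torch.randn(5, requires_grad=True)  # unnormalized scores from the generator
p = torch.softmax(h, dim=-1)            # differentiable so far
word_id = torch.argmax(p)               # discrete pick: the computation graph is cut here
print(word_id.requires_grad)            # False: no gradient flows back through argmax
```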

To find a solution to this, let's first formulate our problem as follows. We want to generate a |V|-dimensional one-hot vector y given the |V|-dimensional vector of unnormalized scores h (typically a linear projection of the RNN's hidden state). The standard way of "sampling" from this is to generate a vector of probabilities p as:

p = softmax(h), i.e., p_i = exp(h_i) / Σ_j exp(h_j)

And then pick the word corresponding to the maximum probability (notice how this isn't even "sampling", because it's deterministic). Instead, let's look at another way of sampling from p. It has been shown that sampling y according to p is equivalent to computing:

y = one_hot(argmax_i (h_i + g_i))

where h is as defined above and g is a vector of samples drawn independently from the Gumbel(0, 1) distribution. But again, argmax() isn't differentiable. So we approximate it using softmax(), with an additional temperature parameter τ, as follows:

y_i = exp((h_i + g_i) / τ) / Σ_j exp((h_j + g_j) / τ)

Now, since this is just a softmax function, it looks like it’s just going to give us a probability distribution again. So, how does this help us get a one-hot vector?

Notice that as τ → 0, the softmax sharpens towards the argmax, so y becomes an arbitrarily close approximation of a one-hot vector. We now have a way of sampling (approximately) one-hot vectors that is differentiable with respect to the scores h!
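To make this concrete, here's a minimal PyTorch sketch of the sampling step described above. The function name and the small epsilons for numerical stability are my own choices; note that recent versions of PyTorch also ship torch.nn.functional.gumbel_softmax, which implements the same idea.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(h: torch.Tensor, tau: float) -> torch.Tensor:
    """Differentiable approximation of sampling a one-hot vector from softmax(h)."""
    # Sample Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
    u = torch.rand_like(h)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    # Softmax over the perturbed scores; lower tau pushes the output towards one-hot.
    return F.softmax((h + g) / tau, dim=-1)
```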

Now, to train a generator for discrete data that uses the above sampling function, we can start with a large value of τ and gradually anneal it towards 0 as training progresses. This way, the output of the generator eventually becomes a sequence of near one-hot vectors that can be fed to the discriminator directly.
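For instance, a simple exponential annealing schedule might look like this (the constants are purely illustrative, not from the paper; gumbel_softmax_sample is the sketch from above):

```python
import math
import torch

tau_start, tau_min, anneal_rate = 5.0, 0.1, 1e-4  # illustrative values

for step in range(10_000):
    tau = max(tau_min, tau_start * math.exp(-anneal_rate * step))
    logits = torch.randn(32, 5000)            # stand-in for the generator's scores h
    y = gumbel_softmax_sample(logits, tau)    # approaches one-hot as tau shrinks
    # ... feed y to the discriminator and backpropagate through the sampling ...
```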

Method 2: Using an Auto-encoder

This method is based on the paper "Adversarial Text Generation Without Reinforcement Learning".

Auto-encoders

For those of you who are new to the concept of an auto-encoder, it is a network that consists of two parts — the encoder and the decoder. Given an input, the job of the encoder is to project it to a space of (much) smaller dimensionality, and the job of the decoder is to reconstruct the input given this compressed representation. As you might suspect, auto-encoders are useful for learning compact representations that capture the most salient features of the input, since the decoder must be able to reconstruct the input from them.

An auto-encoder
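To make the two roles concrete, here is a minimal sketch of a sequence auto-encoder in PyTorch. The GRU layers, dimensions, and teacher-forced decoding are my own simplifications, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SentenceAutoencoder(nn.Module):
    """Encode a sentence to a single vector, then reconstruct the token sequence."""

    def __init__(self, vocab_size: int, emb_dim: int = 128, latent_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, latent_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, vocab_size)

    def encode(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) -> sentence vector: (batch, latent_dim)
        _, h = self.encoder(self.embed(tokens))
        return h.squeeze(0)

    def decode(self, z: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # Teacher-forced reconstruction, conditioned on the sentence vector z.
        out, _ = self.decoder(self.embed(tokens), z.unsqueeze(0))
        return self.out(out)  # logits over the vocabulary at each time-step
```

Training it is just a matter of minimizing the usual cross-entropy between the reconstruction logits and the input tokens.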

Connecting them to GANs

Now, let’s look at how we can actually use auto-encoders to solve our original problem.

Let’s look at the problem of text generation from a completely different perspective. So far, we’ve been looking at the process of text generation as the sequential generation of discrete tokens (words). The sampling of these tokens from discrete spaces is causing the entire problem. So how about we throw out discrete spaces altogether?

How would that work?

Say we want to generate a single sentence. Instead of looking at sentence generation as sampling a sequence of tokens, we look at it as sampling a single 'sentence vector' from the continuous space of all sentence vectors. And instead of decoding the sentence vector into a sequence of discrete tokens like we did previously, we continue to work with the continuous sentence vector. Simply put, our generator is now responsible for generating sentence vectors, not human-readable sentences.

Great. So now we need our generator to produce sentence vectors that look "real" (decode to a valid English sentence), and our discriminator to distinguish between "real" and "fake" (decode to an invalid sentence) sentence vectors. But where do we get these "real" sentence vectors from?

This is where our auto-encoder comes in. Before we start working with the GAN, we first train an auto-encoder on a large corpus of real sentences. Then, while training the GAN, to get “real” samples we input real sentences to the encoder of the auto-encoder and get the corresponding sentence vectors. The “fake” samples are just the sentence vectors output by the generator network of the GAN.
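Here is a rough sketch of what one training step might look like, assuming the SentenceAutoencoder sketch from above. The two-layer MLPs and the vanilla BCE GAN loss are simplifications for illustration; the paper's exact architecture and objective may differ (e.g., a Wasserstein-style loss):

```python
import torch
import torch.nn as nn

latent_dim, noise_dim = 64, 32

# Generator maps noise to a fake "sentence vector"; discriminator scores vectors.
G = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
D = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

def gan_step(real_sentences, autoencoder):
    with torch.no_grad():
        real_vecs = autoencoder.encode(real_sentences)   # "real" sentence vectors
    n = real_vecs.size(0)
    fake_vecs = G(torch.randn(n, noise_dim))             # "fake" sentence vectors

    # Discriminator: push real vectors towards 1 and fake vectors towards 0.
    d_loss = (bce(D(real_vecs), torch.ones(n, 1)) +
              bce(D(fake_vecs.detach()), torch.zeros(n, 1)))

    # Generator: fool the discriminator into scoring fakes as real.
    g_loss = bce(D(fake_vecs), torch.ones(n, 1))
    return d_loss, g_loss
```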

Once the GAN is trained, we generate sentences by using the generator network to obtain sentence vectors, which are then decoded by the decoder of the auto-encoder into human-readable sentences.
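Continuing with the same sketches, generation could look something like this, assuming hypothetical start_id/end_id special tokens and simple greedy decoding:

```python
def generate_sentence(autoencoder, start_id, end_id, max_len=20):
    with torch.no_grad():
        z = G(torch.randn(1, noise_dim))                 # sample a "fake" sentence vector
        hidden = z.unsqueeze(0)                          # use it as the decoder's initial state
        token = torch.tensor([[start_id]])
        ids = []
        for _ in range(max_len):
            out, hidden = autoencoder.decoder(autoencoder.embed(token), hidden)
            token = autoencoder.out(out).argmax(dim=-1)  # greedy pick at each time-step
            if token.item() == end_id:
                break
            ids.append(token.item())
    return ids  # map back to words using your vocabulary
```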

The following figure from the paper summarizes the entire architecture.

Figure taken from the paper “Adversarial Text Generation Without Reinforcement Learning”

Other works

Although there have been other papers that attempt to address the same issue without using Reinforcement Learning, they usually build upon one of the two ideas above. That is:

1. Using the Gumbel-softmax as a continuous, differentiable approximation of discrete sampling.
2. Working directly with the continuous output of the generator.

Examples of such papers:

- "RelGAN: Relational Generative Adversarial Networks for Text Generation" (Nie et al.), which is based on idea 1, along with modifications to the generator to model long-term dependencies in the text more effectively.
- "Adversarial Generation of Natural Language" (Rajeswar et al.), which is based on idea 2 and was heavily criticized by Yoav Goldberg in a famous blog post.

Wrapping it up

With that, we come to the end of this three-part series on text GANs. Through this series, I've tried to provide an overview as well as a conceptual understanding of the different ideas people are trying in order to train text GANs effectively. So if you found any of these ideas interesting, I encourage you to dive deeper: read the referenced papers and try implementing them. Research in the general area of adversarial text generation is still at a fairly nascent stage, so the problems are hard and exciting, and the potential is huge.

Hope you enjoyed the series!

(I am open to feedback as well as requests for future articles/series on any specific topics. So feel free to hit me up!)
