Why Learn a Common Latent Space?

The latent space captures the features of the data (in our case, the data is sentences). So if it were possible to learn a space that produces the same features whether language A or language B is fed to it, translation between them would become possible. Since the model has learned the right ‘features’, encoding with language A’s encoder and decoding with language B’s decoder would effectively amount to asking it to do a translation.

As you may have guessed, the authors utilised denoising auto-encoders to learn a feature space. They also figured out how to make the auto-encoders learn a common latent space (they call it an aligned latent space) in order to perform unsupervised machine translation.

Denoising Auto-Encoders in Language

The authors used a denoising auto-encoder to learn the features in an unsupervised manner. The loss they define is:

$$\mathcal{L}_{auto}(\theta, \ell) = \mathbb{E}_{x \sim D_\ell,\; \hat{x} \sim d(e(C(x),\, \ell),\, \ell)}\big[\Delta(\hat{x}, x)\big]$$

Equation 1.0: Denoising Auto-Encoder loss

Explanation of Equation 1.0

l is the language (for this setup, there are two possible languages). x is the input, and C(x) is the result of adding noise to x; we will get to the noise-creating function C shortly. e() is the encoder and d() is the decoder. The term at the end, Δ(x̂, x), is the sum of cross-entropy errors at the token level. Since we have an input sequence and produce an output sequence, we need to make sure that every token is in the right order, which is why this loss is used. It can be thought of as multi-label classification, where the ith token of the input is compared with the ith token of the output. A token is a single unit which cannot be broken down further. In our case, it is a single word.

So, Equation 1.0 is the loss that makes the network minimize the difference between its output (when given a noisy input) and the original, untouched sentence.
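Concretely, the token-level term Δ can be sketched as a sum of per-token cross-entropy errors. This is a minimal illustration, not the authors’ code; the function name and array shapes are my own:

```python
import numpy as np

def token_cross_entropy(probs, target_ids):
    """Sketch of Delta(x_hat, x): the sum of cross-entropy errors at the
    token level. probs is a (seq_len, vocab_size) array of predicted token
    distributions; target_ids holds the gold token id for each position."""
    return -sum(np.log(probs[i, t]) for i, t in enumerate(target_ids))
```

Each position i contributes −log of the probability the model assigned to the correct ith token, so a perfectly predicted token contributes zero to the loss.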

The 𝔼 and ~ notation

𝔼 is the symbol for expectation. In this context, it means the inputs are drawn from the distribution of language l, and the average of the loss over those inputs is taken. It is just a mathematical formality; the actual loss during implementation (the sum of cross-entropy errors) is implemented as usual.

The ~, in particular, means “is drawn from the probability distribution of”.

I won’t go into details here, but you can read about this notation in detail in Chapter 8.1 of the Deep Learning Book.

How to Add Noise

Unlike images, where noise can be added simply by adding floating-point numbers to the pixel values, adding noise to language has to work differently. Therefore, the authors developed their own scheme to create noise. They denote their noise function as C(). It takes in the input sentence and outputs a noisy version of that sentence.

There are two different ways to add noise.

First, it is possible to simply drop a word from the input with a probability of P_wd.

Secondly, each word can be shifted from its original position, subject to this constraint:

$$|\sigma(i) - i| \le k$$

Equation 2.0: Word-shuffle constraint

Here, σ(i) means the shifted location of the ith token. So, Equation 2.0 means: “a token can shift from its position at most k tokens to the left or to the right”.

The authors used a k value of 3 and a P_wd value of 0.1.
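A hedged sketch of what the noise function C() might look like. The local shuffle here uses the common trick of jittering each index by a uniform random amount before sorting, which enforces the at-most-k-positions constraint; the names and details are illustrative, not the authors’ code:

```python
import random

def add_noise(tokens, p_wd=0.1, k=3):
    """Sketch of the noise function C(): drop each word with probability
    p_wd, then locally shuffle the remaining words so that no word ends up
    more than k positions away from where it started."""
    # Word drop
    kept = [t for t in tokens if random.random() >= p_wd]
    # Local shuffle: sorting by i + U(0, k + 1) guarantees |sigma(i) - i| <= k
    keys = [i + random.uniform(0, k + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]
```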

Cross Domain Training

In order to learn to translate between two languages, there needs to be some process that maps an input sentence (in language A) to an output sentence (in language B). The authors call this cross domain training. First, an input sentence x is sampled in language A. Then the translated output y is produced using the model M() from the previous iteration; putting it together, y = M(x). After that, y is corrupted using the same noise function C() described above, giving C(y). The encoder of language B encodes this corrupted version, and the decoder of language A decodes the encoder’s output, attempting to recreate the original sentence x. The models are trained using the same sum of cross-entropy errors as in Equation 1.0.
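The steps above can be sketched as a single function, with the previous-iteration translator, the noise function, and the encoder/decoder passed in as stand-ins (all names here are illustrative, not from the paper):

```python
def cross_domain_step(x_tokens, translate_prev, corrupt, encode, decode):
    """One cross-domain training example. translate_prev is the model M()
    from the previous iteration (language A -> B), corrupt is the noise
    function C(), and encode/decode are language B's encoder and language
    A's decoder. Returns (reconstruction, target); the training loss is the
    same token-level cross entropy as in Equation 1.0."""
    y = translate_prev(x_tokens)             # y = M(x)
    noisy_y = corrupt(y)                     # C(y)
    reconstruction = decode(encode(noisy_y)) # should recreate x
    return reconstruction, x_tokens
```

With toy stand-ins (say, a “language B” that is just uppercase), the function wires the pieces together in the order described above.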

Learning a Common Latent Space by Adversarial Training

So far, there has been no mention of how to learn the common latent space. The cross domain training mentioned above may nudge the models toward a similar space, but a stronger constraint is required to push them to learn a truly shared latent space.

The authors used adversarial training. Another model (called the discriminator) takes the output of each encoder and predicts which language the encoded sentence belongs to. The gradients from the discriminator are then used to train the encoders to fool it. This is conceptually no different from a standard GAN (Generative Adversarial Network). Since RNNs are used, the discriminator takes in the feature vector of each time step and predicts which language it came from.
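As a sketch, the two objectives can be written as mirror images of each other: the discriminator minimises a binary cross-entropy on the true language label, while the encoders minimise the same loss with the label flipped. The function names are illustrative:

```python
import math

def discriminator_loss(p_lang_A, latent_is_from_A):
    """Binary cross-entropy for the discriminator, which outputs p_lang_A,
    the predicted probability that a latent vector came from language A."""
    return -math.log(p_lang_A) if latent_is_from_A else -math.log(1.0 - p_lang_A)

def encoder_adversarial_loss(p_lang_A, latent_is_from_A):
    """The encoders are trained to fool the discriminator: they minimise
    the discriminator's loss computed with the language label flipped."""
    return discriminator_loss(p_lang_A, not latent_is_from_A)
```

When the discriminator cannot tell the languages apart (p_lang_A ≈ 0.5), both losses meet at log 2, which is exactly the equilibrium a shared latent space aims for.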

Putting it all together

The three different losses (auto-encoder loss, translation loss, and discriminator loss) mentioned above are added together, and all the model weights are updated in one step.
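A minimal sketch of that combination; the individual loss values would come from the denoising, cross-domain, and adversarial objectives above, and the weights are hypothetical hyperparameters (the text does not give specific weightings):

```python
def total_loss(l_auto_A, l_auto_B, l_cd_AB, l_cd_BA, l_adv,
               w_auto=1.0, w_cd=1.0, w_adv=1.0):
    """Combine the auto-encoder, cross-domain (translation), and adversarial
    losses into the single scalar used for one joint weight update."""
    return (w_auto * (l_auto_A + l_auto_B)
            + w_cd * (l_cd_AB + l_cd_BA)
            + w_adv * l_adv)
```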

Since this is a sequence-to-sequence problem, the authors used an LSTM network with attention, i.e. there are two LSTM-based auto-encoders, one for each language.

At a high level, there are three main steps to training this architecture, followed iteratively. One pass of the training loop looks somewhat like this:

1. Obtain a translation using the encoder of language A and the decoder of language B.
2. Train each auto-encoder to regenerate an uncorrupted sentence when given a corrupted sentence.
3. Improve the translation by corrupting the translation obtained in Step 1 and recreating it. For this step, the encoder of language B and the decoder of language A are trained together (and likewise the encoder of language A with the decoder of language B).

Note that even though Steps 2 and 3 are listed separately, the weights for both are updated together.

How to Jumpstart this Framework

As mentioned above, the model uses its own translations from the previous iteration to improve its translation capabilities. Therefore, before the training loop begins, it is important to have some form of translation ability already. The authors used FastText embeddings to learn a word-level bilingual dictionary. Note that this method is very naive and is required only to give the model a starting point.
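The resulting word-level dictionary can then be used for naive word-by-word translation to produce the very first M(). A sketch, with made-up dictionary entries for illustration:

```python
def word_by_word_translate(tokens, bilingual_dict):
    """Naive jumpstart translation: replace each word with its entry in the
    learned word-level bilingual dictionary, leaving unknown words as-is."""
    return [bilingual_dict.get(t, t) for t in tokens]
```

It ignores word order and context entirely, which is exactly why it is only a starting point for the iterative procedure above.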

The whole framework is given in the flowchart below