Conditional language models

Recall that a language model assigns a probability to a sequence of words. A conditional language model is a generalization of this idea: it assigns probabilities to a sequence of words given some conditioning context (x):
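Written out, this is the standard chain-rule factorization (with y1, …, yT the words of the output sequence and x the conditioning input):

    P(y1, …, yT | x) = P(y1 | x) · P(y2 | y1, x) · … · P(yT | y1, …, yT−1, x)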

Let’s look at two examples.

Example 1: Neural Machine Translation

Machine translation is exactly what it sounds like: automatically translating from one language to another. In 2014 the field was rocked by a new approach. In the blink of an eye, decades of research were overturned by a new technique known as neural machine translation (NMT).

NMT uses a single neural network composed of two RNNs:

Encoder RNN: Extracts all of the pertinent information from the source sentence to produce an encoding

Decoder RNN: A language model that generates the target sentence conditioned on the encoding created by the encoder

This architecture is known as a sequence-to-sequence model (or simply seq2seq for those studious of brevity). It is trained on pairs of source sentences and their translations. Crucially, it is trained “end to end” via backpropagation as a single system, with no more need for hand-crafted rules and intricate linguistic knowledge. (To the Phrasee linguistics team: this is an oversimplification of course! We need you more than ever.)
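To make the shape of the thing concrete, here is a minimal sketch of the two RNNs in PyTorch (illustrative only: the layer sizes are made up, and real NMT systems add attention, larger vocabularies and much more):

    import torch
    import torch.nn as nn

    class EncoderRNN(nn.Module):
        def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

        def forward(self, src_tokens):
            # src_tokens: (batch, src_len) integer word ids for the source sentence
            _, hidden = self.rnn(self.embed(src_tokens))
            return hidden  # (1, batch, hidden_dim): the fixed-length encoding

    class DecoderRNN(nn.Module):
        def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tgt_tokens, hidden):
            # tgt_tokens: words generated so far; hidden: starts as the encoder's encoding
            output, hidden = self.rnn(self.embed(tgt_tokens), hidden)
            return self.out(output), hidden  # per-step scores over the target vocabulary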

The “decoder” is a conditional language model. The output is based on the sequence generated so far and the original text to be translated:

Credit: this picture has been adapted from the excellent lecture notes from the Stanford course Natural Language Processing with Deep Learning (CS224n)

The decoder is trained with a method called “teacher forcing”: the target sequence is the input sequence offset by one, so at every position the decoder is learning to predict the word that comes next.
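In code, teacher forcing amounts to slicing the same sequence two ways. A sketch, reusing the hypothetical encoder/decoder classes above (the token ids are made up):

    encoder = EncoderRNN(vocab_size=1000)
    decoder = DecoderRNN(vocab_size=1000)

    source = torch.tensor([[5, 42, 7, 2]])        # source sentence as word ids
    target = torch.tensor([[1, 14, 9, 23, 2]])    # "<s> ... </s>" target sentence as word ids

    decoder_input = target[:, :-1]                # everything except the last word
    decoder_labels = target[:, 1:]                # the same words, offset by one

    encoding = encoder(source)
    logits, _ = decoder(decoder_input, encoding)  # conditioned on the encoding

    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),      # (batch * steps, vocab)
        decoder_labels.reshape(-1))               # the word that comes next at each step
    loss.backward()                               # gradients flow through both RNNs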

Note that we are no longer generating random nonsense! NMT generates text with meaning. Text with purpose. The kind of text you could see yourself reading.

Example 2: Image captioning

For the image captioning problem we have:

Input: An image

Output: Text describing the image

The encoder, for example a convolutional neural network (CNN), extracts key features from the image. These features are distilled into a compact encoding that is used to condition the language model:
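A sketch of that handover, using an off-the-shelf torchvision ResNet as the feature extractor (the particular backbone and the sizes are assumptions for illustration):

    import torch
    import torch.nn as nn
    from torchvision import models

    cnn = models.resnet18()               # in practice you would load pretrained weights
    cnn.fc = nn.Identity()                # drop the classifier head: outputs 512-d features

    image = torch.randn(1, 3, 224, 224)   # a dummy RGB image tensor
    features = cnn(image)                 # (1, 512) compact summary of the image

    to_hidden = nn.Linear(512, 256)       # project to the decoder's hidden size
    encoding = to_hidden(features)        # (1, 256): the vector that conditions the caption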

How is information transferred from the encoder to the decoder?

In NMT and image captioning the encoder creates a fixed-length encoding (a vector of real numbers) that encapsulates information about the input. This representation has several names:

embedding

latent vector

meaning vector

thought vector

Here is the key: the embedding becomes the initial state of the decoder RNN. Read that again. When the decoding process starts it has, in theory, all of the information that it needs to generate the target sequence.
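In the sketches from earlier, that handover is a single line: the encoding is reshaped into the decoder’s initial hidden state, and generation then proceeds one word at a time (the start/end token ids below are made up):

    hidden = encoding.view(1, 1, -1)      # (num_layers=1, batch=1, hidden_dim): initial state
    token = torch.tensor([[1]])           # id 1 = hypothetical <s> start-of-sequence token

    generated = []
    for _ in range(20):                   # generate at most 20 words
        logits, hidden = decoder(token, hidden)
        token = logits.argmax(dim=-1)     # greedy: pick the most likely next word
        if token.item() == 2:             # id 2 = hypothetical </s> end-of-sequence token
            break
        generated.append(token.item())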

Once you understand this, the sky is the limit. Conditional language models power plenty of other applications too: summarization (condition on a long document, generate a short one), dialogue (condition on the conversation so far), and speech recognition (condition on audio features), to name a few.

Conditioning with word vectors

At Phrasee we do something a little different: we condition our language models with word embeddings. A word embedding is a dense vector of real numbers:
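For example, a (made-up, purely illustrative) five-dimensional embedding for the word “summer” might look like [0.12, −0.83, 0.40, 1.57, −0.09]; real embeddings typically have a few hundred dimensions.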

Word embeddings have the following desirable qualities:

they are fixed-length, which is convenient for machine learning algorithms

they capture semantic information about words (e.g. synonyms will be close in the vector space)

they are easy and fast to compute

they can be combined using vector arithmetic. The classic example: “king” − “man” + “woman” = a vector that is pretty close to “queen” (see the sketch after this list)

they can be combined to build up more complex concepts that don’t correspond to a single word
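A quick sketch of these properties using gensim’s pretrained vectors (the particular model name is just an example; any pretrained embedding set behaves similarly):

    import gensim.downloader

    vectors = gensim.downloader.load("glove-wiki-gigaword-100")   # downloads on first use

    print(vectors["king"].shape)    # (100,) -- a fixed-length dense vector

    # the classic arithmetic: king - man + woman is closest to queen
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

    # combining embeddings into a concept that has no single word
    summer_sale = (vectors["summer"] + vectors["sale"]) / 2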

Here is what our model looks like:

Note that in this case there is no encoder; the decoder is conditioned directly with the word embedding. It would be an easy modification to turn this into an end-to-end system where encoder embeddings are also learned. However, in our case we have an external source of embeddings that we would like to use.
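This is not our actual code, but the general pattern the paragraph describes might look something like the following sketch, reusing the hypothetical pieces above: an externally produced word (or combined concept) embedding is projected into the decoder’s hidden space and used directly as its initial state.

    concept = (vectors["summer"] + vectors["sale"]) / 2          # external 100-d embedding
    project = nn.Linear(100, 256)                                # map into the decoder's space
    condition = project(torch.from_numpy(concept))

    hidden = condition.view(1, 1, -1)     # the decoder's initial state, no encoder involved
    # ...generation then proceeds exactly as in the loop shown earlier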