Throughout this post, when I say “network consumes a sequence of words” or “words are passed to RNN,” I mean that word embeddings are passed to the network, not word ids.

Note on dialogue data representation

Before going deeper, we should discuss what dialogue datasets look like. All models described below are trained on (context, reply) pairs. The context is one or more sentences that preceded the reply. Each sentence is just a sequence of tokens from the vocabulary.

For a better understanding, look at the table. It contains a batch of three samples extracted from a raw dialogue between two people:

- Hi!

- Hi there.

- How old are you?

- Twenty-two. And you?

- Me too! Wow!

Note the “<eos>” (end-of-sequence) token at the end of each sentence in the batch. This special token helps the neural network recognize sentence boundaries and update its internal state accordingly.
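As a rough illustration, here is one way such (context, reply) pairs could be built from the dialogue above. This is only a sketch: the function name `make_pairs`, the `max_context` parameter, and the tokenization are assumptions made for this example. It produces four pairs from this dialogue and does not reproduce the three-sample batch from the original table.

```python
EOS = "<eos>"


def make_pairs(utterances, max_context=2):
    """Turn a list of tokenized utterances into (context, reply) pairs.

    max_context is a hypothetical knob: how many preceding utterances
    are kept as context for each reply.
    """
    pairs = []
    for i in range(1, len(utterances)):
        start = max(0, i - max_context)
        # Every sentence in the context ends with <eos>, and so does the reply.
        context = [tok for utt in utterances[start:i] for tok in utt + [EOS]]
        reply = utterances[i] + [EOS]
        pairs.append((context, reply))
    return pairs


dialogue = [
    ["hi", "!"],
    ["hi", "there", "."],
    ["how", "old", "are", "you", "?"],
    ["twenty-two", ".", "and", "you", "?"],
    ["me", "too", "!", "wow", "!"],
]

pairs = make_pairs(dialogue)
for context, reply in pairs:
    print(" ".join(context), "->", " ".join(reply))
```

In a real pipeline the tokens would then be mapped to ids and padded to a common length within the batch.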

Some models may use additional meta information from data, such as speaker id, gender, emotion, etc.

Now, we are ready to move on to discussing generative models.

Generative models

We start with the simplest conversational model, based on the paper “A Neural Conversational Model.”

To model dialogue, this paper used the sequence-to-sequence (seq2seq) framework, which emerged in the neural machine translation field and was successfully adapted to dialogue problems. The architecture consists of two RNNs with different sets of parameters. The left one (corresponding to A-B-C tokens) is called the encoder, while the right one (corresponding to <eos>-W-X-Y-Z tokens) is called the decoder.

How does the encoder work?

The encoder RNN consumes a sequence of context tokens one at a time and updates its hidden state after each one. After processing the whole context sequence, it produces a final hidden state, which captures the meaning of the context and is used for generating the answer.
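A minimal sketch of this encoder loop, assuming a plain (vanilla) RNN cell with a tanh nonlinearity and random, untrained weights. The paper itself may use a different cell type; all names, sizes, and token ids here are made up for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(hidden, embedding, w_hh, w_xh):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    pre = [a + b for a, b in zip(matvec(w_hh, hidden), matvec(w_xh, embedding))]
    return [math.tanh(v) for v in pre]


def encode(token_ids, embeddings, w_hh, w_xh):
    """Consume the context one token at a time and return the final state."""
    hidden = [0.0] * len(w_hh)
    for token_id in token_ids:
        # The network sees the word embedding, not the raw word id.
        hidden = rnn_step(hidden, embeddings[token_id], w_hh, w_xh)
    return hidden


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


vocab_size, embed_dim, hidden_dim = 6, 4, 3
embeddings = rand_matrix(vocab_size, embed_dim)
w_hh = rand_matrix(hidden_dim, hidden_dim)
w_xh = rand_matrix(hidden_dim, embed_dim)

context_ids = [1, 2, 0]  # hypothetical ids, e.g. "hi", "!", "<eos>"
final_state = encode(context_ids, embeddings, w_hh, w_xh)
print(final_state)
```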

How does the decoder work?

The goal of the decoder is to take the context representation from the encoder and generate an answer. For this purpose, the decoder RNN has a softmax layer over the vocabulary. At each time step, this layer takes the decoder hidden state and outputs a probability distribution over all words in the vocabulary.

Here is how reply generation works:

1. Initialize decoder hidden state with final encoder hidden state (h_0).
2. Pass <eos> token as first input to the decoder and update hidden state (h_1).
3. Sample (or take one with max probability) first word (w_1) from softmax layer (using h_1).
4. Pass this word as input, update hidden state (h_1 -> h_2) and generate new word (w_2).
5. Repeat step 4 until <eos> token is generated or maximum answer length is exceeded.
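The generation steps above can be sketched as a short loop. This is an illustrative toy, not the paper's implementation: it assumes a vanilla tanh RNN cell, a single linear layer before the softmax, and random untrained weights, and every name in it is hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(h0, params, eos_id, max_len=10, sample=False, rng=None):
    """Generate a reply as a list of token ids, without the final <eos>."""
    embeddings, w_hh, w_xh, w_out = params
    rng = rng or random.Random()
    # Start from h_0 (the final encoder state) and feed <eos> as first input.
    hidden, token, reply = h0, eos_id, []
    for _ in range(max_len):
        pre = [a + b for a, b in zip(matvec(w_hh, hidden),
                                     matvec(w_xh, embeddings[token]))]
        hidden = [math.tanh(v) for v in pre]       # h_{t-1} -> h_t
        probs = softmax(matvec(w_out, hidden))     # distribution over the vocabulary
        if sample:
            token = rng.choices(range(len(probs)), weights=probs)[0]
        else:
            token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == eos_id:
            break
        reply.append(token)
    return reply


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


vocab_size, embed_dim, hidden_dim, eos_id = 5, 4, 3, 0
params = (rand_matrix(vocab_size, embed_dim),    # word embeddings
          rand_matrix(hidden_dim, hidden_dim),   # W_hh
          rand_matrix(hidden_dim, embed_dim),    # W_xh
          rand_matrix(vocab_size, hidden_dim))   # output layer before the softmax
h0 = [0.1, -0.2, 0.3]  # stands in for the final encoder state

greedy_reply = generate(h0, params, eos_id)
sampled_reply = generate(h0, params, eos_id, sample=True, rng=random.Random(1))
print(greedy_reply, sampled_reply)
```

Calling `generate` twice with `sample=False` returns the same ids, while `sample=True` can return a different reply for each random seed.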

Reply generation in the decoder, for those who prefer formulas to words. Here, w_t is the word sampled at time step t; theta are the decoder parameters, phi are the parameters of the dense layers, g represents the dense layers, and p-hat is the probability distribution over the vocabulary at time step t.
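For reference, the procedure described by these symbols can be written roughly as follows. This is a reconstruction from the definitions in the text, not a copy of the original figure. It assumes that h_0 is the final encoder state and that w_0 is the <eos> token, as in the steps described above; the name RNN for the decoder cell is a placeholder.

```latex
\begin{aligned}
h_t &= \mathrm{RNN}(h_{t-1}, w_{t-1}; \theta), \\
\hat{p}_t &= \mathrm{softmax}\big(g(h_t; \phi)\big), \\
w_t &\sim \hat{p}_t \quad \text{or} \quad w_t = \arg\max_{w} \hat{p}_t(w).
\end{aligned}
```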

With argmax, the model always produces the same answer for the same context (argmax is deterministic, while sampling is stochastic).

The process I’ve described above is only the model inference part, but there is also the model training part, which works in a slightly different way — at each decoding step, we use the correct word y_t instead of the generated one (w_t) as the input. In other words, at training time, the decoder consumes a correct reply sequence, but with the last token removed and the <eos> token prepended.
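To make the training-time input concrete, here is a small sketch of how the decoder inputs and targets line up. The helper name `teacher_forcing_pair` is made up for this example, and it assumes the reply already ends with the <eos> token.

```python
EOS = "<eos>"


def teacher_forcing_pair(reply):
    """Return (decoder_inputs, targets) for a reply that ends with <eos>."""
    if not reply or reply[-1] != EOS:
        raise ValueError("reply must end with the <eos> token")
    # Inputs: the correct reply with the last token removed and <eos> prepended.
    decoder_inputs = [EOS] + reply[:-1]
    # Targets: the correct reply itself, so the target at step t is y_t.
    targets = list(reply)
    return decoder_inputs, targets


reply = ["hi", "there", ".", EOS]
inputs, targets = teacher_forcing_pair(reply)
for x, y in zip(inputs, targets):
    print(f"input={x!r:10} target={y!r}")
```

At every step the decoder sees the correct previous word and is asked to predict the next one, regardless of what it would have generated on its own.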

Illustration of decoder inference phase. Output at previous time step is fed as input at current time step.

The goal is to maximize the probability of the correct next word at each time step. More simply, we ask the network to predict the next word in the sequence by providing it with the correct prefix. Training uses maximum likelihood estimation, which leads to the classical cross-entropy loss:

Here, y_t is a correct word in reply at time step t.
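Written out, the cross-entropy loss for a single reply would look roughly like this. It is a reconstruction from the surrounding text rather than the original figure, and it assumes T is the reply length (including the final <eos>) and that p-hat at step t is computed from the correct prefix y_1, ..., y_{t-1}.

```latex
\mathcal{L}(\theta, \phi) = -\sum_{t=1}^{T} \log \hat{p}_t(y_t)
```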

Modifications of generative models

Now we have a basic understanding of the sequence-to-sequence framework. How do we add more generalization power to such models? There are a bunch of ways:

- Add more layers to the encoder and/or decoder RNNs.
- Use a bidirectional encoder. There is no way to make the decoder bidirectional, because it generates words strictly from left to right.
- Experiment with embeddings. You can pre-initialize word embeddings or learn them from scratch together with the model.
- Use a more advanced reply generation procedure: beam search. The idea is to not generate a reply “greedily” (by taking argmax for the next word) but to consider the probability of longer chains of words and choose among them.
- Make your encoder and/or decoder convolutional. Convnets might work much faster than RNNs because they can be parallelized efficiently.
- Use an attention mechanism. Attention was initially introduced in neural machine translation papers and has become a very popular and powerful technique.
- Pass the final encoder state to the decoder at each time step. By default, the decoder sees the final encoder state only once and then may forget it. A good idea is to pass it to the decoder along with the word embedding.
- Use different encoder and decoder state sizes. The model I described above requires the encoder and decoder to have the same hidden state size (because we initialize the decoder state with the final encoder state). You can get rid of this requirement by adding a projection (dense) layer from the final encoder state to the initial decoder state.
- Use characters or byte pair encoding instead of words for building the vocabulary. Character-level models are worth considering: their much smaller vocabulary makes them faster, and they can handle words that are not in a word-level vocabulary. Byte Pair Encoding (BPE) is the best of both worlds. The idea is to find the most frequent pairs of tokens in a sequence and merge them into one token.
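As an example of one item from this list, here is a minimal beam search sketch. It is a toy under stated assumptions: `next_probs` is a hypothetical interface that maps a prefix to next-word probabilities (in a real model this would be the decoder softmax), the probability table is invented, and there is no length normalization.

```python
import math

EOS = "<eos>"


def beam_search(next_probs, beam_width=2, max_len=5):
    """Return the (tokens, log_prob) hypothesis with the highest score."""
    beams = [((), 0.0)]   # unfinished hypotheses: (prefix, log-probability)
    finished = []         # hypotheses that already ended with <eos>
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, p in next_probs(prefix).items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Keep only the beam_width most probable continuations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            (finished if cand[0][-1] == EOS else beams).append(cand)
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])


# Invented toy distribution: the most likely first word leads to a worse reply.
TABLE = {
    (): {"yes": 0.6, "i": 0.4},
    ("yes",): {EOS: 0.4, "sir": 0.3, "please": 0.3},
    ("i",): {"agree": 0.9, "think": 0.1},
}


def toy_next_probs(prefix):
    # Any prefix not listed in the table simply ends the reply.
    return TABLE.get(prefix, {EOS: 1.0})


greedy = beam_search(toy_next_probs, beam_width=1)  # same as taking argmax
best = beam_search(toy_next_probs, beam_width=2)
print(greedy[0], math.exp(greedy[1]))
print(best[0], math.exp(best[1]))
```

With a beam width of 1 the search degenerates to greedy decoding and commits to “yes” too early, while a width of 2 keeps the “i” hypothesis alive long enough to find the more probable full reply.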

Problems with generative models

Later, I’ll give you links to popular implementations so you can train your own dialogue models. But first I’d like to warn you about some common problems with generative models that you may run into.

Generic responses

Generative models trained via maximum likelihood tend to predict high probability for general replies such as “Okay,” “No,” “Yes,” and “I don’t know” for a wide range of contexts. There are some works dealing with this problem by:

Reply inconsistency / how to incorporate metadata

The second major problem with seq2seq models is that they can generate inconsistent replies for contexts that are phrased differently but have the same meaning:

The most cited work dealing with it is “A Persona-Based Neural Conversation Model.” The authors used speaker ids for each utterance so that the generated answer is conditioned not only on the encoder state but also on a speaker embedding. Speaker embeddings are learned from scratch along with the model.

Using this idea, you can augment your model with whatever metadata you have. For example, if you know the tense of each utterance (past/present/future), you can generate replies in different tenses at inference time! You can adjust the personality of the replier (gender, age, mood) or reply properties (tense, sentiment, question/not question, etc.) as long as you have such data to train models on.
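One simple way to condition the decoder on such metadata is to concatenate a metadata embedding with the word embedding at every decoding step. The sketch below only illustrates that idea with speaker ids. The table sizes and names are invented, the embeddings are random instead of learned, and the paper's exact wiring may differ.

```python
import random

rng = random.Random(0)


def random_table(rows, dim):
    """A stand-in for a learned embedding table."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


word_embeddings = random_table(rows=100, dim=8)    # one row per word id
speaker_embeddings = random_table(rows=5, dim=3)   # one row per speaker id


def decoder_input(word_id, speaker_id):
    """Build the decoder input for one step by concatenating both vectors."""
    # List "+" concatenates here, it does not add element-wise.
    return word_embeddings[word_id] + speaker_embeddings[speaker_id]


x_a = decoder_input(word_id=42, speaker_id=0)
x_b = decoder_input(word_id=42, speaker_id=1)
print(len(x_a), x_a[8:], x_b[8:])
```

The same word fed to the decoder now yields a different input vector for each speaker, which is what lets the model learn speaker-specific replies.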

For your practice

I promised you links to seq2seq model implementations in different frameworks, and here they are.

TensorFlow

Google’s official implementation

Two more implementations which you may find more comfortable to work with

PyTorch (seq2seq for translation, but you can use the same code for dialogues)

Translation with seq2seq (you may use the same code but with dialogue data)

Implementation from IBM

Keras

Papers & guides

Diving into selective models

Now that we are done with generative models, let’s understand how selective neural conversational models work (they are often referred to as DSSM, which stands for deep semantic similarity model).

Instead of estimating the probability p(reply | context; w), selective models learn a similarity function sim(reply, context; w), where a reply is one of the elements in a predefined pool of possible answers (see illustration below).

The intuition is that the network takes a context and a candidate reply as inputs and returns a confidence score for how well they fit each other.

The selective (or ranking, or DSSM) network consists of two “towers”: the first for the context and the second for the reply. Each tower may have any architecture you want. A tower takes its input and embeds it in a semantic vector space (vectors R and C in the illustration). Then, the similarity between the context and reply vectors is computed, e.g., using cosine similarity C^T*R/(||C||*||R||).

At inference time, we can calculate the similarity between the given context and all possible answers and choose the one with maximum similarity.
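The inference step can be sketched as follows, assuming the two towers have already produced the context vector C and one vector R per candidate reply. The vectors are made-up numbers and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity: u^T v / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_reply(context_vec, reply_vecs):
    """Score every candidate reply and return (best index, all scores)."""
    scores = [cosine(context_vec, r) for r in reply_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


context_vec = [1.0, 0.0, 1.0]    # C: output of the context tower
reply_vecs = [                   # R: outputs of the reply tower
    [0.0, 1.0, 0.0],             # orthogonal to the context
    [2.0, 0.1, 1.9],             # points in almost the same direction
    [-1.0, 0.0, -1.0],           # points in the opposite direction
]

index, scores = best_reply(context_vec, reply_vecs)
print(index, scores)
```

Because the pool of replies is fixed, the reply vectors can be computed once in advance, so only the context tower has to run for each new request.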

In order to train the model, we use triplet loss. Triplet loss is defined on triplets (context, reply_correct, reply_wrong) and is equal to: