This article is the first installment of a two-post series on building a machine reading comprehension system using the latest advances in deep learning for NLP. Stay tuned for the second part, where we'll introduce a pre-trained model called BERT that will take your NLP projects to the next level!

In the recent past, if you specialized in natural language processing (NLP), there may have been times when you felt a little jealous of your colleagues working in computer vision. It seemed as if they had all the fun: the annual ImageNet classification challenge, Neural Style Transfer, Generative Adversarial Networks, to name a few. At last, the dry spell is over, and the NLP revolution is well underway! It would be fair to say that the turning point was 2017, when the Transformer network was introduced in Google's "Attention is All You Need" paper. Multiple further advances have followed since then, one of the most important being BERT - the subject of our next article.

To lay the groundwork for the Transformer discussion, let's start by looking at one of the common categories of NLP tasks: the sequence-to-sequence (seq2seq) problems. They are pretty much exactly what their name suggests: both the inputs and the outputs of a seq2seq task are sequences. In the context of NLP, there are typically additional restrictions put in place:

The elements of the sequence are tokens drawn from a fixed vocabulary (often including an Unknown token for out-of-vocabulary words).

The order inside the sequence matters.
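To make these two restrictions concrete, here is a minimal sketch of turning a sentence into an ordered sequence of token ids. The toy vocabulary, the `<unk>` spelling of the Unknown token, and the whitespace tokenizer are all assumptions made for illustration; real systems build the vocabulary from a corpus and usually tokenize more carefully.

```python
# Hypothetical toy vocabulary; id 0 is reserved for the Unknown token.
UNK = "<unk>"
vocab = {UNK: 0, "je": 1, "suis": 2, "étudiant": 3}


def encode(sentence, vocab):
    """Map a whitespace-tokenized sentence to a list of token ids, preserving order."""
    return [vocab.get(token, vocab[UNK]) for token in sentence.lower().split()]


ids = encode("Je suis professeur", vocab)
print(ids)  # [1, 2, 0] : "professeur" is out of vocabulary, so it maps to <unk>
```

Note that `encode("suis je", vocab)` gives a different sequence than `encode("je suis", vocab)`: the order is part of the input.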

Next we shall take a moment to remember the fallen heroes, without whom we would not be where we are today. I am, of course, referring to the RNNs - Recurrent Neural Networks, a concept that became almost synonymous with NLP in the deep learning field.

1. The predecessor to Transformers: the RNN Encoder-Decoder

This story takes us all the way back to 2014 (Ref, another Ref), when the idea of approaching seq2seq problems via two Recurrent Neural Networks combined into an Encoder-Decoder model was born. Let's demonstrate this architecture on a simple example from the Machine Translation task. Take a French-English sentence pair, where the input is "je suis étudiant" and the output is "I am a student". First, "je" (or, most likely, a word embedding for the token representing "je") gets fed into the Encoder RNN, often accompanied by an initial vector $h^E_0$, which could be either learned or fixed. This results in the output vector $h^E_1$ (hidden state 1), which serves as the next input for the Encoder RNN, together with the second element of the input sequence, "suis". The output of this operation, $h^E_2$, and "étudiant" are again fed into the Encoder, producing the last Encoder hidden state for this training sample, $h^E_3$. The $h^E_3$ vector depends on all of the tokens inside the input sequence, so the idea is that it should represent the meaning of the entire phrase. For this reason it is also referred to as the context vector.

The context vector is the first input to the Decoder RNN, which should then generate the first element of the output sequence, "I" (in reality, the last layer of the Decoder is typically a softmax, but for simplicity we can just keep the most likely element at the end of every Decoder step). Additionally, the Decoder RNN produces a hidden state $h^D_1$. We feed $h^D_1$ and the previous output "I" back into the Decoder to hopefully get "am" as our second output. This process of generating outputs and feeding them back into the Decoder continues until we produce an <EOS> - the end of the sentence token, which signifies that our job here is done.

The RNN Encoder-Decoder model in action.

To avoid any confusion, there is something that I would like to draw your attention to. The multiple RNN blocks appear in the Figure because of the multiple elements of the sequence that get fed into / generated by the networks, but make no mistake - there is only one Encoder RNN and one Decoder RNN at play here. It may help to think of the repeated blocks as the same RNN at different timesteps, or as multiple RNNs with shared weights that are invoked one after another.
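The walk-through above can be sketched in a few lines of NumPy. This is only a toy illustration under loud assumptions: a plain tanh RNN cell instead of an LSTM or GRU, random untrained weights, arbitrary dimensions, a zero initial hidden state, and a hypothetical `EOS` id of 0. The point is the data flow, in particular that a single set of Encoder weights and a single set of Decoder weights are reused at every timestep.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, vocab_out = 4, 8, 6  # toy sizes, chosen arbitrarily
EOS = 0                            # hypothetical id of the <EOS> token

# One Encoder RNN and one Decoder RNN: each has a single set of weights.
W_xh_enc = rng.normal(size=(d_emb, d_hid))
W_hh_enc = rng.normal(size=(d_hid, d_hid))
W_xh_dec = rng.normal(size=(d_emb, d_hid))
W_hh_dec = rng.normal(size=(d_hid, d_hid))
W_out = rng.normal(size=(d_hid, vocab_out))     # hidden state -> scores over output tokens
out_embed = rng.normal(size=(vocab_out, d_emb))  # embeddings of the output tokens


def rnn_step(x, h, W_xh, W_hh):
    """One timestep of a vanilla RNN cell."""
    return np.tanh(x @ W_xh + h @ W_hh)


def encode(inputs):
    h = np.zeros(d_hid)  # initial hidden state, fixed to zeros in this sketch
    for x in inputs:     # strictly sequential: each step needs the previous h
        h = rnn_step(x, h, W_xh_enc, W_hh_enc)
    return h             # the last hidden state is the context vector


def decode(context, max_len=10):
    h, prev, tokens = context, np.zeros(d_emb), []
    for _ in range(max_len):
        h = rnn_step(prev, h, W_xh_dec, W_hh_dec)
        token = int(np.argmax(h @ W_out))  # keep the most likely token (greedy)
        tokens.append(token)
        if token == EOS:
            break
        prev = out_embed[token]            # feed the previous output back in
    return tokens


source = rng.normal(size=(3, d_emb))  # stand-ins for embeddings of "je", "suis", "étudiant"
context = encode(source)
tokens = decode(context)
```

With untrained weights the generated ids are meaningless, of course; training would adjust the weights so that the sequence decodes to "I am a student".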

This architecture may seem simple (especially until we sit down to actually write the code with LSTMs or GRUs thrown in for good measure), but it turns out to be remarkably effective for many NLP tasks. In fact, Google Translate has been using it under the hood since 2016. However, the RNN Encoder-Decoder models do suffer from certain drawbacks:

1a. First problem with RNNs: Attention to the rescue

The RNN approach as described above does not work particularly well for longer sentences. Think about it: the meaning of the entire input sequence is expected to be captured by a single context vector with fixed dimensionality. This could work well enough for "Je suis étudiant", but what if your input looks more like this:

"It was a wrong number that started it, the telephone ringing three times in the dead of night, and the voice on the other end asking for someone he was not."

Good luck encoding that into a context vector! However, there turns out to be a solution, known as the Attention mechanism.

Schematics of (left) a conventional RNN Encoder-Decoder and (right) an RNN Encoder-Decoder with Attention

The basic idea behind Attention is simple: instead of passing only the last hidden state (the context vector) to the Decoder, we give it all the hidden states that come out of the Encoder. In our example that would mean $h^E_1$, $h^E_2$ and $h^E_3$. The Decoder will determine which of them gets attended to (i.e., where to pay attention) via a softmax layer. Apart from adding this additional structure, the basic RNN Encoder-Decoder architecture remains the same, yet the resulting model performs much better when it comes to longer input sequences.
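Here is a minimal sketch of that idea for a single Decoder step. I am assuming a simple dot-product score between the current Decoder state and each Encoder hidden state; this is one of several scoring functions used with RNNs (the early Attention papers also used a small learned network), so treat it as an illustration rather than the canonical recipe. The function name `attend` and all sizes are made up for the example.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()


def attend(decoder_state, encoder_states):
    """Return a context vector that is a weighted sum of all Encoder hidden states."""
    scores = encoder_states @ decoder_state  # one score per Encoder hidden state
    weights = softmax(scores)                # where to pay attention
    context = weights @ encoder_states       # weighted sum of the hidden states
    return context, weights


rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(3, 8))  # stand-ins for the three Encoder hidden states
decoder_state = rng.normal(size=8)        # current Decoder hidden state

context, weights = attend(decoder_state, encoder_states)
```

Unlike the fixed context vector of the plain Encoder-Decoder, this `context` is recomputed at every Decoder step, so different output tokens can focus on different input tokens.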

1b. Second problem with Recurrent NNs: they are (surprise!) Recurrent

The other problem plaguing RNNs has to do with the R inside the name: the computation in a Recurrent neural network is, by definition, sequential. What does this property entail? A sequential computation cannot be parallelized across timesteps, since we have to wait for the previous step to finish before we move on to the next one. This lengthens both the training time and the time it takes to run inference.

One of the ways around the sequential dilemma is to use Convolutional neural networks (CNNs) instead of RNNs. This approach saw its share of success, until it was outshone by the <drumroll> ...

2. Attention is All You Need (c) Google, 2017

The Transformer architecture was introduced in the paper whose title is worthy of that of a self-help book: Attention is All You Need. Again, another self-descriptive heading: the authors literally take the RNN Encoder-Decoder model with Attention, and throw away the RNN. Attention is all you need! Well, it ends up being quite a bit more complicated than that in practice, but that is the basic premise.

How does this work? To start with, each pre-processed (more on that later) element of the input sequence $w_i$ gets fed as input to the Encoder network - this is done in parallel, unlike the RNNs. The Encoder has multiple layers (e.g. in the original Transformer paper their number is six). Let us use $h_i$ to label the final hidden state of the last Encoder layer for each $w_i$. The Decoder also contains multiple layers - typically, the number is equal to that of the Encoder. All of the hidden states $h_i$ will now be fed as inputs to each of the six layers of the Decoder. If this looks familiar to you, it is for a good reason: this is the Transformer's Encoder-Decoder Attention, which is rather similar in spirit to the Attention mechanism that we discussed above. Before we move on to how the Transformer's Attention is implemented, let's discuss the preprocessing layers (present in both the Encoder and the Decoder as we'll see later).

There are two parts to preprocessing: first, there is the familiar word embedding, a staple in most modern NLP models. These word embeddings could be learned during training, or one could use one of the existing pre-trained embeddings. There is, however, a second part that is specific to the Transformer architecture. So far, nowhere have we provided any information on the order of the elements inside the sequence. How can this be done in the absence of the sequential RNN architecture? Well, we have the positions, so let's encode them inside vectors, just as we embedded the meaning of the word tokens with word embeddings. The resulting pre-processed vectors, carrying information about both the word's meaning and its position in the sentence, are passed on to the Encoder and Decoder layers.
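As an illustration of the two preprocessing parts, here is a sketch that adds a positional encoding to word embeddings. I am assuming the fixed sinusoidal encoding proposed in the original Transformer paper (learned position embeddings are another common choice), random stand-in embeddings, and an even `d_model`; the function name `sinusoidal_positions` is my own.

```python
import numpy as np


def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings; d_model is assumed to be even."""
    pos = np.arange(seq_len)[:, None]          # positions 0 .. seq_len-1, as a column
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions get the cosine
    return pe


rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8                    # toy sizes (the paper uses d_model = 512)
embeddings = rng.normal(size=(vocab_size, d_model))  # stand-in for learned word embeddings

token_ids = np.array([1, 2, 3])                # e.g. ids of a three-token input sequence
pe = sinusoidal_positions(len(token_ids), d_model)
x = embeddings[token_ids] + pe                 # meaning + position, fed to the first layer
```

Because the position is simply added to the embedding, the same token at two different positions ends up with two different input vectors, which is exactly the order information the parallel architecture would otherwise lack.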

An Encoder with two layers, processing a three-element input sequence (w1, w2, and w3) in parallel. Each input element's Encoder also receives information about the other elements via its Self-Attention sublayers, allowing the relationships between words in the sentence to be captured.

2a. Attention, the linear algebra perspective

I come from a quantum physics background, where vectors are a person's best friend (at times, quite literally), but if you prefer an explanation of the Attention mechanism that does not rely on linear algebra, I highly recommend checking out The Illustrated Transformer by Jay Alammar.

Let's use X to label the vector space of our inputs to the Attention layer. What we want to learn during training are three embedding matrices, $W_K$, $W_V$ and $W_Q$, which will permit us to go from X to three new spaces: K (keys), V (values) and Q (queries):

$$K = X W_K, \qquad V = X W_V, \qquad Q = X W_Q$$

The way that these embedded vectors are then used in the Encoder-Decoder Attention is the following. We take a Q vector (a query, i.e., we specify the kind of information that we want to attend to) from the Decoder. Additionally, we take vectors V (values) that we can think of as something similar to linear combinations of vectors X coming from the Encoder (do not take "linear combination" literally however, as the dimensionality of X and V is, in general, different). Vectors K are also taken from the Encoder: each key $K_n$ indexes the kind of information that is captured by the value $V_n$.

To determine which values should get the most attention, we take the dot product of the Decoder's query Q with all of the Encoder's keys K. The softmax of the result will give the weights of the respective values V (the larger the weight, the greater the attention). Such a mechanism is known as Dot-product attention, given by the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

where the division of the dot product of Q and K by $\sqrt{d_k}$, the square root of the dimensionality of the key vectors, is an optional scaling (this scaled variant is the one used in the Transformer). To give you an idea of the kind of dimensions used in practice, the Transformer introduced in Attention is All You Need has $d_q = d_k = d_v = 64$, whereas what I refer to as X is 512-dimensional.
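The Encoder-Decoder Attention described above fits in a few lines of NumPy. This is a single-head sketch with random, untrained matrices; the scaling factor used here is the square root of the key dimensionality, as in the original paper. Variable names such as `X_enc` and `X_dec` are mine, and the sequence lengths are arbitrary.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # one row of scores per query
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sums of the values


rng = np.random.default_rng(0)
d_model, d_k = 512, 64                  # dimensions quoted in the text

X_enc = rng.normal(size=(3, d_model))   # three Encoder outputs (keys and values come from here)
X_dec = rng.normal(size=(2, d_model))   # two Decoder positions (queries come from here)

# Random stand-ins for the learned matrices W_Q, W_K, W_V.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

Q = X_dec @ W_Q
K = X_enc @ W_K
V = X_enc @ W_V

out, weights = scaled_dot_product_attention(Q, K, V)
```

Here `weights` has one row per Decoder position and one column per Encoder position, and `out` holds a 64-dimensional attended vector for each Decoder position.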

2b. What is new: Self-Attention

In addition to the Encoder-Decoder Attention, the Transformer architecture includes the Encoder Self-Attention and the Decoder Self-Attention. These are calculated in the same dot-product manner as discussed above, with one crucial difference: for self-attention, all three types of vectors (K, V, and Q) come from the same network. This also means that all three are associated with the elements of the same sequence (input for the Encoder and output for the Decoder). The purpose of introducing self-attention is to learn the relationships between different words in the sentence (this function used to be fulfilled by the sequential RNN). One way of looking at it is as a representation of each element of the sequence as a weighted sum of the elements in the sequence. Why bother? Consider the following two phrases:

1. The animal did not cross the road because it was too tired.

2. The animal did not cross the road because it was too wide.

Clearly, "it" is most closely related to "the animal" in the first phrase and to "the road" in the second one: information that would be missing if we were to use a uni-directional forward RNN, since the disambiguating word ("tired" or "wide") comes after "it"! In fact, the Encoder Self-Attention, which is bi-directional by design, is a crucial part of BERT, the pre-trained contextual word embeddings that we shall discuss later on.

Where are the calculations for the Encoder Self-Attention carried out? Turns out, inside every Encoder layer. This permits the network to pay attention to relevant parts of the input sequence at different levels of abstraction: the values V of the lower Encoder layers will be closest to the original input tokens, whereas Self-Attention of the deeper layers will involve more abstract constructions.
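To show what "inside every Encoder layer" means in terms of data flow, here is a deliberately stripped-down sketch: two stacked self-attention layers where Q, K and V are all computed from the same sequence. I am leaving out everything else a real Encoder layer contains (multiple heads, residual connections, layer normalization, the feed-forward sublayer), and the weights are random, so this only illustrates the wiring.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_Q, W_K, W_V):
    """Q, K and V all come from the same sequence X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights


rng = np.random.default_rng(0)
n_tokens, d = 5, 8                     # toy sizes
X = rng.normal(size=(n_tokens, d))     # pre-processed input sequence

# Two layers, each with its own (random, untrained) W_Q, W_K, W_V.
layers = [
    tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    for _ in range(2)
]

H = X
for W_Q, W_K, W_V in layers:
    # The output of one layer is the input of the next one.
    H, weights = self_attention(H, W_Q, W_K, W_V)
```

Each row of `weights` tells us how much one token attends to every token in the sequence, in both directions; the second layer attends over the already-mixed outputs of the first.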

2c. Putting it all together

By now we have established that Transformers discard the sequential nature of RNNs and process the sequence elements in parallel instead. We saw how the Encoder Self-Attention allows the elements of the input sequence to be processed separately while retaining each other's context, whereas the Encoder-Decoder Attention passes all of them to the next step: generating the output sequence with the Decoder. What happens at this stage may not be so clear. As you recall, the RNN Encoder-Decoder generates the output sequence one element at a time. The previously generated output gets fed into the Decoder at the subsequent timestep. Do Transformers really find a way to free us from the sequential nature of this process and somehow generate the whole output sequence at once? Well - yes and no. More precisely, the answer is [roughly] yes when training, and no at inference time.

The Transformer architecture featuring a two-layer Encoder / Decoder. The Encoder processes all three elements of the input sequence (w1, w2, and w3) in parallel, whereas the Decoder generates each element sequentially (only timesteps 0 and 1, where the output sequence elements v1 and v2 are generated, are depicted). Output token generation continues until an end of the sentence token <EOS> appears.

The inputs to the Decoder come in two varieties: the hidden states that are outputs of the Encoder (these are used for the Encoder-Decoder Attention within each Decoder layer) and the previously generated tokens of the output sequence (for the Decoder Self-Attention, also computed at each Decoder layer). Since the output sequences are already available during the training phase, one can perform all the different timesteps of the Decoding process in parallel by masking the appropriate parts of the "previously generated" output sequences (that is, forcing the attention weights on tokens that would not have been generated yet to zero, typically by setting the corresponding scores to $-\infty$ before the softmax). This masking results in the Decoder Self-Attention being uni-directional, as opposed to the Encoder one. Finally, at inference time, the output elements are generated one by one in a sequential manner.
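The masking trick is easier to see in code. Below is a single-head sketch of masked (causal) self-attention, assuming the usual implementation in which the scores of "future" positions are set to minus infinity before the softmax so that their weights come out as exactly zero. Weights are random and the function name is my own.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def causal_self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    n = scores.shape[0]
    # True strictly above the diagonal: position i must not see positions j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)  # masked entries become exactly 0
    return weights @ V, weights


rng = np.random.default_rng(0)
n_tokens, d = 4, 8
X = rng.normal(size=(n_tokens, d))  # embedded output sequence, known during training
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, weights = causal_self_attention(X, W_Q, W_K, W_V)
```

All four rows are computed in one shot, yet row i only depends on positions up to i, which is what lets training run in parallel while still mimicking step-by-step generation.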

Some final remarks before we call it a day:

The part of the Decoder that I refer to as postprocessing in the Figure above is similar to what one would typically find in the RNN Decoder for an NLP task: a fully connected (FC) layer, which follows the RNN that extracted certain features from the network's inputs, and a softmax layer on top of the FC one that will assign to each of the tokens in the model's vocabulary the probability of being the next element in the output sequence. At that point, we could use a beam search algorithm to keep the top few predictions at each step and choose the most likely output sequence at the end, or simply keep the top choice each time.
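As a rough sketch of this postprocessing step: a fully connected layer maps the Decoder's final hidden vector to one score per vocabulary token, and a softmax turns the scores into probabilities. The sizes and random weights are placeholders, and the "beam" part only shows how the top few candidates of a single step would be picked; a full beam search would additionally keep track of the cumulative score of each partial sequence.

```python
import numpy as np


def next_token_distribution(h, W_fc, b_fc):
    """FC layer followed by a softmax over the vocabulary."""
    logits = h @ W_fc + b_fc
    e = np.exp(logits - logits.max())  # stabilized softmax
    return e / e.sum()


rng = np.random.default_rng(0)
d_model, vocab_size = 8, 20            # toy sizes
h = rng.normal(size=d_model)           # Decoder output for the current position
W_fc = rng.normal(size=(d_model, vocab_size))
b_fc = np.zeros(vocab_size)

probs = next_token_distribution(h, W_fc, b_fc)

greedy_choice = int(np.argmax(probs))  # "simply keep the top choice"

beam_width = 3                         # hypothetical beam width
top_candidates = np.argsort(probs)[::-1][:beam_width]  # ids of the most likely tokens
```

Greedy decoding is just the special case of a beam of width one: `top_candidates[0]` is the same token as `greedy_choice`.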

The Transformer architecture is the driving force behind many of the recent breakthroughs in the field of NLP. To put some hard numbers on that statement, let's turn to a metric called BLEU, commonly used to evaluate the quality of machine translations. The original Transformer achieved a score of 28.4 BLEU on an English-to-German translation task, and if that does not tell you much, suffice it to say that it was better than the existing best result by over 2 BLEU!

Next, in the coming blog post we will discuss BERT (Bidirectional Encoder Representations from Transformers): contextualized word embeddings based on the Transformer (more precisely, Transformer's Encoder), and how to train a BERT-based machine reading comprehension model on the Scaleway GPU instances.