The headlines said that AIs developed by Microsoft and Alibaba had beaten humans at reading. While not entirely accurate, models developed by these companies were indeed able to beat humans on one metric of one reading task. This post offers the intuition behind how Microsoft’s model, the R-Net, made this happen.

First, the problem…

Given a passage P

“Tesla was born on 10 July [O.S. 28 June] 1856 into a Serb family in the village of Smiljan, Austrian Empire (modern-day Croatia). His father, Milutin Tesla, was a Serbian Orthodox priest. Tesla’s mother, Đuka Tesla (née Mandić), whose father was also an Orthodox priest,:10 had a talent for making home craft tools, mechanical appliances, and the ability to memorize Serbian epic poems. Đuka had never received a formal education. Nikola credited his eidetic memory and creative abilities to his mother’s genetics and influence. Tesla’s progenitors were from western Serbia, near Montenegro.:12”

And a question Q

“What were Tesla’s mother’s special abilities?”

Provide a continuous ‘span’ of text as the answer A

“making home craft tools, mechanical appliances, and the ability to memorize Serbian epic poems”

The Stanford Question Answering Dataset (aka SQuAD) contains ~500 articles and around 100k question-answer pairs (the example above is taken from the dataset itself).
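To make the setup concrete, here is roughly what a single SQuAD example looks like as a plain Python dict. The field names mirror the dataset’s JSON layout, but the character offset below is illustrative, not copied from the actual file:

```python
# One SQuAD example: a passage, a question, and an answer that is a
# literal span of the passage, located by its character offset.
example = {
    "context": "Tesla was born on 10 July [O.S. 28 June] 1856 ...",
    "question": "What were Tesla's mother's special abilities?",
    "answers": [{
        "text": "making home craft tools, mechanical appliances, "
                "and the ability to memorize Serbian epic poems",
        "answer_start": 260,  # illustrative offset, not the real one
    }],
}
```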

Before we get to Microsoft’s method for reading comprehension, let’s briefly go over two concepts that are used heavily in their paper:

RNNs are a special kind of neural network, used to analyze temporal (or sequential) data. While standard feedforward neural networks have no concept of memory, RNNs achieve it by maintaining a context vector that is ‘fed back’ at every step:

A typical RNN

Essentially, the output at any time step t is a function of the past context and the current input.
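As a minimal sketch of that idea (in PyTorch, with made-up dimensions), a vanilla RNN step could look like this:

```python
import torch

hidden, embed = 64, 50
W_h = torch.randn(hidden, hidden) * 0.1   # context-to-context weights
W_x = torch.randn(embed, hidden) * 0.1    # input-to-context weights
b = torch.zeros(hidden)

def rnn_step(h_prev, x_t):
    # New context = non-linear mix of the old context and the current input.
    return torch.tanh(h_prev @ W_h + x_t @ W_x + b)

h = torch.zeros(hidden)               # empty context to start with
for x_t in torch.randn(10, embed):    # a toy sequence of 10 input vectors
    h = rnn_step(h, x_t)              # the context is 'fed back' at each step
```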

A special kind of RNN is the Bi-directional RNN (BiRNN). While a standard RNN captures historical context by ‘remembering’ past data, a BiRNN also traverses the sequence in the reverse direction, to pick up context from the future:

BiRNN

It is important to note that while RNNs can theoretically remember any length of history, they are usually much better at incorporating short-term context than long-term information (more than 20–30 steps away).
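In code, a BiRNN is simply two RNNs run in opposite directions, with their per-token outputs concatenated. Here is a sketch using PyTorch’s built-in GRU (the dimensions are arbitrary):

```python
import torch
import torch.nn as nn

# A bidirectional GRU: one pass left-to-right, one right-to-left, with
# the two hidden states concatenated for every token.
birnn = nn.GRU(input_size=50, hidden_size=64,
               bidirectional=True, batch_first=True)

tokens = torch.randn(1, 12, 50)    # (batch, sequence_length, embedding_dim)
outputs, _ = birnn(tokens)         # (1, 12, 128): 64 dims per direction
```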

**R-Net mainly utilizes RNNs (more specifically, Gated Recurrent Units) to simulate the action of ‘reading’ a passage of text.**

Attention in Neural Networks is modeled after the way humans focus on a particular subset of their sensory input, and tune out the rest.

It is employed in applications where you have a collection of data points, not all of which are pertinent to the task at hand. In such cases, attention is computed as a softmax-weighted average of all points in the collection. The weights themselves are computed as some non-linear function of 1) the vectors in the collection and 2) some context.
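As a minimal sketch, here is that computation with additive (tanh-based) scoring, which is also the form the R-Net paper builds on; all names and dimensions below are illustrative:

```python
import torch
import torch.nn.functional as F

def attend(context, vectors, W_c, W_v, v):
    """Additive attention: score each vector against the context, softmax
    the scores, and return the weighted average of the collection."""
    scores = torch.tanh(context @ W_c + vectors @ W_v) @ v   # one score per vector
    weights = F.softmax(scores, dim=0)                       # weights sum to 1
    return weights @ vectors                                 # softmax-weighted average

# Toy usage: 10 candidate vectors of size 64, scored in a space of size 32.
d, k, n = 64, 32, 10
W_c, W_v, v = torch.randn(d, k), torch.randn(d, k), torch.randn(k)
summary = attend(torch.randn(d), torch.randn(n, d), W_c, W_v, v)   # shape: (64,)
```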

Under the context “frisbee”, the network will focus on the actual frisbee and objects dealing with it, and tune out the rest.

**R-Net utilizes Attention to highlight some part of the text, under the context of another.**

The R-Net

At an intuitive level, R-Net performs reading comprehension in a way that is pretty similar to how you or I would: by ‘reading’ (applying an RNN to) the text multiple times (3, to be exact), and ‘fine-tuning’ (using Attention) the vector representations of the terms, making them better with each iteration.

Let’s understand each pass of reading individually…

First reading: Cursory glance

We start off with standard token (aka word or term) vectors, using word embeddings from GloVe. However, humans usually understand the exact meaning of a word from the context of the terms surrounding it.

Consider the examples “may happen” & “the fourth of May”, where the meaning of ‘may’ depends on the surrounding terms. Note also that this context can come from either the forward or the backward direction. Hence, we run a BiRNN over the standard word embeddings to come up with better vectors.

This process is applied to both the question & passage.
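Here is a sketch of this first pass, with a randomly initialized embedding table standing in for the GloVe lookup (the sizes are illustrative; the paper uses its own dimensions and also mixes in character-level embeddings):

```python
import torch
import torch.nn as nn

# First reading pass: contextualize pretrained word vectors with a BiGRU.
# `embed` stands in for a GloVe lookup table.
embed = nn.Embedding(num_embeddings=50_000, embedding_dim=100)
encoder = nn.GRU(input_size=100, hidden_size=75,
                 bidirectional=True, batch_first=True)

passage_ids = torch.randint(0, 50_000, (1, 120))    # toy token ids
question_ids = torch.randint(0, 50_000, (1, 12))

u_p, _ = encoder(embed(passage_ids))    # (1, 120, 150): contextual passage vectors
u_q, _ = encoder(embed(question_ids))   # (1, 12, 150): contextual question vectors
```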

Second reading: Question-based analysis

In the second pass, the network tunes the word representations from the passage in the context of the question itself.

Let’s assume you are at the highlighted location in the passage:

“…had a talent for making home craft tools, mechanical appliances, and the ability to memorize Serbian epic poems. Đuka had never received a formal education…”

Given “making”, if you were to apply Attention over the question tokens, you would probably highlight:

“What were Tesla’s mother’s special abilities?”

In a similar vein, the network adjusts the vector for “making” to get it closer to “abilities” in a semantic sense.

This is done for every token in the passage. In essence, R-Net is forming links between the needs of the question and the relevant parts of the passage. The paper calls this part “Gated Attention-based Recurrent Networks”.
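A condensed sketch of the mechanism (unbatched, with simplified dimensions): `u_p` and `u_q` stand for the contextual passage and question vectors from the first pass, and the sigmoid gate, the ‘gated’ part of the name, scales how much of the combined signal enters the recurrent state.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttnRNN(nn.Module):
    """Sketch of a gated attention-based RNN: every passage token attends
    over the question before being re-read by a recurrent cell."""
    def __init__(self, d):
        super().__init__()
        self.score = nn.Linear(3 * d, 1)   # scores [question_j ; passage_t ; state]
        self.gate = nn.Linear(2 * d, 2 * d)
        self.cell = nn.GRUCell(2 * d, d)

    def forward(self, u_p, u_q):            # u_p: (m, d) passage, u_q: (n, d) question
        n, d = u_q.shape
        v = torch.zeros(1, d)                # recurrent state
        out = []
        for t in range(u_p.size(0)):
            # Attend over the question, conditioned on this token and the state.
            ctx = torch.cat([u_p[t], v[0]]).expand(n, 2 * d)
            a = F.softmax(self.score(torch.cat([u_q, ctx], dim=1)).squeeze(1), dim=0)
            c = a @ u_q                           # attention-pooled question vector
            x = torch.cat([u_p[t], c])
            x = torch.sigmoid(self.gate(x)) * x   # the gate: how much gets through
            v = self.cell(x.unsqueeze(0), v)
            out.append(v[0])
        return torch.stack(out)                   # question-aware passage vectors
```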

Third reading: Self-aware, complete passage understanding

In the first pass, we understood tokens in the context of their nearby surrounding terms.

In the second, we improved our understanding with respect to the question at hand.

We are now in a position to get a bird’s eye view of the entire passage, to pinpoint those sections that actually help in answering the question. For this, having a short-term contextual view of surrounding terms is not enough. Consider the highlighted terms here:

Tesla’s mother, Đuka Tesla (née Mandić), whose father was also an Orthodox priest,:10 had a talent for making home craft tools, mechanical appliances, and the ability to memorize Serbian epic poems. Đuka had never received a formal education. Nikola credited his eidetic memory and creative abilities to his mother’s genetics and influence.

Both refer to abilities possessed by Tesla’s mother. However, while the former occurs amid text that describes those abilities (what we want), the latter links them to Tesla’s own talents (what we don’t want).

To be able to pinpoint the right start and end locations of the answer (which we will get to in the next step), we need to compare different similar-meaning terms across the passage to gauge what is unique about each. This is difficult with vanilla RNNs, since the two highlighted terms are pretty far apart.

To resolve this, R-Net uses what the paper calls ‘Self-Matching Attention’.

Why ‘Self-Matching’?

While applying Attention, we usually use some data (like a passage term) to weight a set of vectors (like the question terms). In this iteration, however, we use the current passage term to weight tokens from the passage itself. This helps us differentiate the current term from similar-meaning terms in the rest of the passage. To reinforce this further, this phase of reading is accomplished using a BiRNN.

This step of using self-matching attention, in my opinion, is the ‘magic’ behind R-Net: using Attention to compare far-off terms within the same passage.
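A sketch of one step of the idea, assuming `v_p` holds the question-aware passage vectors from the previous pass and `score` is a learned linear scorer over concatenated pairs (both names are mine, not the paper’s):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_match_step(v_p, t, score):
    """Attention where the collection *and* the context both come from the
    passage: token t is scored against every passage token, however far away."""
    pairs = torch.cat([v_p, v_p[t].expand_as(v_p)], dim=1)   # (m, 2d) pairs
    a = F.softmax(score(pairs).squeeze(1), dim=0)            # weights over passage
    return a @ v_p                                           # global summary for t

# Toy usage: 120 passage vectors of size 150.
v_p, score = torch.randn(120, 150), nn.Linear(300, 1)
summary = self_match_step(v_p, 5, score)   # shape: (150,)
```

Each token, paired with its passage-wide summary, is then re-read by a BiRNN, much like in the gated pass above.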

Final step: Marking the answer

In the final step, R-Net uses a variant of Pointer Networks to figure out where the start & end points of the answer lie. Simply put:

We start by computing another Attention vector over the question text. This is used as the ‘starting context’ for this step. Using this context, a set of weights (one for each term in the passage) is computed for the starting index. The term that gets the highest weight is considered the ‘starting point’ of the answer.

Apart from the weights, this two-step RNN also returns a new context, one that encodes information about the start of the answer.

The above step is then repeated, this time with the new context instead of the question-based one, to compute the weights for the end point of the answer.
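A condensed sketch of this boundary prediction (unbatched; `h_p` stands for the final passage representations and `q_ctx` for the attention-pooled question vector, both names mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPointer(nn.Module):
    """Sketch of the pointer step: a question-derived context picks the start
    token, a GRU cell updates the context, and the new context picks the end."""
    def __init__(self, d):
        super().__init__()
        self.score = nn.Linear(2 * d, 1)
        self.cell = nn.GRUCell(d, d)

    def point(self, h_p, ctx):               # h_p: (m, d), ctx: (d,)
        pairs = torch.cat([h_p, ctx.expand_as(h_p)], dim=1)
        return F.softmax(self.score(pairs).squeeze(1), dim=0)   # weight per token

    def forward(self, h_p, q_ctx):
        p_start = self.point(h_p, q_ctx)     # weights for the start index
        summary = p_start @ h_p              # attention-pooled 'start' evidence
        new_ctx = self.cell(summary.unsqueeze(0), q_ctx.unsqueeze(0))[0]
        p_end = self.point(h_p, new_ctx)     # weights for the end index
        return p_start.argmax(), p_end.argmax()
```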

And voila! We have a solution! (In fact, the answer shown in our example above is what R-Net actually came up with.)