Understanding BERT Transformer: Attention isn’t all you need

A parsing/composition framework for understanding Transformers

Why BERT matters

BERT is a recent natural language processing model that has shown groundbreaking results in many tasks such as question answering, natural language inference and paraphrase detection. Since it is openly available, it has become popular in the research community.

The following graph shows the evolution of scores for GLUE benchmark — the average of scores in various NLP evaluation tasks.

While it’s not clear that all GLUE tasks are very meaningful, generic models based on an encoder named Transformer (Open-GPT, BERT and BigBird) closed the gap between task-dedicated models and human performance in less than a year.

However, as Yoav Goldberg notes, we don’t fully understand how the Transformer encodes sentences:

[Transformers] in contrast to RNNs — relies purely on attention mechanisms, and does not have an explicit notion of word order beyond marking each word with its absolute-position embedding. This reliance on attention may lead one to expect decreased performance on syntax-sensitive tasks compared to RNN (LSTM) models that do model word order directly, and explicitly track states across the sentence.

Several articles delve into the technicalities of BERT. Here, we will try to deliver some new insights and hypotheses that could explain BERT’s strong capabilities.

A framework for language understanding: parsing/composition

The way humans are able to understand language has been a long-standing philosophical question. In the 20th century, two complementary principles shed light on this problem:

The Compositionality principle states that the meaning of word compounds is derived from the meaning of the individual words, and the manner in which those words are combined. According to this principle, the meaning of the noun phrase “carnivorous plants”, can be derived from the meaning of “carnivorous” and the meaning of “plant” through a process named composition. [Szabó 2017]

The other principle is the hierarchical structure of language. It states that through analysis, sentences can be broken down into simple structures such as clauses. Clauses can be broken down into verb phrases and noun phrases and so on.

Parsing hierarchical structures and deriving meaning from their components recursively until sentence level is reached is an appealing recipe for language understanding. Consider the sentence “Bart watched a squirrel with binoculars”. A good parsing component could yield the following parse tree:

A constituency-based parse tree of the sentence “Bart watched a squirrel with binoculars”

The meaning of the sentence could be derived from successive compositions (composing “a” with “squirrel”, then “watched” with “a squirrel”, then “watched a squirrel” with “with binoculars”) until the sentence meaning is obtained.

Vector spaces (as in word embeddings) can be used to represent words, phrases, and other constituents. Composition could be framed as a function f which would compose (“a”,”squirrel”) into a meaningful vector representation of “a squirrel” = f(“a”,”squirrel”). [Baroni 2014]
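To make this concrete, here is a minimal sketch of such a composition function f in a toy vector space. The embeddings, the dimensionality, and the projection matrix are all hypothetical and untrained; in practice f would be learned from data.

```python
import numpy as np

# Toy word embeddings (hypothetical 3-d vectors, for illustration only).
embeddings = {
    "a":        np.array([0.1, 0.0, 0.2]),
    "squirrel": np.array([0.7, 0.3, 0.1]),
}

def compose(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """A composition function f: concatenate the two constituent vectors
    and project back to embedding space with a (random, untrained) linear
    map followed by a non-linearity. A real model would learn this map."""
    rng = np.random.default_rng(0)          # fixed seed for reproducibility
    W = rng.standard_normal((3, 6))         # projection matrix (would be learned)
    concat = np.concatenate([left, right])  # shape (6,)
    return np.tanh(W @ concat)              # shape (3,): vector for "a squirrel"

phrase_vec = compose(embeddings["a"], embeddings["squirrel"])
print(phrase_vec.shape)  # (3,)
```

The choice of concatenation plus a non-linear projection is just one of the composition functions surveyed in [Baroni 2014]; addition and element-wise multiplication are simpler alternatives.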

However, composition and parsing are both hard tasks, and they need one another.

Obviously, composition relies on the result of parsing to determine what ought to be composed. But even with the right inputs, composition is a difficult problem. For example, the meaning of adjectives changes depending on the word they characterize: the color of “white wine” is actually yellow-ish, while a white cat is indeed white. This phenomenon is known as co-composition. [Pustejovsky 2017]

Representations of “white wine” and “white cat” in a two-dimensional semantic space (with color dimensions)

A broader context can also be necessary for composition. For instance, the way the words in “green light” should be composed depends on the situation. A green light can denote an authorization or an actual green light. The meaning of some idiomatic expressions requires a form of memorization rather than composition per se. Thus, performing those compositions in the vector space requires powerful nonlinear functions like a deep neural network (that can also memorize [Arpit 2017]).

Conversely, the parsing operation arguably needs composition in order to work in some cases. Consider the following parse tree of the same previous sentence “Bart watched a squirrel with binoculars”.

Another constituency-based parse tree of the sentence “Bart watched a squirrel with binoculars”

While it is syntactically valid, this parse leads to an odd interpretation of the sentence where Bart watches (with his bare eyes) a squirrel holding binoculars. However, some form of composition must be used in order to figure out that a squirrel holding binoculars is an unlikely event.

More generally, many disambiguations and integrations of background knowledge have to take place before the appropriate structures are derived. But this derivation might also be achieved with some forms of parsing and composition.

Several models have tried to put the combination of parsing and composition into practice [Socher 2013]; however, they relied on a restrictive setup with manually annotated parse trees, and they have been outperformed by much simpler models.

How BERT implements parsing/composition

We hypothesize that Transformers rely heavily on these two operations, in an innovative way: since composition needs parsing and parsing needs composition, Transformers use an iterative process, with successive parsing and composition steps, in order to resolve this interdependence. Indeed, Transformers are made of several stacked layers (also called blocks). Each block consists of an attention layer followed by a non-linear function applied to each token.
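A minimal sketch of this block structure, assuming toy dimensions and random (untrained) weights. Residual connections, layer normalization, and multiple heads are omitted for brevity; this is an illustration of the two-step structure, not BERT’s actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(X, Wq, Wk, Wv, W1, W2):
    """One simplified Transformer block: self-attention, which mixes
    information across token positions (the 'parsing' step in our framing),
    followed by a position-wise feed-forward network applied independently
    to each token (the 'composition' step)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (seq, seq) attention weights
    mixed = attn @ V                        # each token gathers from the others
    return np.maximum(0, mixed @ W1) @ W2   # per-token non-linear function

rng = np.random.default_rng(0)
seq_len, d = 5, 8
X = rng.standard_normal((seq_len, d))       # 5 token vectors of dimension 8
params = [rng.standard_normal((d, d)) for _ in range(5)]
out = transformer_block(X, *params)
print(out.shape)  # (5, 8)
```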

We will try to highlight the link between those components and the parsing/composition framework.

A transformer block, seen as successive parsing and composition steps

Attention as a parsing step

In BERT, an attention mechanism lets each token from the input sequence (e.g. sentences made of word or subword tokens) focus on any other token.

For illustration purposes, we use the visualization tool from this article to delve into the attention heads and test our hypothesis on the pre-trained BERT base uncased model. In the following illustration of an attention head, the word “it” attends to every other token and seems to focus on “street” and “animal”.

Visualization of attention values on layer 0 head #1, for the token “it”.

BERT uses 12 separate attention mechanisms for each layer. Therefore, at each layer, each token can focus on 12 distinct aspects of the other tokens. Since Transformers use many distinct attention heads (12×12 = 144 for the base BERT model), each head can focus on a different kind of constituent combination.
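The head arithmetic can be sketched as follows. The dimensions are those of BERT-base (from the original paper); `split_heads` is a hypothetical helper illustrating how the hidden state is divided among heads.

```python
import numpy as np

# BERT-base dimensions: 12 layers, 12 attention heads per layer,
# hidden size 768, so each head works in a 768 // 12 = 64-dim subspace.
n_layers, n_heads, d_model = 12, 12, 768
d_head = d_model // n_heads          # 64
total_heads = n_layers * n_heads     # 144 heads overall

def split_heads(X, n_heads):
    """Reshape (seq, d_model) activations into (n_heads, seq, d_head),
    so each head attends over its own low-dimensional view of the tokens."""
    seq, d_model = X.shape
    return X.reshape(seq, n_heads, d_model // n_heads).transpose(1, 0, 2)

X = np.zeros((10, d_model))          # a sequence of 10 tokens
heads = split_heads(X, n_heads)
print(heads.shape, total_heads)      # (12, 10, 64) 144
```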

We ignored the attention values related to the “[CLS]” and “[SEP]” tokens. We tried several sentences, and it’s hard not to overinterpret the results, so feel free to test our hypothesis on this colab notebook with different sentences. Please note that in the figures, the left sequence attends to the right sequence.

In the second layer, attention head #1 seems to form constituents based on relatedness.