BERT architecture

BERT uses the encoder part of the Transformer network architecture introduced in the paper “Attention Is All You Need”. The advantage of this architecture is that it handles relationships between distant words better than recurrent networks (LSTM / GRU).

On the other hand, the network cannot process sequences of arbitrary length; instead, it has a fixed maximum input size that cannot be too large (about 300–500 tokens in most networks), beyond which the network is unable to learn anything significant in a reasonable amount of training time. This limit can be circumvented by reintroducing recurrence on top of the Transformer, as in Transformer-XL, but we won’t be covering that in this article.

Google researchers have released several pre-trained versions of their model. To simplify the explanations that follow, we’ll use the lowercase-only, single-language model with 12 layers, a hidden dimension of 768, and 12 attention heads (BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters).

Don’t worry if you don’t know what all this means—we’ll explain more thoroughly later. The purpose of this article is to present you with a high-level view of the BERT architecture.
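For reference, here is one way to load this exact checkpoint and confirm those numbers. This sketch assumes the Hugging Face transformers library, which is not part of Google’s original release; any BERT implementation exposing the same checkpoint would do:

```python
# A minimal sketch, assuming the Hugging Face `transformers` package
# is installed.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.config.num_hidden_layers)    # 12 layers
print(model.config.hidden_size)          # 768-dimensional hidden states
print(model.config.num_attention_heads)  # 12 attention heads
print(sum(p.numel() for p in model.parameters()))  # roughly 110M parameters
```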

As mentioned above, BERT is made up of 12 layers, so let’s see what these layers are made of, based on the Transformer architecture. In order, we have the following sublayers (a minimal sketch in code follows the list):

The attention block

The normalization layer

The feed forward layer

The normalization layer
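Each sublayer is wrapped in a residual (skip) connection, as in the original Transformer paper. A minimal PyTorch sketch of one such layer, with BERT-Base dimensions, might look like this (an illustrative simplification, not Google’s implementation):

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One BERT layer: attention -> add & norm -> feed forward -> add & norm."""

    def __init__(self, hidden=768, heads=12, ff=3072):
        super().__init__()
        self.attention = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden, ff), nn.GELU(), nn.Linear(ff, hidden)
        )
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):
        # Attention block, residual connection, then normalization
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward block, residual connection, then normalization
        return self.norm2(x + self.feed_forward(x))
```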

Embedding

The first step is the embedding, which transforms our words into vectors. The input to BERT is the sum of three embeddings: the token embedding, the position embedding, and the segment embedding.
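Schematically, this sum can be written as follows (the token ids are made-up example values; the 512 maximum length and the two segment ids match the BERT-Base setup):

```python
import torch
import torch.nn as nn

vocab_size, max_len, n_segments, hidden = 30000, 512, 2, 768
token_emb = nn.Embedding(vocab_size, hidden)
position_emb = nn.Embedding(max_len, hidden)
segment_emb = nn.Embedding(n_segments, hidden)

token_ids = torch.tensor([[101, 7592, 2088, 102]])        # made-up example ids
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # 0, 1, 2, ...
segment_ids = torch.zeros_like(token_ids)                 # all "sentence A" here

# The BERT input is the element-wise sum of the three embeddings.
x = token_emb(token_ids) + position_emb(positions) + segment_emb(segment_ids)
print(x.shape)  # torch.Size([1, 4, 768])
```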

Token Embedding

The token embedding provides information on the content of the text. The first thing to do is to transform our text into vectors. To avoid an overly large vocabulary and to be able to deal with new words, BERT uses a token system. A token is one or more letters; we’ll see some examples later. To build the tokens that will break down the input words, we can use the byte pair encoding (BPE) algorithm. BERT actually uses a different algorithm, WordPiece, but the principle remains similar.

To start, we find the most frequent pair of adjacent symbols in our word corpus and replace it with a new token; repeated merges can build up multi-letter tokens, so for this example let’s say the merges eventually produce “Lynch” as a single token, which we’ll name X. We then replace all occurrences of “Lynch” with X, and we repeat until we reach the vocabulary size that suits us (30,000 tokens for BERT). In addition, we always keep the individual letters as tokens, which allows us to handle words never encountered before.
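The merge loop itself fits in a few lines of plain Python. This is a simplified sketch of BPE as described above, not BERT’s actual WordPiece code (WordPiece scores candidate merges differently):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (simplified sketch)."""
    corpus = [list(w) for w in words]  # start from single letters
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols in the corpus
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the pair with the merged token
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges

# Toy usage: repeated merges eventually fuse "lynch" into one token
print(learn_bpe(["lynch", "lynch", "lunch"], num_merges=4))
```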

Once we have our token list, we can deterministically convert our text into tokens and vice versa. To do this, we replace occurrences of the different tokens, from the longest to the shortest. In the worst case, we represent a word by each of its letters. Each token can now be encoded as a vector of size 30,000 that is all zeros except for a single one at the token’s index; that’s called one-hot encoding.
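Encoding a word then amounts to greedily matching the longest known token at each position, with single letters as the fallback. A sketch of that idea, using a toy vocabulary rather than BERT’s real one:

```python
import numpy as np

def tokenize(word, vocab):
    """Greedily match the longest known token; single letters are the fallback."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest substring first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:  # worst case: an unknown character stands alone
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"lyn", "ch", "l", "y", "n", "c", "h"}  # toy vocabulary
tokens = tokenize("lynch", vocab)               # ['lyn', 'ch']

# One-hot encoding: all zeros except a single 1 at each token's index
token_to_id = {t: k for k, t in enumerate(sorted(vocab))}
one_hot = np.zeros((len(tokens), len(vocab)))
for row, t in enumerate(tokens):
    one_hot[row, token_to_id[t]] = 1.0
```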

Masked Language Model (MLM)

BERT learns by masking 15% of the WordPiece tokens. Of those, 80% are replaced with a [MASK] token, 10% with random tokens, and the remaining 10% keep the original word. The loss then measures how well the model predicts the masked-out words.
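A sketch of that masking procedure (a hypothetical helper; the id of the [MASK] token is passed in, and the real implementation works on whole batches):

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id):
    """Apply BERT's 15% / 80-10-10 masking; return inputs and targets."""
    inputs, targets = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < 0.15:       # select 15% of the tokens
            targets.append((i, tok))     # the model must predict these
            r = random.random()
            if r < 0.8:                  # 80% of those: replace with [MASK]
                inputs[i] = mask_id
            elif r < 0.9:                # 10%: replace with a random token
                inputs[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return inputs, targets
```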

The MLM task forces the model to encapsulate a significant part of NLP, including aspects of syntax and semantics. To build such a model, we add a softmax layer that converts the hidden vector of the last Transformer layer into a probability distribution over the vocabulary.
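A minimal sketch of that prediction head follows; the real BERT head adds an extra dense transform and ties its weights to the token embedding, details we skip here:

```python
import torch
import torch.nn as nn

hidden, vocab_size = 768, 30000
mlm_head = nn.Linear(hidden, vocab_size)

# `last_hidden` stands in for the final Transformer layer's output
last_hidden = torch.randn(1, 4, hidden)           # (batch, seq, hidden)
probs = torch.softmax(mlm_head(last_hidden), -1)  # (batch, seq, vocab)
# probs[0, i] is a distribution over the 30,000 vocabulary tokens for
# position i; the loss compares it with the true id of the masked token.
```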