The heavy BERT

BERT became an essential ingredient of many NLP deep learning pipelines. It is considered a milestone in NLP, as ResNet is in the computer vision field.

The only problem with BERT is its size.

BERT-base contains 110M parameters. The larger variant, BERT-large, contains 340M parameters. It’s hard to deploy a model of this size into many environments with limited resources, such as mobile or embedded systems.

Training and inference times are tremendous.

Training of BERT-base was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total), and training of BERT-large was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.

Training time is usually not an issue for an end user, since it’s mostly a one-time investment (actually a many-times investment, because you’ll probably have to retrain a model several times before you get a satisfactory result). Nevertheless, it’s worth improving the training speed if you can. The faster you iterate, the sooner you’ll solve your problem.

BERT inference times vary depending on the model and the available hardware, but in many cases they significantly limit how much data you can process, how fast, and at what cost. For some real-time applications, this can be prohibitive.

Optimizing neural networks

This set of problems is not new to neural networks. Other domains (say, computer vision) faced the same issues earlier, and several approaches to compress and speed up NN models have been developed.

These approaches can be roughly divided into several groups:

1. Architecture improvements: change the architecture to a faster one (say, replace an RNN with a Transformer or a CNN, use layers that require fewer computations, and so on) or apply more clever optimization (learning rate and schedule, number of warmup steps, larger batch size, etc.).
2. Model compression: usually done using quantization and/or pruning, reducing the total amount of computation while keeping the architecture unchanged (OK, mostly unchanged).
3. Model distillation: train a smaller model that will replicate the behavior of the original model.

Let’s look at what can be done with BERT regarding these approaches.

1. Architecture and optimization improvements

Large-scale distributed training

The first (or even zeroth) thing to speed up BERT training is to distribute it on a larger cluster. While the original BERT was already trained using several machines, there are some optimized solutions for distributed training of BERT (e.g. from Alibaba or NVIDIA).

A recent record was set by NVIDIA that trained BERT-large in 53 minutes using (a very expensive) NVIDIA DGX SuperPOD with 92 DGX-2H nodes with a total of 1,472 V100 GPUs (which theoretically can deliver up to 190 PFLOPS).

Another example of more clever optimization (again combined with super-powerful hardware) is LAMB, a new layerwise adaptive large-batch optimization technique that reduced BERT training time from 3 days to just 76 minutes on a (very expensive as well) TPUv3 Pod (1,024 TPUv3 chips that can provide more than 100 PFLOPS of mixed-precision performance).

Architectures

The stacking algorithm

Regarding solutions that rely more on architecture and less on hardware, there is a progressive stacking method for training BERT. It is based on an observation about the behavior of self-attention layers: their attention distributions concentrate locally around each token’s own position and the start-of-sentence token, and the attention distribution of a shallow model is similar to that of a deep one. Motivated by this, the authors proposed a stacking algorithm to transfer knowledge from a shallow model to a deep model, and applied stacking progressively to accelerate BERT training. They achieved a training time about 25% shorter than the original BERT’s, mainly because, for the same number of steps, training a small model needs less computation.
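A minimal sketch of the stacking step, assuming `layers` is a Python list of trained encoder layer modules (the helper name `stack` is just for illustration):

```python
import copy

def stack(layers):
    # Double the depth by copying the trained shallow stack on top of itself,
    # so the deeper model starts from the shallow model's attention patterns.
    return layers + [copy.deepcopy(layer) for layer in layers]

# Conceptual schedule: train L layers, stack to 2L layers, keep training, repeat.
```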

Other architectural improvements reducing the total amount of memory and/or computation are sparse factorizations of the attention matrix (aka Sparse Transformer by OpenAI) and block attention.

ALBERT

And finally, there is a possible architectural descendant of BERT called ALBERT (A Lite BERT), submitted to the ICLR 2020 conference.

ALBERT incorporates two parameter reduction techniques.

The first one is a factorized embedding parameterization, separating the size of the hidden layers from the size of the vocabulary embedding. This separation makes it easier to grow the hidden size without significantly increasing the parameter count of the vocabulary embeddings.
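A quick back-of-the-envelope comparison of the embedding parameter counts (the sizes below are illustrative assumptions, roughly in the BERT-large/ALBERT range):

```python
# V = vocab size, H = hidden size, E = small embedding size
V, H, E = 30_000, 1_024, 128

bert_style   = V * H          # ~30.7M parameters: vocab mapped directly to the hidden size
albert_style = V * E + E * H  # ~4.0M parameters: vocab -> small embedding E -> projection to H
```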

The second technique is cross-layer parameter sharing. This technique prevents the number of parameters from growing with the depth of the network.
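A minimal PyTorch-style sketch of the sharing idea (not ALBERT’s actual implementation; `layer` stands for any transformer encoder block):

```python
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, layer: nn.Module, num_layers: int):
        super().__init__()
        self.layer = layer            # a single set of weights...
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)         # ...reused at every depth
        return x
```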

Both techniques significantly reduce the number of parameters for BERT without seriously hurting performance, thus improving parameter-efficiency.

An ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster.

It even outperforms heavily tuned RoBERTa!

2. Quantization and pruning

Quantization decreases the numerical precision of a model’s weights.

Typically, models are trained using FP32 (32-bit floating point); they can then be quantized into FP16 (16-bit floating point), INT8 (8-bit integer), or even further to INT4 or INT1, reducing the model size 2x, 4x, 8x, or 32x respectively. This is called post-training quantization.
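To make the idea concrete, here is a minimal sketch of symmetric post-training quantization of a single weight tensor to INT8 (an illustration of the principle, not what production toolkits do exactly):

```python
import numpy as np

def quantize_int8(w):
    # Store int8 weights plus one FP32 scale per tensor: roughly 4x smaller than FP32.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction of the original FP32 weights.
    return q.astype(np.float32) * scale
```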

Another (harder and less mature) option is quantization-aware training. FP16 training is becoming a commodity now. ICLR 2020 has an interesting submission reporting state-of-the-art training results using an 8-bit floating-point representation across ResNet, GNMT, and the Transformer.

Pruning removes some (unimportant or less important) weights (or sometimes whole neurons) from the model, producing sparse weight matrices (or smaller layers). There is also research on removing entire matrices corresponding to the attention heads of a Transformer.
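A minimal sketch of simple magnitude pruning on one weight matrix (again, just an illustration of the principle):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    # Zero out the smallest-magnitude weights so that a `sparsity` fraction of the matrix is zero.
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)
```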

Quantization can be performed using TensorFlow Lite, the part of TensorFlow for on-device inference. TensorFlow Lite provides the tools to convert and run TensorFlow models on mobile, embedded, and IoT devices. It supports both post-training quantization and quantization-aware training.
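A minimal sketch of post-training quantization with the TFLite converter, assuming `saved_model_dir` points to an exported TensorFlow SavedModel:

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```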

Another option is to use the TensorRT framework from NVIDIA. NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for deep learning inference applications.

Recently NVIDIA announced TensorRT 6 with new optimizations that deliver inference for BERT-large in only 5.8 ms on T4 GPUs and 4.2 ms on V100. For the Titan RTX it should be faster; a rough estimate using the peak performance of these cards (you can find the numbers here) gives a 2x speedup, but in reality it’ll probably be smaller.

5.84 ms for a 340M-parameter BERT-large model and 2.07 ms for a 110M-parameter BERT-base with a batch size of one are cool numbers. With a larger batch size of 128, you can process up to 250 sentences/sec using BERT-large.

More numbers can be found here.

PyTorch recently announced quantization support, starting with version 1.3. It is experimental right now, but you can already start using it thanks to the tutorial in which dynamic quantization is applied to an LSTM language model, converting the model weights to INT8.
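For example, dynamic quantization of a model’s linear layers is essentially a one-liner (a minimal sketch, assuming `model` is an already-loaded PyTorch model such as a BERT variant):

```python
import torch

quantized_model = torch.quantization.quantize_dynamic(
    model,                 # the original FP32 model
    {torch.nn.Linear},     # layer types whose weights get quantized
    dtype=torch.qint8,     # target 8-bit integer weights
)
```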

There is a well-known quantization of BERT called Q-BERT (from the “Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT” paper). The authors achieve performance comparable to the baseline with at most 2.3% degradation, even with ultra-low-precision quantization down to 2 bits, corresponding to up to 13× compression of the model parameters and up to 4× compression of the embedding table as well as activations.

3. Distillation

Another interesting model compression method is distillation — a technique that transfers the knowledge of a large “teacher” network to a smaller “student” network. The “student” network is trained to mimic the behaviors of the “teacher” network.

A version of this strategy has already been pioneered by Rich Caruana and his collaborators. In their important paper, they demonstrate convincingly that the knowledge acquired by a large ensemble of models can be transferred to a single small model.

Geoffrey Hinton et al. showed this technique can be applied to neural networks in their paper called “Distilling the Knowledge in a Neural Network”.
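In its classic form, the student is trained to match the teacher’s temperature-softened output distribution. A minimal sketch of such a soft-target loss (a generic illustration, not DistilBERT’s exact training objective):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soften both distributions with temperature T and push the student toward
    # the teacher's soft targets via KL divergence (Hinton et al.-style).
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
```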

DistilBERT

Since then, this approach has been applied to different neural networks, and you have probably heard of a BERT distillation called DistilBERT by HuggingFace.

Finally, on October 2nd, a paper on DistilBERT called “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter” appeared and was submitted to NeurIPS 2019.

DistilBERT is a smaller language model trained with supervision from BERT, in which the authors removed the token-type embeddings and the pooler (used for the next-sentence classification task) and kept the rest of the architecture identical while reducing the number of layers by a factor of two.

You can use DistilBERT off the shelf with the help of the transformers Python package by HuggingFace (formerly known as pytorch-transformers and pytorch-pretrained-bert). Version 2.0.0 of the package supports TensorFlow 2.0/PyTorch interoperability.
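A minimal usage sketch following the package’s quickstart pattern (the `distilbert-base-uncased` checkpoint is the one published by HuggingFace):

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

input_ids = torch.tensor([tokenizer.encode("DistilBERT is smaller and faster than BERT.")])
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # shape: (batch_size, seq_len, hidden_size)
```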

DistilBERT authors also used a few training tricks from the recent RoBERTa paper which showed that the way BERT is trained is crucial for its final performance.

DistilBERT compares surprisingly well to BERT: the authors were able to retain more than 95% of the performance while having 40% fewer parameters.