Distilling BERT Models with spaCy

By Yves Peirsman, 16 August 2019

Transfer learning is one of the most impactful recent breakthroughs in Natural Language Processing. Less than a year after its release, Google's BERT and its offspring (RoBERTa, XLNet, etc.) dominate most of the NLP leaderboards. While it can be a headache to put these enormous models into production, various solutions exist to reduce their size considerably. At NLP Town we successfully applied model distillation to train spaCy's text classifier to perform almost as well as BERT on sentiment analysis of product reviews.

Recently the standard approach to Natural Language Processing has changed drastically. Whereas until one year ago, almost all NLP models were trained entirely from scratch (usually with the exception of their pre-trained word embeddings), today the safest road to success is to download a pre-trained model such as BERT and finetune it for your particular NLP task. Because these transfer-learning models have already seen a large collection of unlabelled texts, they have acquired a lot of knowledge about language: they are aware of word and sentence meaning, co-reference, syntax, and so on. Exciting as this revolution may be, models like BERT have so many parameters they are fairly slow and resource-intensive. For some NLP tasks at least, finetuning BERT feels like using a sledgehammer to crack a nut.

Sledgehammer models

Most transfer-learning models are huge. BERT's base and multilingual models are transformers with 12 layers, a hidden size of 768 and 12 self-attention heads - no less than 110 million parameters in total. BERT-large sports a whopping 340M parameters. Still, BERT is dwarfed by even more recent models, such as Facebook's XLM with 665M parameters and OpenAI's GPT-2 with 774M. It certainly looks like this evolution towards ever larger models is set to continue for a while.

General models like BERT can be finetuned for particular NLP tasks (from: Devlin et al. 2018).

Of course, language is a complex phenomenon. It's obvious that more traditional, smaller models with relatively few parameters will not be able to handle all NLP tasks you throw at them. For individual text classification or sequence labelling tasks, however, it's questionable whether all the expressive power of BERT and its peers is really needed. That's why researchers have begun investigating how we can bring down the size of these models. Three possible approaches have emerged: quantization reduces the precision of the weights in a model by encoding them in fewer bits, pruning completely removes certain parts of a model (connection weights, neurons or even full weight matrices), while in distillation the goal is to train a small model to mimic the behaviour of a larger one.

Model distillation for sentiment analysis

In one of our summer projects at NLP Town, together with our intern Simon Lepercq, we set out to investigate the effectiveness of model distillation for sentiment analysis. Like Pang, Lee and Vaithyanathan in their seminal paper, our goal was to build an NLP model that was able to distinguish between positive and negative reviews. We collected product reviews in six languages: English, Dutch, French, German, Italian and Spanish. The reviews with one or two stars we gave the label negative, and those with four or five stars we considered positive. We used 1000 examples for training, 1000 for development (early stopping) and 1000 examples for testing.
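As a rough illustration, the labelling rule above could be written as follows. The function name and the example reviews are hypothetical, and dropping three-star reviews is an assumption: the text only says how one, two, four and five stars were labelled.

```python
from typing import List, Optional, Tuple


def star_rating_to_label(stars: int) -> Optional[str]:
    """Map a 1-5 star rating to a binary sentiment label."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    # Assumption: three-star reviews are ambiguous and are left out.
    return None


# Hypothetical reviews, invented for illustration only.
reviews: List[Tuple[str, int]] = [
    ("Great read", 5),
    ("Fell apart in a week", 1),
    ("It was okay", 3),
]

# Keep only the reviews that received a label.
labelled = [
    (text, star_rating_to_label(stars))
    for text, stars in reviews
    if star_rating_to_label(stars) is not None
]
```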

The first step was to determine a baseline for our task. With an equal number of positive and negative examples in each of our data sets, a random baseline would obtain an accuracy of 50% on average. As a simple machine learning baseline, we trained a spaCy text classification model: a stacked ensemble of a bag-of-words model and a fairly simple convolutional neural network with mean pooling and attention. To this we added an output layer of one node and had the model predict positive when its output score was higher than 0.5 and negative otherwise. This baseline achieved an accuracy of between 79.5% (for Italian) and 83.4% (for French) on the test data - not bad, but not a great result either.

BERT gives an average error reduction of 45% over our simpler spaCy models.

Because of its small training set, our challenge is extremely suitable for transfer learning. Even if a test phrase such as great book is not present in the training data, BERT already knows it is similar to excellent novel, fantastic read, or another similar phrase that may very well occur in the training set. As a result, it should be able to predict the rating for an unseen review much more reliably than a simple model trained from scratch.

To finetune BERT, we adapted the BertForSequenceClassification class in the PyTorch-Transformers library for binary classification. For all six languages we finetuned BERT-multilingual-cased, the multilingual model Google currently recommends. The results confirm our expectations: with accuracies between 87.2% (for Dutch) and 91.9% (for Spanish), BERT outperforms our initial spaCy models by an impressive 8.4 percentage points on average. This means BERT nearly halves the number of errors on the test set.
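The step from an accuracy gain to an error reduction is easy to miss, so here is a small sketch of the arithmetic. The function name and the accuracies in the usage line are hypothetical round numbers, not results from these experiments.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Fraction of the baseline's errors that the new model removes.

    Assumes baseline_accuracy < 1.0, so the baseline makes at least one error.
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


# Hypothetical example: going from 80% to 90% accuracy is a gain of
# 10 percentage points, but it removes roughly half of the errors.
reduction = relative_error_reduction(0.80, 0.90)
print(f"error reduction: {reduction:.0%}")
```

Because the baseline already gets most examples right, a gain of a few percentage points in accuracy corresponds to a much larger relative drop in errors.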

Model distillation

Unfortunately, BERT is not without its drawbacks. Each of our six finetuned models takes up almost 700MB on disk and their inference times are much longer than spaCy's. That makes them hard to deploy on a device with limited resources or for many users in parallel. To address these challenges, we turn to model distillation: we have our finetuned BERT models serve as teachers and spaCy's simpler convolutional models as students that learn to mimic the teacher's behavior. We follow the model distillation approach described by Tang et al. (2019), who show it is possible to distill BERT to a simple BiLSTM and achieve results similar to an ELMo model with 100 times more parameters.

The process of model distillation.

Before we can start training our small models, however, we need more data. In order to learn and mimic BERT's behavior, our students need to see more examples than the original training sets can offer. Tang et al. therefore apply three methods for data augmentation (the creation of synthetic training data on the basis of the original training data):

mask random words in the training data. For example, I like this book now becomes I [MASK] this book.

replace other random words in the training data by another word with the same part of speech. For example, I like this book becomes I like this screen.

sample a random n-gram of length 1 to 5 from the training example.

Since the product reviews in our data set can be fairly long, we add a fourth method to the three above:

sample a random sentence from the training example.
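The four augmentation methods can be sketched roughly as follows. This is a simplified illustration, not the code used in these experiments or by Tang et al.: the function names, the probability parameter p and the vocab_by_pos lookup are all assumptions, and the input is assumed to be tokenised, part-of-speech tagged and sentence-split already.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MASK]"


def mask_words(tokens: List[str], p: float, rng: random.Random) -> List[str]:
    """Replace each token by [MASK] with probability p."""
    return [MASK if rng.random() < p else token for token in tokens]


def replace_by_pos(
    tagged: List[Tuple[str, str]],
    vocab_by_pos: Dict[str, List[str]],
    p: float,
    rng: random.Random,
) -> List[str]:
    """With probability p, swap a word for a random word with the same POS tag."""
    output = []
    for word, pos in tagged:
        candidates = vocab_by_pos.get(pos, [])
        if candidates and rng.random() < p:
            output.append(rng.choice(candidates))
        else:
            output.append(word)
    return output


def sample_ngram(tokens: List[str], rng: random.Random, max_n: int = 5) -> List[str]:
    """Return a random contiguous n-gram of length 1 to max_n (tokens must be non-empty)."""
    n = rng.randint(1, min(max_n, len(tokens)))
    start = rng.randint(0, len(tokens) - n)
    return tokens[start:start + n]


def sample_sentence(sentences: List[str], rng: random.Random) -> str:
    """Return one random sentence from a review that was split into sentences."""
    return rng.choice(sentences)


# Hypothetical usage on the example sentence from the text.
rng = random.Random(42)
tokens = ["I", "like", "this", "book"]
tagged = [("I", "PRON"), ("like", "VERB"), ("this", "DET"), ("book", "NOUN")]
vocab_by_pos = {"NOUN": ["screen", "phone"], "VERB": ["love", "hate"]}

print(mask_words(tokens, 0.25, rng))
print(replace_by_pos(tagged, vocab_by_pos, 0.25, rng))
print(sample_ngram(tokens, rng))
print(sample_sentence(["I like this book.", "The cover is ugly."], rng))
```

Each function returns a new example and leaves its input untouched, so the same review can be augmented many times with different random choices.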

These augmentation methods not only help us create a training set that is many times larger than the original one; by sampling and replacing various parts of the training data, they also inform the student model about which words or phrases have an impact on the output of its teacher. Moreover, in order to give it as much information as possible, we don't show the student the label its teacher predicted for an item, but its precise output values. In this way, the small model can learn exactly how probable the best class was, and how it compared to the other one(s). Tang et al. (2019) trained the small model with the logits of its teacher, but our experiments show that using the probabilities can also give very good results.
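To make the difference between a hard label, logits and probabilities concrete, here is a minimal sketch. The teacher logits are invented numbers, and the mean squared error is only one possible way to compare student and teacher outputs; it is used here as an assumption for illustration, not as a description of the loss in these experiments.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_squared_error(student: List[float], teacher: List[float]) -> float:
    """Average squared difference between student and teacher outputs."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


# Hypothetical teacher output for one review: [negative, positive].
teacher_logits = [-1.2, 2.3]
teacher_probs = softmax(teacher_logits)

# The hard label only says which class won ...
hard_label = teacher_probs.index(max(teacher_probs))

# ... while the soft targets also say by how much. A student that outputs
# [0.5, 0.5] is penalised for being less certain than its teacher.
student_probs = [0.5, 0.5]
loss = mean_squared_error(student_probs, teacher_probs)
```

A student trained on hard_label alone would treat a borderline review and a glowing one identically; the soft targets keep that distinction.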

Distillation results

One of the great advantages of model distillation is that it is model agnostic: the teacher model can be a black box, and the student model can have any architecture we like. To keep our experiments simple, we chose as our student the same spaCy text classifier as we did for our baselines. The training procedure, too, remained the same: we used the same batch sizes, learning rate, dropout and loss function, and stopped training when the accuracy on the development data stopped going up. We used the augmentation methods above to put together a synthetic data set of around 60,000 examples for each language. We then collected the predictions of the finetuned BERT models for this data. Together with the original training data, this became the training data for our smaller spaCy models.
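Putting the student's training data together might look roughly like this. The helper names and the stand-in teacher function are hypothetical. The "cats" dictionary mirrors the annotation shape spaCy's text classifier is trained on, but whether the experiments used exactly this shape is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, Dict[str, Dict[str, float]]]


def to_example(text: str, p_positive: float) -> Example:
    """Pair a text with a score for the single POSITIVE category."""
    return (text, {"cats": {"POSITIVE": p_positive}})


def build_distillation_set(
    original: List[Tuple[str, str]],
    synthetic_texts: List[str],
    teacher: Callable[[str], float],
) -> List[Example]:
    """Combine gold-labelled originals with teacher-scored synthetic texts."""
    gold = [
        to_example(text, 1.0 if label == "positive" else 0.0)
        for text, label in original
    ]
    soft = [to_example(text, teacher(text)) for text in synthetic_texts]
    return gold + soft


def fake_teacher(text: str) -> float:
    """Stand-in for a finetuned BERT model; returns P(positive). Purely illustrative."""
    return 0.9 if "like" in text else 0.2


# Hypothetical data: one original review and two augmented variants.
original = [("I like this book", "positive")]
synthetic = ["I [MASK] this book", "I like this screen"]
train_data = build_distillation_set(original, synthetic, fake_teacher)
```

The teacher is only ever called as a function from text to score, which is what makes the approach model agnostic: any classifier that returns a probability could take its place.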

The distilled spaCy models perform almost as well as the original BERT models.

Despite this simple setup, the distilled spaCy models outperformed our initial spaCy baselines by a clear margin. On average, they gave an improvement in accuracy of 7.3 percentage points (roughly 1 point below the BERT models) and an error reduction of 39%. Their performance demonstrates that for a particular task such as sentiment analysis, we don't need all the expressive power that BERT offers. It is perfectly possible to train a model that performs almost as well as BERT, but with far fewer parameters.

Conclusion

With the growing popularity of large transfer-learning models, putting NLP solutions into production is becoming more challenging. Approaches like model distillation, however, show that for many tasks you don't need hundreds of millions of parameters to achieve high accuracies. Our experiments with sentiment analysis in six languages demonstrate it is possible to train spaCy's convolutional neural network to rival much more complex model architectures such as BERT's. In the future, we hope to investigate model distillation in more detail at NLP Town. For example, we aim to find out what data augmentation methods are most effective, or how much synthetic data we need to train a smaller model.