An attempt to understand features and patterns learnt by a Fine-tuned BERT model

Photo by Katarzyna Pe on Unsplash

Note: This content was part of my talk at Analytics Vidhya’s DataHack Summit 2019.

There is a lot of buzz around NLP lately, especially after the advances in transfer learning techniques and the advent of architectures like the Transformer. As someone on the applied side of machine learning, I feel it is not only important to have models that surpass state-of-the-art results on many benchmarks; it is also important to have models that are trustworthy, understandable, and not a complete black box.

This post is an attempt to understand what BERT learns from task-specific training. Let’s start with how attention is implemented in a Transformer and how it can be leveraged for understanding the model (feel free to skip this section if you are already familiar with it).

Attention! Attention!

Transformers use self-attention to encode the representation of the input sequence at each layer. With self-attention, all the words in the input sequence contribute to the representation (encoding) of the current token.

Let’s consider this example from Jay Alammar’s blog (I would highly recommend reading his post for a deeper understanding of Transformers). Here you can see that the representation of the word “Thinking” (Z1) is formed with contributions from the other words in the sentence (in this case, “Machines”). The strength of each word’s contribution to the current word is determined by the attention scores (softmax scores). It is as if each word gives a part of itself to form the full representation of the current word.
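The mechanics described above can be sketched in a few lines of NumPy. This is an illustrative toy with random weights, not BERT’s actual parameters; the two rows of X stand in for the tokens “Thinking” and “Machines”:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token embeddings X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V, weights                              # Z = weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))                    # two tokens, embedding dim 4
Wq, Wk, Wv = (rng.normal(size=(4, 3)) for _ in range(3))
Z, A = self_attention(X, Wq, Wk, Wv)
# A[0] holds how much the first token ("Thinking", i.e. Z1) draws from each
# token in the sequence; each row of A sums to 1.
```

Each row of the attention matrix A is exactly the set of “contribution strengths” discussed above.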

This strength can be interpreted as the semantic association between the words in the sentence and the current word. For example, the word “it” in the visualization below of an attention layer in a Transformer has a high contribution from the words “The animal”. This can be read as coreference resolution of the word “it”. This behaviour is what gives Transformers their contextual representations/encodings.

Inferring association between tokens using attention. source: http://jalammar.github.io/illustrated-transformer/

These contribution strengths (attention scores) can be leveraged to understand the associations between tokens, and thereby the learnings of the Transformer. This is exactly what we are going to attempt in this post: understanding the task-specific features learned by the model.

Task-specific features:

The paper What Does BERT Look At? (Clark et al., 2019), published earlier this year, discusses the various linguistic and coreference patterns that a BERT model learns on its own, illustrating how syntax-sensitive behaviour can emerge from self-supervised training alone. This made me curious to try a similar study on the task-specific features that BERT learns after fine-tuning on a task.

Example of Aspect based sentiment analysis — Source: https://medium.com/seek-blog/your-guide-to-sentiment-analysis-344d43d225a7

The task at hand:

The fine-tuning task we will use here is aspect-based sentiment analysis, framed as a question answering / multi-class classification problem. This approach is inspired by Sun et al. (2019). By converting the sentiment dataset into question-answer pairs (as shown below), the authors were able to achieve state-of-the-art results on the SemEval dataset.
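The conversion into sentence pairs can be sketched as follows. The question template and aspect list here are illustrative (the paper explores several template variants); each review sentence yields one (question, sentence) pair per aspect, labelled “none” when the aspect is not mentioned:

```python
# Sketch: turning (sentence, aspect -> polarity) data into BERT sentence pairs.
# Aspect names follow the SemEval restaurant categories; the template is one
# possible choice in the spirit of Sun et al. (2019).
ASPECTS = ["food", "service", "ambience", "price", "anecdotes/miscellaneous"]

def make_qa_pairs(sentence, labelled_aspects):
    """Yield (question, sentence, label) for every aspect category."""
    pairs = []
    for aspect in ASPECTS:
        question = f"what do you think of the {aspect} of it ?"
        label = labelled_aspects.get(aspect, "none")   # multi-class target
        pairs.append((question, sentence, label))
    return pairs

pairs = make_qa_pairs("The waiter was very friendly.", {"service": "positive"})
```

Each (question, sentence) pair is then fed to BERT as a standard two-segment input, and the classifier head predicts the polarity label.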

Aspect-based sentiment analysis as QA — https://arxiv.org/pdf/1903.09588v1.pdf

I fine-tuned a bert-base-uncased model on the SemEval 2014 dataset using Hugging Face’s transformers library and visualized the attention maps using bertviz.

Task-specific learnings:

Here are a few of the interesting patterns I observed by probing the attention layers of the fine-tuned BERT model.

Aspect heads — aspect word understanding:

I observed that head 9-8 mostly attends to the aspect-related words in the review that correspond to the aspect in the question (in the pictures below, the word “service” receives a very high attention score from the word “waiter”). In most cases, the aspect word in the question (left side) has a high contribution from the aspect-related word in the review (right side). So this head could be considered to act as an aspect head.
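This kind of probe can be automated instead of read off a visualization. The helper below ranks the tokens that receive the most attention from a given query word; the attention matrix here is a toy stand-in for the real head 9-8 scores, which you would slice out of the model’s attention outputs (minding zero- vs one-based layer/head indexing):

```python
import numpy as np

def top_attended(attn, tokens, query_word, k=3):
    """Rank the tokens receiving the most attention from `query_word`.
    `attn` is one head's [seq_len, seq_len] attention matrix."""
    q = tokens.index(query_word)
    order = np.argsort(attn[q])[::-1][:k]
    return [(tokens[i], float(attn[q, i])) for i in order]

# Toy matrix mimicking the observed pattern: the question's aspect word
# ("service") attends strongly to the review word "waiter".
tokens = ["[CLS]", "service", "[SEP]", "the", "waiter", "was", "friendly", "[SEP]"]
attn = np.full((8, 8), 0.05)
attn[1, 4] = 0.65
attn /= attn.sum(axis=-1, keepdims=True)    # renormalize rows to sum to 1

print(top_attended(attn, tokens, "service"))
```

Running the same ranking over many review sentences is a cheap way to check whether a head behaves consistently as an “aspect head” beyond a handful of hand-picked examples.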