Since Google Research introduced BERT (Bidirectional Encoder Representations from Transformers) in 2018, the model has gained enormous popularity among researchers. BERT set new records on 11 natural language processing (NLP) tasks, and more than half of the top 10 models on the GLUE (General Language Understanding Evaluation) benchmark are built on the BERT architecture.

Now, a group of researchers from National Cheng Kung University in Tainan, Taiwan, is challenging BERT’s efficacy. Their paper, Probing Neural Network Comprehension of Natural Language Arguments, proposes that BERT’s impressive performance might be attributed to “exploitation of spurious statistical cues in the dataset” and that without those cues, BERT may do no better than random guessing. The paper has been accepted by the Association for Computational Linguistics (ACL).

The researchers introduce a new adversarial dataset that cuts BERT’s accuracy from 77 percent to 53 percent, and suggest that it be adopted as a standard for future performance evaluations.

What are spurious statistical cues?

The NLP community has built benchmarks such as SQuAD and those from AllenAI, as well as evaluation suites like GLUE, to test the performance of models. In 2018, a group of German researchers introduced the Argument Reasoning Comprehension Task (ARCT), designed to evaluate the reasoning ability of a language model. Given a premise (or reason) and a claim, the model has to pick the correct warrant, the statement that explains why the claim follows from the premise, from two options. For example, given the claim “Google is not a harmful monopoly” and the premise “People can choose not to use Google,” the model should choose the correct warrant, which is “Other search engines don’t redirect to Google” and not the alternative, “All other search engines redirect to Google.”

Although ARCT raised the bar for language models, BERT still scored 77 percent accuracy on it, only three points below the average (untrained) human baseline. That motivated the Cheng Kung University researchers to investigate why BERT works so well on ARCT. To their surprise, they discovered that BERT bases its predictions on spurious cues: for example, it tends to choose the warrant that includes the word “not.” Across the dataset, even a trivial rule that always chooses the warrant containing the word “not” can achieve 61 percent accuracy.

The term spurious correlation dates back to Karl Pearson in 1897, and denotes a mathematical relationship in which two or more events or variables are associated but not causally related. In this case, the word “not” is a spurious statistical cue: its presence often points the model to the right answer, but it has no causal relationship with that answer.

Moreover, bigrams that occurred with “not,” such as “will not” and “cannot,” were also found to be highly predictive.
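To make the idea of a cue concrete, the following is a minimal Python sketch of how one might count how often a word or bigram appears in exactly one of the two warrants, and how often that warrant is the correct one. It is not the authors’ code. The names Example, has_cue and cue_stats are hypothetical, and the TOY data is invented for illustration (the first item is the Google example above with “don’t” spelled out as “do not”), so the counts say nothing about the real ARCT dataset.

```python
import re
from dataclasses import dataclass


@dataclass
class Example:
    claim: str
    reason: str
    warrant0: str
    warrant1: str
    label: int  # index of the correct warrant (0 or 1)


def has_cue(text: str, cue: str) -> bool:
    """True if the cue (a unigram or bigram) occurs as whole tokens in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    cue_tokens = cue.lower().split()
    n = len(cue_tokens)
    return any(tokens[i:i + n] == cue_tokens for i in range(len(tokens) - n + 1))


def cue_stats(examples, cue):
    """Return (applicable, correct).

    applicable: examples where the cue appears in exactly one warrant.
    correct: how many of those have the cue in the correct warrant.
    """
    applicable = correct = 0
    for ex in examples:
        in0 = has_cue(ex.warrant0, cue)
        in1 = has_cue(ex.warrant1, cue)
        if in0 != in1:  # the cue can only discriminate if one side has it
            applicable += 1
            picked = 0 if in0 else 1
            correct += int(picked == ex.label)
    return applicable, correct


# Invented toy data, for illustration only.
TOY = [
    Example("Google is not a harmful monopoly",
            "People can choose not to use Google",
            "Other search engines do not redirect to Google",
            "All other search engines redirect to Google", 0),
    Example("Cars are still necessary",
            "Many people commute by car",
            "Public transport is always on time",
            "Public transport is not always on time", 1),
    Example("Schools should assign shorter books",
            "Reading time is limited",
            "Students will not read long books",
            "Students enjoy long books", 1),
    Example("Cats make good pets",
            "Cats are easy to care for",
            "Cats are friendly",
            "Cats are independent", 0),
]

for cue in ("not", "will not"):
    applicable, correct = cue_stats(TOY, cue)
    print(f"{cue!r}: applicable={applicable}, correct={correct}")
```

A cue is useful to a model only when it appears on one side and not the other, which is why examples with the cue in both warrants or in neither are skipped.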

To test their conjecture, the researchers ran additional experiments. BERT had previously been trained on full claim-reason-warrant inputs; the researchers now trained it on the warrants alone. It scored 71 percent accuracy, only six points below its peak performance, which suggests that most of its performance comes from cues in the warrants rather than from reasoning over the argument.
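The difference between the two training setups can be sketched as a small input-building step. This is an illustrative assumption rather than the paper’s preprocessing: build_inputs and its mode names are hypothetical, and joining the parts with BERT’s “[SEP]” separator token in this particular order is a guess at a plausible format.

```python
def build_inputs(claim, reason, warrants, mode="full"):
    """Return one input string per candidate warrant.

    mode="full": the model sees claim, reason and warrant.
    mode="warrant_only": the model sees nothing but the warrant, so any
    accuracy above chance must come from cues inside the warrants.
    """
    if mode == "full":
        # Assumed layout: argument text first, candidate warrant second.
        return [f"{claim} {reason} [SEP] {w}" for w in warrants]
    if mode == "warrant_only":
        return list(warrants)
    raise ValueError(f"unknown mode: {mode}")


claim = "Google is not a harmful monopoly"
reason = "People can choose not to use Google"
warrants = ("Other search engines don't redirect to Google",
            "All other search engines redirect to Google")

for mode in ("full", "warrant_only"):
    print(mode, build_inputs(claim, reason, warrants, mode))
```

In the warrant-only setting the claim and reason never reach the model, so it cannot be checking whether the warrant actually connects them.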

The researchers then created an adversarial dataset with the claims negated and the labels inverted, so that the distribution of statistical cues was mirrored across the two warrant options, effectively eliminating the signal. This dropped BERT’s accuracy to 53 percent, close to the 50 percent expected from guessing between two options.
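The mirroring idea can be illustrated with a short Python sketch. It assumes the negated claims are written by hand and supplied as data, and that the mirrored copies are combined with the originals; Example, mirror, adversarial_set and not_heuristic_accuracy are hypothetical names, and the TOY data is invented apart from the Google example (with “don’t” spelled out as “do not”).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Example:
    claim: str
    negated_claim: str  # assumed to be written by hand, not generated here
    reason: str
    warrants: tuple     # (warrant0, warrant1)
    label: int          # index of the correct warrant (0 or 1)


def mirror(ex):
    """Negate the claim and invert the label; the warrants stay the same."""
    return replace(ex, claim=ex.negated_claim,
                   negated_claim=ex.claim, label=1 - ex.label)


def adversarial_set(examples):
    """Originals plus their mirrored copies (an assumption of this sketch)."""
    return list(examples) + [mirror(ex) for ex in examples]


def not_heuristic_accuracy(examples):
    """Accuracy of always picking the warrant that contains 'not',
    measured over examples where exactly one warrant contains it."""
    hits = total = 0
    for ex in examples:
        flags = ["not" in w.lower().split() for w in ex.warrants]
        if flags[0] != flags[1]:
            total += 1
            hits += int(flags.index(True) == ex.label)
    return hits / total


# Invented toy data, for illustration only.
TOY = [
    Example("Google is not a harmful monopoly",
            "Google is a harmful monopoly",
            "People can choose not to use Google",
            ("Other search engines do not redirect to Google",
             "All other search engines redirect to Google"), 0),
    Example("Cars are still necessary",
            "Cars are no longer necessary",
            "Many people commute by car",
            ("Public transport is always on time",
             "Public transport is not always on time"), 1),
]

print(not_heuristic_accuracy(TOY))
print(not_heuristic_accuracy(adversarial_set(TOY)))
```

Because every warrant is the correct answer in one copy and the wrong answer in the other, any cue that lives only in the warrants is right exactly half the time on the combined set.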

Worrisome trends in NLP

The ACL paper is not the first attempt to rethink the efficacy of large-scale neural networks in NLP tasks. In fact, ever since neural networks were first applied to NLP, many language experts and linguists have been skeptical about whether deep learning models, which can efficiently learn latent representations from data, are actually able to understand the semantics of language.

Earlier this year, researchers from Johns Hopkins University and Brown University published a paper with similar conclusions: “a machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases.”

Some researchers have begun to question whether the machine learning community places too much emphasis on standard benchmark tests. Anna Rogers, a post-doctoral associate in the Text Machine Lab of the Computer Science Department at the University of Massachusetts Lowell, wrote in her blog that “If the reader’s main takeaway is going to be the leaderboard, that increases the perception that the publication-worthiness is only achieved by beating the SOTA.”

Rogers suggests the machine learning community should come up with new evaluation methods that reward those who innovate on architectures, rather than having teams compete against each other by simply enlarging their models or throwing more data and compute at their existing research.

The authors of the ACL paper stress that it is not intended to disparage the value of BERT, which they regard as a strong machine learning system that can improve if the problematic spurious statistical cues are neutralized. “Analysis of easy to classify data points showed reliance on a lower proportion of the strongest cue word than the BoV and BiLSTM — i.e. BERT has learned when to ignore the presence of ‘not’ and focus on different cues. This indicates an ability to exploit much more subtle joint distributional information.”

Google AI Researcher David Ha echoed this view: “I think the features learned using a model like BERT are still very useful for many applications, and it is good that such papers remind us of their limitations. Makes us take a step back from squeezing out small gains, and take a look at the bigger picture of what we are doing.”

The paper Probing Neural Network Comprehension of Natural Language Arguments is on arXiv.