AI and machine learning algorithms are vulnerable to adversarial samples: inputs altered slightly from the originals in ways that cause a model to mispredict. That’s especially problematic as natural language models become capable of generating humanlike text, which makes them attractive to malicious actors who would use them to produce misleading media. In pursuit of a technique that illustrates the extent to which adversarial text can affect model prediction, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), the University of Hong Kong, and Singapore’s Agency for Science, Technology, and Research developed TextFooler, a baseline framework for synthesizing adversarial text examples. They claim in a paper that it was able to successfully attack three leading target models, including Google’s BERT.

“If those tools are vulnerable to purposeful adversarial attacking, then the consequences may be disastrous,” said Di Jin, MIT Ph.D. student and lead author on the paper, who noted that the adversarial examples produced by TextFooler could improve the robustness of AI models trained on them. “These tools need to have effective defense approaches to protect themselves, and in order to make such a safe defense system, we need to first examine the adversarial methods.”

The researchers assert that besides the ability to fool AI models, the outputs of a natural language “attacking” system like TextFooler should meet certain criteria: human prediction consistency, such that human predictions remain unchanged; semantic similarity, such that crafted examples bear the same meaning as the source; and language fluency, such that generated examples look natural and grammatical. They say TextFooler meets all three even in black-box scenarios, where no model architecture or parameters (values that influence model performance) are available.

It achieves this by identifying the words that most influence the target model’s prediction and replacing them with semantically similar, grammatically correct words until the prediction changes. TextFooler is applied to two different tasks, text classification and textual entailment (judging whether one text fragment logically follows from another), with the goal of changing the classification or invalidating the entailment judgment of the original models. For instance, given the input “The characters, cast in impossibly contrived situations, are totally estranged from reality,” TextFooler might output “The characters, cast in impossibly engineered circumstances, are fully estranged from reality.”
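The rank-then-replace loop described above can be sketched in a few lines of Python. This is a toy illustration, not the researchers' implementation: the scoring function, its weights, the decision threshold, and the synonym table are all hypothetical stand-ins for a real black-box model and a real similarity-checked candidate search. Word importance is approximated here by how much the score drops when a word is deleted, which is an assumption of this sketch.

```python
from typing import Callable, Dict, List

# Hypothetical black-box target: a "negativity" score from 0 to 100.
# A real attack would query a trained model instead of this lookup table.
TOY_WEIGHTS: Dict[str, int] = {
    "estranged": 35, "contrived": 25, "totally": 15, "situations": 10,
}

def toy_score(words: List[str]) -> int:
    return 10 + sum(TOY_WEIGHTS.get(w, 0) for w in words)

def toy_label(words: List[str]) -> str:
    return "negative" if toy_score(words) >= 50 else "positive"

# Hypothetical synonym table; stands in for a search over semantically
# similar, grammatically compatible candidates.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "contrived": ["engineered"],
    "situations": ["circumstances"],
    "totally": ["fully"],
}

def greedy_attack(
    words: List[str],
    score: Callable[[List[str]], int],
    label: Callable[[List[str]], str],
    synonyms: Dict[str, List[str]],
) -> List[str]:
    """Replace important words until the label changes (or candidates run out).

    Assumes the original label corresponds to a high score, so lowering
    the score moves the input toward the other class.
    """
    original_label = label(words)
    base = score(words)
    # Step 1: rank positions by the score drop caused by deleting each word.
    importance = [base - score(words[:i] + words[i + 1:]) for i in range(len(words))]
    order = sorted(range(len(words)), key=lambda i: importance[i], reverse=True)

    adversarial = list(words)  # work on a copy; the input is left untouched
    # Step 2: walk the ranking, swapping in the candidate that lowers the score most.
    for i in order:
        candidates = synonyms.get(adversarial[i], [])
        if not candidates:
            continue
        best = min(candidates, key=lambda c: score(adversarial[:i] + [c] + adversarial[i + 1:]))
        trial = adversarial[:i] + [best] + adversarial[i + 1:]
        if score(trial) < score(adversarial):
            adversarial = trial
        if label(adversarial) != original_label:
            break  # prediction flipped; stop perturbing
    return adversarial

sentence = ("the characters cast in impossibly contrived situations "
            "are totally estranged from reality").split()
adversarial = greedy_attack(sentence, toy_score, toy_label, TOY_SYNONYMS)
print(" ".join(adversarial), "->", toy_label(adversarial))
# the characters cast in impossibly engineered circumstances are fully estranged from reality -> positive
```

With these made-up weights, the most important word ("estranged") has no candidate and is skipped, and the label flips only after three replacements. If no combination of candidates flips the label, the function returns the partially perturbed sentence, so a caller should check the label before counting the attack as a success.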

To evaluate TextFooler, the researchers applied it to text classification data sets with various properties, including news topic classification, fake news detection, and sentence- and document-level sentiment analysis, where the average text length ranged from tens of words to hundreds of words. For each data set, they trained the aforementioned state-of-the-art models on a training set before generating adversarial examples semantically similar to the test set to attack those models.

The team reports that on the adversarial examples, they managed to reduce the accuracy of almost all target models in all tasks to below 10% with fewer than 20% of the original words perturbed. Even for BERT, which attained relatively robust performance compared with the other models tested, TextFooler reduced prediction accuracy by a factor of about 5 to 7 on a classification task and about 9 to 22 on an entailment task (where the goal was to judge whether the relationship between two sentences was entailment, contradiction, or neutral).
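To make the "words perturbed" figure concrete, here is a minimal sketch that computes a word-level perturbation rate for the example sentence pair quoted earlier. It assumes the rate is simply the share of word positions that differ after a one-for-one substitution; the function name is hypothetical, and the paper's exact definition may differ.

```python
def perturbation_rate(original: str, adversarial: str) -> float:
    """Fraction of whitespace-separated words that differ between two texts.

    Assumes a substitution-only attack, so both texts have the same
    number of words.
    """
    orig_words = original.split()
    adv_words = adversarial.split()
    if len(orig_words) != len(adv_words):
        raise ValueError("texts must have the same number of words")
    changed = sum(o != a for o, a in zip(orig_words, adv_words))
    return changed / len(orig_words)

original = ("The characters, cast in impossibly contrived situations, "
            "are totally estranged from reality")
adversarial = ("The characters, cast in impossibly engineered circumstances, "
               "are fully estranged from reality")
rate = perturbation_rate(original, adversarial)
print(f"{rate:.0%} of words perturbed")
# 25% of words perturbed
```

Three of the 12 words change in this one short sentence. The sub-20% figure the team reports is a claim about whole data sets, where texts run from tens to hundreds of words, so a single example is not directly comparable.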

“The system can be used or extended to attack any classification-based NLP models to test their robustness,” said Jin. “On the other hand, the generated adversaries can be used to improve the robustness and generalization of deep learning models via adversarial training, which is a critical direction of this work.”