Three years ago the Microsoft online digital assistant “Tay” ignited a firestorm of controversy when users tricked it into spewing racist and inflammatory tweets. Now a group of researchers from the Seattle-based Allen Institute for Artificial Intelligence (AI2) has shown how trigger words and phrases can “inflict targeted errors” on natural language processing (NLP) model outputs, prompting them to generate racist and hostile content.

The researchers discovered that adding adversarial content to input texts caused classification accuracy on the Stanford Natural Language Inference (SNLI) dataset to plummet from almost 90 percent to less than one percent.

Prepending certain trigger phrases to paragraphs seriously reduced the performance of an ELMo-based SQuAD model, and a state-of-the-art GPT-2 language model was made to produce racist output even when conditioned on non-racial contexts.

The trigger search algorithm

These “Universal Adversarial Triggers” are input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. Researchers proposed a gradient-guided search over tokens to identify short phrases that could successfully trigger the target prediction.
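To make the idea of a gradient-guided token search more concrete, here is a minimal sketch in Python. It is not the researchers' implementation: the toy embedding matrix `E`, the linear classifier `W`, the mean-pooling model, and the function names `loss_and_grad` and `search_trigger` are all assumptions made for illustration. The sketch ranks replacement tokens with a first-order (linear) estimate of how the loss would change, then keeps a swap only if the real loss on the batch goes down.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, CLASSES = 50, 8, 2
E = rng.normal(size=(VOCAB, DIM))    # toy embedding matrix (hypothetical)
W = rng.normal(size=(CLASSES, DIM))  # toy linear classifier (hypothetical)


def loss_and_grad(trigger, inputs, target):
    """Mean cross-entropy toward `target` over the batch, plus the gradient
    of that loss with respect to each trigger token's embedding."""
    total_loss = 0.0
    grad = np.zeros(DIM)
    onehot = np.eye(CLASSES)[target]
    for tokens in inputs:
        seq = np.concatenate([trigger, tokens])   # trigger is prepended
        logits = W @ E[seq].mean(axis=0)          # mean-pooled toy model
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        total_loss += -np.log(probs[target])
        # With mean pooling, every position receives the same gradient.
        grad += W.T @ (probs - onehot) / len(seq)
    n = len(inputs)
    return total_loss / n, np.tile(grad / n, (len(trigger), 1))


def search_trigger(inputs, target, length=3, steps=5, top_k=5):
    """Greedy, gradient-guided search for a short input-agnostic trigger."""
    trigger = np.zeros(length, dtype=int)         # arbitrary starting tokens
    best_loss, grads = loss_and_grad(trigger, inputs, target)
    for _ in range(steps):
        for pos in range(length):
            # First-order estimate of the loss change for swapping in token e':
            # (e' - e) . grad. The e . grad term is the same for every
            # candidate, so ranking by e' . grad is enough.
            scores = E @ grads[pos]
            for cand in np.argsort(scores)[:top_k]:
                trial = trigger.copy()
                trial[pos] = cand
                trial_loss, trial_grads = loss_and_grad(trial, inputs, target)
                if trial_loss < best_loss:        # keep only real improvements
                    trigger, best_loss, grads = trial, trial_loss, trial_grads
    return trigger, best_loss


inputs = rng.integers(0, VOCAB, size=(16, 6))     # toy batch of token ids
start_loss, _ = loss_and_grad(np.zeros(3, dtype=int), inputs, target=1)
trigger, final_loss = search_trigger(inputs, target=1)
print(trigger, start_loss, final_loss)
```

Because a swap is accepted only when the measured loss decreases, the final loss can never exceed the starting loss. In a real model each trigger position would have its own gradient, obtained by backpropagation rather than the closed form used here.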

For text classification, the researchers studied two tasks: sentiment analysis and natural language inference. One of the trigger phrases they found for sentiment analysis is “Zoning tapping fiennes.” When this trigger is prepended to positive movie reviews, an LSTM model’s accuracy drops from 86 percent to 29 percent.

Researchers also identified one-word triggers that rendered textual entailment models (which evaluate the relationship between a text and a hypothesis) useless. For example, when “nobody” is added to the front of hypothesis sentences, 99 percent are predicted as contradictions.
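Measuring the effect of such a trigger comes down to prepending it to every input and counting how often the model returns the target label. The sketch below illustrates that bookkeeping only. The `toy_predict` classifier is a hypothetical keyword rule standing in for a real entailment model, and the example sentences are invented, so the number it prints says nothing about the figures reported above.

```python
from typing import Callable, Sequence


def attack_success_rate(
    predict: Callable[[str], str],
    inputs: Sequence[str],
    trigger: str,
    target_label: str,
) -> float:
    """Fraction of inputs classified as `target_label` once `trigger`
    is prepended to each of them."""
    hits = sum(predict(f"{trigger} {text}") == target_label for text in inputs)
    return hits / len(inputs)


def toy_predict(hypothesis: str) -> str:
    """Hypothetical stand-in model: calls any hypothesis that contains a
    negation word a contradiction."""
    negations = {"nobody", "no", "never"}
    words = set(hypothesis.lower().split())
    return "contradiction" if negations & words else "entailment"


hypotheses = [
    "A man is playing guitar.",
    "Two dogs run outside.",
    "A child is smiling.",
]

# Share of contradiction predictions without and with the trigger.
baseline = sum(toy_predict(h) == "contradiction" for h in hypotheses) / len(hypotheses)
attacked = attack_success_rate(toy_predict, hypotheses, "nobody", "contradiction")
print(baseline, attacked)
```

The same function could wrap any real classifier that maps a string to a label, which is also how one might check whether a trigger found on one model carries over to another.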

In reading comprehension, prepending text blocks with the trigger “why how because to kill american people” caused an ELMo-based SQuAD model to predict the exact answer “to kill american people” for 72 percent of all “why” questions it encountered.

For language generation, researchers found the trigger “TH PEOPLEMan goddreams Blacks” led to highly racist and disturbing text (63 percent of generated samples contain an explicitly racist statement).

Researchers also shared the following insights:

Triggers are transferable across models, which increases their threat level: attackers can generate a trigger using a local model and then transfer it to the target model.

These input-agnostic triggers provide new insights into “global” model behavior, such as general input-output patterns learned from a dataset.

Despite the strong progress in NLP over the past few years thanks to the wide adoption of deep learning, the research results show that NLP models remain vulnerable to adversarial attacks.

The paper Universal Adversarial Triggers for Attacking and Analyzing NLP is on arXiv. You can try the trigger on the ELMo-based SQuAD model here, and a live demo of the trigger for GPT-2 is available here.