Counteracting neural disinformation with Grover

Exploring the surprising effectiveness of a fake news generator for fake news detection

By Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi

Photo by Bank Phrom on Unsplash

Artificial Intelligence has enormous potential to benefit society. However, the same technology can cause harm, particularly if used by malicious adversaries. One important threat is that of “Neural Fake News”: machine-written disinformation at scale. Our position is that we need to invest in fundamental research towards tackling this threat now, before AI generation technology sufficiently matures so as to become usable by adversaries.

Toward this goal, our team of researchers at the University of Washington and the Allen Institute for Artificial Intelligence recently announced a paper titled “Defending Against Neural Fake News.” In our paper, we introduced Grover, a state-of-the-art model for studying and detecting neural fake news. Counterintuitively, Grover is good at spotting neural fake news because it learns to generate it. We’re excited that our work has attracted public interest (TechCrunch, GeekWire, Paul G. Allen School) and helped stimulate interesting discussions about this important research area. In this blog post, we hope to continue these discussions by answering some of the questions that we’ve been asked, in part by presenting additional experimental results that go beyond the scope of our paper.

In Part 1, we provide three new experimental results that suggest Grover is a strong detector of disinformation. First, we show that if an adversary were to mass-generate news, it becomes even easier to detect those articles as fake, with over 97% accuracy. This complements our experiments from the paper, in which we assumed a more challenging setting involving limited access to AI-written fake news articles. Second, we show strong performance in detecting fake news written by different classes of generators. Without seeing any examples from OpenAI’s GPT-2 model during the learning stage, Grover classifies 96% of GPT-2’s generations as machine-written. Third, Grover can even detect human-written fake news with high accuracy (over 95%).

In Part 2, we further study the range of attacks that might be used by an adversary seeking to spread disinformation. We show that simple techniques for fooling our Grover verifier, including “rejection sampling,” cannot be easily used by adversaries looking to share fake news. However, we demonstrate that Grover finds certain topics — such as financial news — more challenging, and we explore why this is the case.

To enable others wishing to fight neural fake news to build off of our work, we are releasing Grover publicly. A machine learning model consists of two parts: code and model parameters. We are releasing the code for Grover at github.com/rowanz/grover, along with the model parameters for Grover-Base and Grover-Large. These models are roughly on par with existing freely-available language models, and are thus unlikely to pose a new threat. We are also inviting researchers to apply to download our most powerful generator, Grover-Mega, in the hopes that they may build upon our work.

Part 1: Fake news detection is possible now, and it becomes easier with scale

In our paper, we introduced Grover, a powerful AI model capable of detecting — and generating — neural fake news. This is due to a finding that is perhaps counterintuitive: strong generators for neural fake news are themselves strong detectors of it. A generator of fake news will be most familiar with its own peculiarities, such as using overly common or predictable words, as well as the peculiarities of similar generators.
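The intuition about "overly common or predictable words" can be illustrated with a toy example. The sketch below is not how the Grover discriminator works; it swaps in a tiny add-one-smoothed unigram model, trained on a made-up corpus, to show how a model that can assign probabilities to text can also score how predictable a passage is. The names `train_unigram` and `predictability`, along with all example sentences, are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Iterable


def train_unigram(corpus: Iterable[str], smoothing: float = 1.0) -> Callable[[str], float]:
    """Fit a smoothed unigram model; return a function giving a token's log-probability."""
    counts = Counter(token for doc in corpus for token in doc.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens

    def log_prob(token: str) -> float:
        return math.log((counts[token] + smoothing) / (total + smoothing * vocab_size))

    return log_prob


def predictability(text: str, log_prob: Callable[[str], float]) -> float:
    """Mean per-token log-probability; higher means the text is more predictable."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    return sum(log_prob(token) for token in tokens) / len(tokens)


# Made-up training corpus and test sentences, for illustration only.
corpus = [
    "the bank said the market rose today",
    "the market fell as the bank raised rates",
]
log_prob = train_unigram(corpus)

common_text = "the bank said the market rose"
rare_text = "auditors quietly shelved an obscure memo"

print(predictability(common_text, log_prob) > predictability(rare_text, log_prob))
```

A real detector would rely on a far stronger language model and a learned decision rule, but the same principle applies: the better a model captures how a generator chooses words, the more signal it has for telling that generator's text apart.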

To evaluate the capabilities of Grover as a detector of neural fake news, we considered a challenging experimental setting. A detector model is provided 5000 examples of “neural fake news” from a generator model. A Grover model that has been adapted for detection in this setting achieves 92% classification accuracy at telling apart human-written “real” from machine-written “fake” news.

Still, a concern remains: if an adversary were to generate news articles at an extremely large scale, would an accuracy of 92% leave many articles undetected? Though this concern is reasonable, there is a strong trend in machine learning that model performance improves as more training data is provided. Thus, as the scale of neural disinformation by a single adversary increases, so too does the ease of flagging each of these generations as fake.

Grover detection accuracy versus the number of generations that are provided during training. The more examples Grover sees from an adversary, the easier fake news detection becomes.

To test the ease of detecting fake news when it is produced at scale, we studied a setting with up to 80,000 machine-written news articles. In this setting, Grover’s detection accuracy is 94% when trained on as few as 10,000 articles, and it rises to 97.5% when trained on 80,000 articles.
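To put these accuracies in perspective, the sketch below converts each one into an expected number of misclassified articles. It is back-of-the-envelope arithmetic only: the batch of 100,000 articles is a hypothetical figure, and the calculation assumes the error rate is simply one minus the reported accuracy.

```python
def expected_errors(num_articles: int, accuracy: float) -> int:
    """Expected number of misclassified articles at a given accuracy."""
    return round(num_articles * (1.0 - accuracy))


batch_size = 100_000  # hypothetical number of articles to be checked

# Accuracies reported in this post for each amount of training data.
reported = [
    ("5,000 training examples", 0.92),
    ("10,000 training examples", 0.94),
    ("80,000 training examples", 0.975),
]

for label, accuracy in reported:
    errors = expected_errors(batch_size, accuracy)
    print(f"{label}: about {errors} of {batch_size} articles misclassified")
```

Under these assumptions, moving from 92% to 97.5% accuracy cuts the expected number of errors by more than two thirds.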

Grover excels at detecting news written by Grover models, but it can do even more

As shown in our paper, the best defense against Grover-written fake news is Grover itself. However, in additional experiments, we find that Grover is also the best defense against fake news from other sources.

First, Grover excels at spotting OpenAI GPT-2-written news as fake. Since the largest GPT-2 model is currently unavailable, even to researchers, we tested our Grover discriminator on the 500 generations that OpenAI uploaded publicly. [1] Without ever having seen any GPT-2 generations during training (a zero-shot setting), Grover correctly classifies 96.1% of them as machine-written.

Second, Grover is a strong detector of human-written fake news. Following the same methodology as in our paper, we collected 30k examples of human-written disinformation from Common Crawl. We then considered a classification setting in which a Grover discriminator is given 30k examples each of real news, machine-written fake news, and human-written fake news. In this setting, Grover spots both categories of fake news with over 95% accuracy for each. This result extends prior work (Rashkin et al., 2017), which found that human-written fake news is detectable through overuse of key phrases.

Grover can tell apart not only human-written real news from neural fake news, but also real news from human-written disinformation. A Grover model that is trained to do both gets over 95% accuracy in both settings.

Part 2: Simple techniques aren’t enough to consistently fool Grover

As we have shown thus far, with minimal additional training, Grover becomes a very strong detector of neural fake news. However, we must also anticipate possible attacks that adversaries will develop, and probe Grover’s ability to defend against them. In this section, we seek to further explore the robustness of Grover against these attacks.

One possible concern is a rejection-sampling attack. If we assume an adversary has access to the exact same discriminator that a verifier is using, they can generate articles until one is found that fools the discriminator. This scenario rests on a strong and rather unrealistic assumption: the adversary knows everything about the verifier, while the verifier knows nothing about the adversary. [2]

We tested the limits of this setting experimentally. We studied a situation in which an adversary has 5000 contexts (headlines, dates, authors, and domains) that they wish to generate articles for. For each context, they produce 64 generations using different Nucleus Sampling thresholds (top-p values in [0.9, 1.0]; see Holtzman et al., 2019). They then select the generation to which the default Grover discriminator (the exact same one that the verifier has) assigns the lowest probability of being machine-written.
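The selection step of this attack can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in our experiments: `toy_generate` and `toy_machine_prob` are hypothetical stand-ins for a generator and a discriminator, the top-p thresholds are assumed to be evenly spaced over [0.9, 1.0], and the discriminator's score is treated as the probability that a text is machine-written.

```python
import random
from typing import Callable, List, Tuple


def threshold_grid(num_samples: int = 64, low: float = 0.9, high: float = 1.0) -> List[float]:
    """Evenly spaced top-p thresholds in [low, high] (the spacing is an assumption)."""
    if num_samples == 1:
        return [low]
    step = (high - low) / (num_samples - 1)
    return [low + i * step for i in range(num_samples)]


def rejection_sampling_attack(
    context: str,
    generate: Callable[[str, float], str],
    machine_prob: Callable[[str], float],
    num_samples: int = 64,
) -> Tuple[str, float]:
    """Keep the candidate that the discriminator rates least likely to be machine-written."""
    candidates = [generate(context, top_p) for top_p in threshold_grid(num_samples)]
    scored = [(machine_prob(text), text) for text in candidates]
    best_score, best_text = min(scored)
    return best_text, best_score


# Hypothetical stand-ins so the sketch runs without any model.
def toy_generate(context: str, top_p: float) -> str:
    return f"{context} [draft sampled at top_p={top_p:.4f}]"


def toy_machine_prob(text: str) -> float:
    # Deterministic pseudo-score in [0, 1); a real discriminator would go here.
    return random.Random(text).random()


article, score = rejection_sampling_attack("Headline: Example", toy_generate, toy_machine_prob)
print(f"kept draft with machine-written probability {score:.3f}")
```

The attack only helps the adversary if the verifier's discriminator stays fixed, which is why retraining on the adversary's selected generations is the natural countermeasure.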

Grover’s standard detection accuracy, along with the accuracy against an adversary who has full access to the detection model and performs a rejection-sampling attack. The attack gives our hypothetical adversary only a temporary advantage, however, since retraining on the adversary’s generations returns the accuracy to a level we expect: over 89%.

If we make the assumption that a verifier cannot perform any additional training on this set of generations — itself another strong assumption — then the resulting detection accuracy is 13%.

However, after training the Grover detection model on 4000 of these generations, the accuracy on the final 1000 is 89.3%. In other words, rejection sampling produces only a temporary advantage to a hypothetical attacker, because once Grover is trained on additional generations from that attacker, its accuracy returns to levels we expect.

On the more technical side, our findings suggest that Grover as a detector cannot be repeatedly fooled by today’s neural generators. The idea of iterative adversarial attacks was recently introduced by SWAG (Zellers et al., 2018) and studied further by HellaSWAG (Zellers et al., 2019). The key takeaway is that adversaries can succeed with iterative rejection sampling (or “adversarial filtering”) only when the machine-written text is short: roughly three or fewer sentences. However, realistic news articles contain tens or even hundreds of sentences, making neural fake news easy to detect.

Some news is harder to detect than others

Detection accuracy by news source. The real articles in each category come from the actual website, while the neural fake ones are generated by Grover and written in the style of that domain.

While we find that rejection sampling is insufficient to fool Grover as a detector, we also find that some types of news are harder to classify than others. Above, we present Grover’s detection accuracy at telling apart real versus fake articles that were written in the style of a specific website. Grover is most successful at distinguishing real versus fake news based on “slate.com,” achieving 99.8% accuracy, with nearly as high a rate of success for content attributed to other major news outlets.

Interestingly, we find that financial news poses more of a challenge to Grover’s detection, with “americanbankingnews.com” a notable outlier at 60.5%. One possible explanation is that many articles on financial news websites are automatically generated based on stock prices. For these sites, it is thus harder to distinguish artificially produced fake news from the real thing.

Conclusion: Let’s open up the discussion

In this blog post, we’ve explored in more depth the use of a generator like Grover to detect neural fake news. Our results suggest that Grover can spot disinformation produced by a variety of different adversaries, both human and machine.

However, the exploration doesn’t stop there. We’ve released the code at github.com/rowanz/grover so other researchers can test out Grover themselves. We’ve also released model checkpoints for the smaller Grover models: Grover-Base and Grover-Large. Researchers can apply to download Grover-Mega as well as our large news corpus, RealNews.

Grover represents a strong first step towards the development of effective automated defense mechanisms against mass production of fake news by machines. We’re excited to see what further progress the community can make at defending against neural fake news.

Thanks to Kristin Osborne for providing feedback on this post.