GPT-2 generates text that is far more realistic than any text generation system before it. OpenAI was so shocked by the quality of the output that they decided that the full GPT-2 model was too dangerous to release because it could be used to create endless amounts of fake news that could fool the public or clog up search engines like Google.

How easy is it for an average person to generate fake news that could trick a real reader, and how good are the results?

Let’s explore how a system like this could work and how much of a threat it is. Let’s try to build a newspaper populated with fake, computer generated news:

newsyoucantuse.com, my totally-fake, 100% AI-generated newspaper

To populate News You Can’t Use, we’ll create a Python script that can ‘clone’ a news site like the New York Times and generate artificial news stories on the same topics. Then, we’ll test the quality of the output and discuss some of the ethical issues that this kind of model raises.

The Entire Universe, One Word at a Time

The text generated by GPT-2 shows off a huge amount of knowledge about the world. GPT-2 was trained on many gigabytes of text scraped from the web and it has encoded a lot of that knowledge into the model. So when it generates text, it uses that knowledge to make the sentences more realistic. This is what makes it so powerful — the text it generates actually contains real world facts and figures.

If we enter a piece of starting text like “Abraham Lincoln was”, it will generate a sentence that fits that specific historical person, something like “Abraham Lincoln was born on April 4, 1809 in Springfield, Illinois.”

This sentence is fascinating for several reasons:

First, it shows that the model has encoded that Abraham Lincoln was a person in America who was born in 1809.

Second, the sentence is perfectly formatted and indistinguishable from something written by a human.

Third, the sentence is entirely wrong.

Abraham Lincoln was born on February 12th, not April 4th. And he was born in Hodgenville, Kentucky, not Springfield, Illinois! But let’s be honest, you didn’t know that, did you? It sounded right!

That’s GPT-2 in a nutshell. It’s just a statistical model of how English is used on the web. It doesn’t have a database of facts and figures to draw from like a ‘Question and Answer’ system. Instead, it has gotten so good at predicting the next word in any given situation that it has accidentally encoded a weird, squishy version of all human knowledge within the model.

Abraham Lincoln did start his career as a lawyer in Springfield, Illinois, and Springfield is a city, so Springfield makes sense as a probable word that might show up in a sentence about Abraham Lincoln.

The other amazing ability of GPT-2 is that it can maintain coherency across several paragraphs of text. It’s not just stringing together individual words that sound plausible, but it’s creating text where pronouns agree with each other and the people and ideas mentioned in the text are consistent throughout. This is where it really blows away previous models.

How GPT-2 Works

There are two main things that make GPT-2 work better than previous text generation models.

The first is that it is much bigger than other models. GPT-2 is absolutely huge: 1.5 billion parameters, trained on a massive dataset using a remarkable amount of GPU-based computing power.

When OpenAI decided not to release the full GPT-2 model, they were confident that it couldn’t be easily reproduced by individuals without specialized technical expertise and massive computing resources. But computing power gets cheaper and more accessible every day. Within a few months, two grad students named Aaron Gokaslan and Vanya Cohen were able to reproduce the model with $50k in cloud computing credits. And unlike OpenAI, they released their results. Several other teams replicated the model as well.

The second secret that makes GPT-2 work is that it’s built on top of a brand new way to model text that is significantly better than anything before it — the Transformer.

To understand how Transformer-based models work, we need to start by looking at how things used to work.

Bag of Words models

One of the simplest language models is called a “bag of words” model, or BoW. If you want to predict the next word in a sentence, you tell it which words have appeared in the sentence so far — in no particular order — and ask it to tell you the most likely next word.

Let’s use the starting text “Abraham Lincoln was” and ask it to predict the next word:

The words get passed into the model in no particular order, as if we threw them all in a bag and handed the jumbled mess to the model. The advantage of this approach is that we can use any off-the-shelf machine learning algorithm without any changes. But the obvious problem is that the model can only consider the presence of words, not the order they appeared in. Word order matters a lot in language. Needing to model word order is what makes it hard to adapt traditional machine learning algorithms to language.
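Here’s a toy bag-of-words predictor to make the idea concrete. The three-sentence corpus and the `predict_next` helper are made up for illustration; a real model would be trained on far more text and use a proper learning algorithm rather than raw counts:

```python
from collections import Counter, defaultdict

# A tiny toy corpus standing in for real training data.
corpus = [
    "abraham lincoln was born in kentucky",
    "abraham lincoln was a lawyer",
    "george washington was born in virginia",
]

# For each bag of preceding words, count which word came next.
# The bag is a frozenset, so word order is deliberately thrown away.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words)):
        bag = frozenset(words[:i])
        next_word_counts[bag][words[i]] += 1

def predict_next(bag_of_words):
    """Return the most likely next word given an unordered bag of words."""
    counts = next_word_counts[frozenset(bag_of_words)]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next({"abraham", "lincoln"}))  # "was"
```

Notice that `predict_next({"lincoln", "abraham"})` gives the same answer, because the model literally cannot see the order the words came in.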

Sequence Models and RNNs

In the early 2010s, Recurrent Neural Networks, or RNNs, became very popular for text modeling. They combine the flexibility of neural networks with an inherent ability to work well with variable-length sequences of text.

In an RNN, the model builds up an understanding of the sentence by looking at each word in sequence. As it sees each word, it updates a memory cell with what it thinks the current word means — thus generating a memory that represents the meaning of the whole sentence so far. When you finally ask it to predict the next word, it’s predicting based on this accumulated memory:
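A single RNN step can be sketched in a few lines. The three-number word “embeddings” and the fixed weights below are invented for illustration; a real RNN learns both during training, and its memory cell is far more sophisticated:

```python
import math

def rnn_step(memory, word_vector, w_mem=0.5, w_in=1.0):
    """One RNN step: blend the old memory with the new word, squashed by tanh."""
    return [math.tanh(w_mem * m + w_in * x) for m, x in zip(memory, word_vector)]

# Tiny made-up 3-number "embeddings" for each word (real models learn these).
embeddings = {
    "abraham": [0.9, 0.1, 0.0],
    "lincoln": [0.8, 0.2, 0.1],
    "was":     [0.0, 0.5, 0.5],
}

memory = [0.0, 0.0, 0.0]  # start with an empty memory
for word in ["abraham", "lincoln", "was"]:
    memory = rnn_step(memory, embeddings[word])

print(memory)  # the whole sentence summarized as one small array of numbers
```

The key point is that whatever the sentence length, the model ends up with the same fixed-size memory array to predict from.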

RNNs gave us a way to take everything we had learned building image classification systems with deep neural networks and apply it to language modeling where sentences vary in length. Within just a year or two, everything from automatic language translation systems to speech recognition systems got a lot better.

This is when all those “this text was generated by an AI” memes first started popping up on the internet. A well-trained RNN can generate text that looks pretty real — at least for a sentence or two.

Making RNNs better with Attention

At the time, RNNs seemed like the obvious future of text modeling, so researchers kept looking for ways to improve their ability to model text. The most significant advance was adding an Attention mechanism.

A normal RNN remembers the words it has seen so far as a single array of numbers in its internal memory. Each new word it sees updates that memory, so more recent words tend to overpower earlier words. As a result, RNNs don’t work very well on long pieces of text.

With an Attention mechanism, the model keeps track of what its memory was after each word. Then when it predicts the next word, it bases its prediction on a weighted version of all these past memories:

By weighting the old memory states differently, it “pays attention” to each word in the sentence more or less depending on how it thinks that word will help predict the next word.
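This weighting step can be sketched directly. The saved memory states and attention scores below are made-up numbers; in a real model, the scores themselves are computed by a learned network rather than hand-picked:

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(memories, scores):
    """Blend the saved per-word memory states, weighted by attention scores."""
    weights = softmax(scores)
    blended = [0.0] * len(memories[0])
    for weight, memory in zip(weights, memories):
        for i, value in enumerate(memory):
            blended[i] += weight * value
    return blended

# One saved memory state per word seen so far (made-up numbers).
memories = [
    [0.9, 0.1],  # memory after "abraham"
    [0.8, 0.3],  # memory after "lincoln"
    [0.2, 0.7],  # memory after "was"
]
scores = [0.1, 2.0, 0.5]  # the model "pays attention" mostly to "lincoln"

print(attend(memories, scores))
```

Because “lincoln” gets by far the highest score, the blended memory ends up looking mostly like the memory state saved after that word.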

Attention models led to significant performance improvements over standard RNNs in almost every situation, including text generation. But they still weren’t able to generate text with coherent ideas across entire paragraphs of text.

Replacing RNNs with the Transformer

After seeing how much adding Attention to RNNs improved their accuracy, researchers started to wonder if the RNN itself was even necessary. Maybe the way that Attention weights the predictive value of each previous word in the sentence was what really mattered.

Imagine that we randomly drop a word out of a real sentence:

If we train the model to predict the missing word based on the remaining words in the sentence, we can also measure how instrumental each remaining word was in that prediction. That would give us a measurement of how strongly each word in the sentence is related to the missing word:

We can keep doing this for other words in the sentence, too. Let’s see which words matter most if we drop out “Lincoln”:

We can repeat this for every word in the sentence to find out how strongly-related each word is to every other word. And if we do this over millions and millions of sentences scraped from the web, the model will learn how every word in English relates to every other word in every possible context. This is what happens inside of a transformer module.
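A crude stand-in for this relatedness measurement is plain co-occurrence counting. The tiny corpus and the `relatedness` helper are hypothetical; a real transformer learns these relationships as attention weights over vectors, not raw counts, but the flavor is similar:

```python
from collections import Counter
from itertools import combinations

# A tiny corpus standing in for millions of sentences scraped from the web.
corpus = [
    "abraham lincoln was born in kentucky",
    "abraham lincoln was a lawyer in springfield",
    "lincoln delivered the gettysburg address",
    "the address was short",
]

# Count how often each pair of words appears in the same sentence.
pair_counts = Counter()
for sentence in corpus:
    for a, b in combinations(sorted(set(sentence.split())), 2):
        pair_counts[(a, b)] += 1

def relatedness(word_a, word_b):
    """How strongly two words are related, by same-sentence co-occurrence."""
    return pair_counts[tuple(sorted((word_a, word_b)))]

# Which remaining words matter most if we drop "lincoln" from a sentence?
for word in ["abraham", "was", "the"]:
    print(word, relatedness(word, "lincoln"))
```

Even in this toy version, “abraham” turns out more strongly related to “lincoln” than a filler word like “the”, which is exactly the kind of signal a transformer learns at enormous scale.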

A transformer module encodes the meaning of each word based entirely on the words around it in the current sentence. Using sentence context is what makes it work so well. A transformer has no problem understanding that the same word can mean totally different things in different contexts. This is why it can model sentences with pronouns so effectively. The meaning of a word like “he” or “she” is defined by the other words currently in the sentence.

And as years of deep learning research have taught us, if something works well on its own, why not try to stack them? Just like with layers in a deep neural network, we can stack transformer modules on top of each other:
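The stacking idea can be sketched like this. The `transformer_layer_stub` below just mixes each vector with the average of the others and is purely illustrative; a real transformer layer uses learned attention weights and much larger vectors, but the wiring is the same: the output of one layer becomes the input to the next:

```python
def transformer_layer_stub(vectors):
    """Stand-in for a real transformer module: mixes each word's vector with
    the average of all the other vectors (real modules use learned attention)."""
    n = len(vectors)
    mixed = []
    for i, vec in enumerate(vectors):
        others = [vectors[j] for j in range(n) if j != i]
        avg = [sum(vals) / len(others) for vals in zip(*others)]
        mixed.append([0.5 * v + 0.5 * a for v, a in zip(vec, avg)])
    return mixed

def stacked_model(vectors, num_layers=4):
    """Stack several layers: each layer's output feeds the next layer."""
    for _ in range(num_layers):
        vectors = transformer_layer_stub(vectors)
    return vectors

# Made-up 2-number vectors for a 3-word sentence.
sentence_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(stacked_model(sentence_vectors))
```

Each pass through a layer lets every word absorb a little more information from every other word, which is why deeper stacks can capture more complex relationships.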