It seems pretty clear to me by now that GPT-2 is not as dangerous as OpenAI thought (or claimed to think) it might be.

The 774M version has been out there for a while, and although it only has half as many parameters as the biggest version, I don’t expect there to be any large qualitative leap between the two. After all, OpenAI’s staged release plan has given us two size-doublings already – from 124M to 355M, from 355M to 774M – and the differences after each doubling are surprisingly subtle, and overlaid on the same basic, recognizable strengths and weaknesses.

I’ve played with these models a lot this year, mostly via fine-tuning – it’s almost a hobby at this point. I’ve

- fine-tuned them on all sorts of different texts, including this tumblr (a minimal sketch of the workflow follows this list)
- fine-tuned them on mixtures of very different texts (not very interesting – it’ll decide which type of text it’s writing in any given sample and stick with it)
- tried different optimizers and learning rates for fine-tuning
- experimented with custom encodings (common tags → single non-English characters) to fit more text into the window when fine-tuning on webpages (toy example below)
- tried to generate longer texts by repeatedly feeding the output back in as context (i.e. prompt)
- twiddled all the sampling parameters (temperature, top-k / top-p / neither) when sampling from any of the above
- read over tons and tons of sampling output while monitoring a fine-tuning job, curating material for @uploadedyudkowsky, etc.
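For the curious, the basic fine-tune-then-sample workflow looks roughly like the sketch below. It uses the Hugging Face transformers library for illustration rather than whatever scripts I actually ran, and the hyperparameters are placeholders, not recommendations:

```python
# Minimal sketch of the fine-tune-then-sample workflow, using Hugging Face transformers
# for illustration (not the exact tooling I used); hyperparameters are placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")   # the 355M checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# --- fine-tuning: ordinary next-token cross-entropy on your own corpus ---
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # learning rate is one knob worth twiddling
batch = tokenizer("a chunk of the text you want it to imitate", return_tensors="pt").input_ids
optimizer.zero_grad()
loss = model(batch, labels=batch).loss   # language-modeling loss on your corpus
loss.backward()
optimizer.step()                         # repeat over the whole corpus for a few epochs

# --- sampling: temperature / top-k / top-p are the other knobs worth twiddling ---
prompt = tokenizer("A prompt in the fine-tuned style", return_tensors="pt").input_ids
out = model.generate(prompt, do_sample=True, max_length=200,
                     temperature=0.9, top_k=40, top_p=0.95)
print(tokenizer.decode(out[0]))
```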
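And the “custom encodings” item just means a reversible substitution applied before tokenization, so that verbose markup costs one character instead of many BPE tokens. A toy version, with placeholder tags and characters rather than the mapping I actually used:

```python
# Toy version of the tag-compression trick: swap common markup for single rare
# characters before fine-tuning, and invert the mapping on generated text.
# The tags and characters here are placeholders, not my actual mapping.
TAG_MAP = {
    "<blockquote>": "¶",
    "</blockquote>": "Ø",
    "<li>": "†",
    "</li>": "‡",
}

def encode_tags(text: str) -> str:
    for tag, char in TAG_MAP.items():
        text = text.replace(tag, char)
    return text

def decode_tags(text: str) -> str:
    for tag, char in TAG_MAP.items():
        text = text.replace(char, tag)
    return text
```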

By now I think I have a good feel for the overall quality, and the quirks, of GPT-2 sampled text. IMO, the model is good at all sorts of interesting things, but arguably least good at the things required for disinformation applications and other bad stuff.

———

It is best at the smallest-scale aspects of text – it’s unsettlingly good at style, and I frequently see it produce what I’d call “good writing” on a phrase-by-phrase, sentence-by-sentence level. It is less good at larger-scale structure, like maintaining a consistent topic or (especially) making a structured argument with sub-parts larger than a few sentences.

Some of this is completely intuitive: GPT-2, which only learns from text, is at the largest disadvantage relative to humans in areas that require models of the outside world (since we experience that world in many non-textual ways), while there is much more parity in areas like style that are purely internal to language, especially written language.

Some of it is less intuitive. GPT-2 samples often lack some large-scale features of real text that seem very simple and predictable. For example, when generating fiction-like prose, it will frequently fail to track which characters are in a given scene (e.g. character A has some dialogue yet character B refers to them as if they’re not in the room), and has a shaky grasp of dialogue turn conventions (e.g. having the same character speak twice on successive lines). In nonfiction-like prose, it tends to maintain a “topic” via repeating a set of key phrases, but will often make wildly divergent or contradictory assertions about the topic without noting the discontinuity.

I suspect some of this can be chalked up to the fact that GPT-2 is trained as a language model, i.e. as something that predicts real text, which is not quite the same thing as generating fake text. Its training objective only cares about the distribution of training text, and does not encourage it to respond to its own predictive distribution in a stable or nice way. (Note that its predictive distribution, by construction, is different from real text in that it’s less surprising to the model – see this great paper.)
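To make that concrete, the training loss is just next-token cross-entropy on real text, with every term conditioned on a real prefix from the corpus; nothing in it ever conditions on tokens the model itself generated. (The notation here is generic, not lifted from the GPT-2 paper.)

```latex
% Standard language-modeling objective: every conditioning prefix comes from the
% training corpus, never from the model's own samples.
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right),
\qquad (x_1, \dots, x_T) \sim \text{training corpus}
```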

The fact that feeding samples from the predictive distribution back into GPT-2 for further prediction produces impressive “generated text,” and not garbage, is thus a happy accident rather than an optimization target. Indeed, getting this to happen requires judicious choice of the sampling method, and (op. cit.) some naive sampling methods do yield garbage.
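To give a sense of what a “judicious” sampling method actually does: top-p (nucleus) sampling truncates the predictive distribution at each step before drawing from it. A minimal sketch in PyTorch, assuming a 1-D tensor of logits over the vocabulary; setting top_p=1.0 recovers the naive pure-sampling case:

```python
# Minimal sketch of nucleus (top-p) sampling: keep only the smallest set of
# high-probability tokens whose total mass reaches top_p, then sample from that set.
import torch

def sample_next_token(logits: torch.Tensor, top_p: float = 0.95, temperature: float = 1.0) -> int:
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep the smallest prefix of tokens whose total probability reaches top_p
    keep = (cumulative - sorted_probs) < top_p
    keep[0] = True                                      # always keep the single most likely token
    sorted_probs = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    sorted_probs = sorted_probs / sorted_probs.sum()    # renormalize over the nucleus
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()
```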

Even with good sampling methods like top-p, the stability of sampling is somewhat brittle; when I’ve tried to generate texts longer than the context window via repeated “self-prompting,” I’ve noticed a phenomenon where the text will usually fall off a quality cliff after a certain point, suddenly becoming strikingly ungrammatical and typo-ridden and full of anomalous paragraph breaks. [EDIT 6/10/20: I now think this may have been due to a bug in my code, and in any event I no longer think it’s a robust property of GPT-2 generation.] My hypothesis is that this works like the panda/gibbon adversarial examples: the samples have an uncommonly high density of features GPT-2 can recognize, and eventually there’s a confluence of these that push in the same direction in some linear subspace (consider here the use of a non-saturating activation, gelu, in the transformer), which pushes the model far from the training manifold.
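For reference, the “self-prompting” loop I mean is just the following, sketched against a Hugging Face-style model and tokenizer; the window and chunk sizes are arbitrary choices, not the ones I used:

```python
# Sketch of the self-prompting loop: generate a chunk, append it to the running text,
# then re-prompt with the tail of everything generated so far.
def generate_long_text(model, tokenizer, prompt: str, n_rounds: int = 10,
                       chunk_tokens: int = 200, context_tokens: int = 800) -> str:
    text = prompt
    for _ in range(n_rounds):
        # only the tail of the running text fits back into the finite context window
        ids = tokenizer(text, return_tensors="pt").input_ids[:, -context_tokens:]
        out = model.generate(ids, do_sample=True, top_p=0.95,
                             max_length=ids.shape[1] + chunk_tokens)
        text += tokenizer.decode(out[0, ids.shape[1]:])
    return text
```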

To zoom back out again, the model is capable of frequent brilliance at the phrase, sentence and even paragraph level, but its samples struggle with more global coherence across the scale of a short article or longer, and with maintaining recognizable positions that look like they refer to the real world. (In conjunction with the lower-level good writing, this often generates an amusing “insight porn” effect: it feels like someone is saying something very intelligent and interesting… if only you could figure out what.)

———

My knee-jerk reaction is that this makes the model relatively useless for disinformation. Getting it to “argue for X” or even “write about X” is quite difficult, while aiming it at specific genres or styles is very effective.

The real situation is a little more subtle than that. The model is unusually good at making things that look like news stories, presumably because they are common in the training set; in OpenAI’s large collection of released unconditional samples, news-like text dominates. Thus, presuming you can find an effective way to feed a fake event into the model on the concept level, it will be able to generate convincing “fake news” that stays on topic and so forth.

This is what the creators of “GROVER” have done, albeit with a custom training corpus. Roughly, they’ve trained a transformer to understand the relation between a news headline and the corresponding story in a structured way, allowing them to feed in the core substance of a hypothetical news story via the headline. They then sample the body text, and (interestingly) loop back and generate the headline, overwriting the initial one.
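In outline, the scheme they describe amounts to something like the sketch below. This is my own schematic, not GROVER’s actual code or interface, and generate_field is a hypothetical stand-in for a model that can fill in one field of an article given the others:

```python
# Schematic of the two-pass generation scheme described above (my own illustration,
# not GROVER's interface). generate_field is a hypothetical model call that fills in
# one field of an article conditioned on the rest.
def grover_style_article(generate_field, seed_headline: str) -> dict:
    article = {"headline": seed_headline, "body": ""}
    # pass 1: condition on the seed headline, generate the body text
    article["body"] = generate_field(target="body", given=article)
    # pass 2: condition on the generated body, overwrite the headline
    article["headline"] = generate_field(target="headline", given=article)
    return article
```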

What they show, basically, is that this lets you take a headline from Breitbart or Infowars or some “natural cancer cures” type website, generate from it a consistent news story in the style of a “real news” venue like the NYT, and then loop back and re-write the headline in a “real news” style as well. Perhaps unsurprisingly, MTurkers then rate the resulting texts as more trustworthy than the originals.

There is definitely something a little scary about this, especially in the way it does give you close control over the topic, something that’s difficult with simple text prompting. On the other hand… do we really believe that, in 2019, with Trump as president, the Breitbart type of fake news is suffering from a stylistic credibility gap? That there are people ready to believe that vaccination is an evil conspiracy, but only if the claim comes with an article that sounds like the NYT or WaPo?

The niche filled by this technology for bad actors just doesn’t feel like a niche that needs filling. Lots of people will reshare articles on social media just based on the headline, without even clicking through, and people less trusting than this often (and sensibly) care about the actual source, not just the style. I’m just not sure there’s a role for a device that will let you register TotallyARealNewspaper.biz and then auto-fill it with articles that sound exactly like Paul Krugman telling you that immigration = genocide.

And then, too, there’s the observation that actually prompted this post: AFAIK, the bad actors are not doing this stuff. People have mostly used the technology for clearly-signposted fake subreddits and other harmless amusements. GROVER was created by academics as a threat modeling exercise, on the premise that bad actors could make such a thing, so we’d better be prepared. But where are the actual GPT-2 clickfarms? They totally could exist by now, but I’ve never heard of even a single one. (And trust me, it’s not like the text generation is so good that no one would ever notice.)