Aaron Gokaslan*, Vanya Cohen*, Ellie Pavlick, Stefanie Tellex | Brown University

Introduction

Recently, large language models like BERT¹, XLNet², GPT-2³, and Grover⁴ have demonstrated impressive results in text generation and on multiple NLP tasks. Since OpenAI has not released their largest model at this time (though they have released their 774M-parameter model), we seek to replicate their 1.5B model to allow others to build on our pretrained model and further improve it.

You can access the model and generate text using our Google Colab.

We’ve also made the model weights available separately.

Replication

Radford et al.’s³ security strategy of delaying the release of the model relies on these models being difficult to replicate and requiring a high degree of specialized domain knowledge. We demonstrate that many of the results of the paper can be replicated by two master’s students with no prior experience in language modeling. Because of the relative ease of replicating this model, an overwhelming number of interested parties could replicate GPT-2. Further, Zellers et al.⁴ show that large language models like GPT-2 are an invaluable tool for countering the use of the same models as text generators.

Because our replication efforts are not unique, and large language models are the current most effective means of countering generated text, we believe releasing our model is a reasonable first step towards countering the potential future abuse of these kinds of models.

We base our implementation on the Grover model⁴ and modify their codebase to match the language modeling training objective of GPT-2. Since their model was trained on a similarly large corpus, much of the code and hyperparameters proved readily reusable. We did not substantially change the hyperparameters from Grover.

The cost of training the model from scratch using our code is about $50k. It’s important to note that this figure is the estimated value of the cloud compute, and does not reflect the much smaller intrinsic cost: training is considerably cheaper on less time-efficient, less user-friendly compute resources.

There is a significant time-cost tradeoff, and slower training methods have considerably smaller costs, thus reducing the barrier to entry.

Dataset

The original paper provided minimal details on how the dataset was cleaned.

As in WebText³, we begin by parsing out all links from Reddit with more than 3 up-votes. We started with the Pushshift Reddit scrape⁵, a dataset containing a continuously updated collection of Reddit posts, comments, and related metadata. These links are then filtered to remove direct links to file types unlikely to contain usable text or HTML (e.g., video files, PDFs, and CSS style files).
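A minimal sketch of this link-filtering step; the up-vote threshold comes from the paper, but the specific extension blocklist below is our illustrative assumption, not a published list:

```python
import re
from urllib.parse import urlparse

# Illustrative blocklist of extensions unlikely to yield usable text/HTML.
BLOCKED_EXTENSIONS = {'.mp4', '.avi', '.mov', '.pdf', '.css', '.js',
                      '.png', '.jpg', '.gif', '.zip', '.gz'}

def keep_link(url, upvotes, min_upvotes=3):
    """Keep Reddit-submitted links with more than min_upvotes
    whose path does not end in a blocked file extension."""
    if upvotes <= min_upvotes:
        return False
    path = urlparse(url).path.lower()
    ext = re.search(r'\.[a-z0-9]+$', path)
    return ext is None or ext.group(0) not in BLOCKED_EXTENSIONS

links = [('http://example.com/article', 10),
         ('http://example.com/video.mp4', 50),
         ('http://example.com/post', 2)]
kept = [url for url, votes in links if keep_link(url, votes)]
print(kept)  # ['http://example.com/article']
```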

We also filter webpages to remove Wikipedia, as it is used by various evaluation benchmarks and datasets. We were not able to determine if our filtering criteria matched OpenAI’s since this information was never released. Text was extracted from HTML pages using the Newspaper Python library, and then filtered for only English text using the fastText library⁶. Specifically, we use the WhatTheLang Python wrapper⁷. We deduplicate documents using locality-sensitive hashing (LSH)⁸ ⁹ ¹⁰. We hashed the documents into sets of 5-grams, and removed all documents whose similarity exceeded a threshold of 0.5.
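The deduplication criterion can be illustrated with exact Jaccard similarity over word 5-gram sets; at corpus scale, LSH over MinHash signatures approximates this comparison without checking every document pair. This greedy sketch is for illustration only:

```python
def ngrams(text, n=5):
    """Set of word n-grams in a document."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two n-gram sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedupe(docs, threshold=0.5, n=5):
    """Drop any document whose 5-gram Jaccard similarity to an
    already-kept document exceeds the threshold. (LSH makes this
    sub-quadratic at scale; here we compare pairs exactly.)"""
    kept, kept_grams = [], []
    for doc in docs:
        grams = ngrams(doc, n)
        if any(jaccard(grams, kg) > threshold for kg in kept_grams):
            continue
        kept.append(doc)
        kept_grams.append(grams)
    return kept
```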

As a cleaning heuristic, documents with fewer than 128 tokens were removed from the dataset. These shorter documents tended to be lower quality, as determined by text coherence. We release this dataset as the OpenWebTextCorpus¹¹.

For encoding the dataset, we used the byte-pair encoder (BPE)¹² released with the small models from Radford et al.³
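BPE works by repeatedly merging the highest-priority adjacent symbol pair according to a learned merge table. A minimal sketch of applying a merge table to one word (the released encoder additionally uses byte-level fallbacks and a learned vocabulary of roughly 50k merges; the tiny merge list below is made up for illustration):

```python
def get_pairs(symbols):
    """All adjacent symbol pairs in the current segmentation."""
    return {(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)}

def bpe_encode(word, merges):
    """Apply BPE merges to a word; earlier entries in `merges`
    have higher priority."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        best = min(get_pairs(symbols), key=lambda p: rank.get(p, float('inf')))
        if best not in rank:
            break  # no applicable merges remain
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

merges = [('l', 'o'), ('lo', 'w'), ('e', 'r')]  # toy merge table
print(bpe_encode('lower', merges))  # ['low', 'er']
```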

We used a modified version of the OpenWebText web-scraping codebase¹³ as a starting point for our dataset collection.

Errata

From the publicly released collection of 260k documents from WebText³, we find that all have a minimum byte-pair encoding (BPE)¹² length of 40, and a maximum of 1024. OpenWebText differs in that we set a lower bound for document length at 128 tokens (instead of BPE codes), and do not restrict the maximum document length. Our corpus was released before these samples became available and therefore did not use this information when generating cleaning heuristics.

We made multiple attempts to contact Radford et al.³ to clarify evaluation and model details, but were ultimately unsuccessful.

Results

Despite the differences in our training distribution, we do report similar perplexities over most datasets.

Samples

Prompt: “Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!!”

Output: