OpenWebText2 is an enhanced version of the original OpenWebTextCorpus covering all Reddit submissions from 2005 up until April 2020, with further months becoming available after the corresponding PushShift dump files are released.

In case you haven't heard of WebText, the core principle is to extract URLs from Reddit submissions, scrape those URLs, then filter and deduplicate the results. See Background for more information.
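As a rough illustration of the URL stage, the sketch below groups submissions by URL, sums their scores, and keeps URLs whose combined score meets a threshold. The record layout (dicts with "url" and "score" keys), the function name, and the reading of "combined score" as a sum over submissions sharing a URL are all assumptions made for this example, not the project's actual code.

```python
from collections import defaultdict


def filter_urls_by_combined_score(submissions, min_score=3):
    """Sum scores per URL and keep URLs whose total is at least min_score.

    `submissions` is assumed to be an iterable of dicts with "url" and
    "score" keys (a hypothetical layout chosen for this sketch).
    """
    combined = defaultdict(int)
    for submission in submissions:
        combined[submission["url"]] += submission["score"]
    # Each URL appears once in the result, so this also deduplicates by URL.
    return {url: score for url, score in combined.items() if score >= min_score}


submissions = [
    {"url": "https://example.com/a", "score": 2},
    {"url": "https://example.com/a", "score": 1},
    {"url": "https://example.com/b", "score": 1},
    {"url": "https://example.com/c", "score": 5},
]
kept = filter_urls_by_combined_score(submissions)
print(kept)
```

Here the two submissions for the first URL are merged into a single entry, and the second URL is dropped because its total stays below the threshold.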

Download Plug and Play Version

This version has already been cleaned for you:

  • Deduplicated by URL
  • Filtered by a minimum combined Reddit score of 3
  • Deduplicated at document level with MinHashLSH
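To give a feel for document-level deduplication, here is a minimal MinHash sketch using only the standard library. It is not the pipeline used to build the dataset: the 5-word shingles, 64 hash seeds, and 0.5 threshold are arbitrary assumptions, and it compares every signature pairwise instead of using LSH banding, which is what makes the real approach scale to millions of documents.

```python
import hashlib


def shingles(text, size=5):
    """Return the set of lowercase word n-grams (shingles) in text."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(text, num_perm=64):
    """For each seed, record the minimum salted hash over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(documents, threshold=0.5):
    """Keep a document only if it is not too similar to any kept document."""
    kept, kept_signatures = [], []
    for document in documents:
        signature = minhash_signature(document)
        if all(estimated_jaccard(signature, other) < threshold
               for other in kept_signatures):
            kept.append(document)
            kept_signatures.append(signature)
    return kept


documents = [
    "The quick brown fox jumps over the lazy dog near the river bank",
    "the quick  brown fox jumps over the lazy dog near the river bank",
    "Language models are trained on large collections of web text data",
]
unique_documents = deduplicate(documents)
print(len(unique_documents))
```

The second document differs from the first only in case and spacing, so it produces the same shingles and the same signature and is treated as a duplicate.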

17,103,059 documents
65.86 GB uncompressed text
28 GB compressed including text and metadata

Download Raw Scrapes Version

Only deduplicated by URL.

69,547,149 documents
193.89 GB uncompressed text
79 GB compressed including text and metadata

Using The Data

The data is stored using lm_dataformat. We use a slightly modified version to allow file peeking for tqdm progress bars: utils/archiver.py. Be sure to call read_jsonl with get_meta=True as both versions contain useful metadata for each document, including several original Reddit fields.

import glob
import os
import math

import tqdm

from utils.archiver import Reader

document_count = 0
total_text_size = 0
dataset_directory = "PATH_TO_FILES"
files = glob.glob(os.path.join(dataset_directory, "*jsonl.zst"))
for file_path in tqdm.tqdm(files, dynamic_ncols=True):
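    reader = Reader()
    # get_meta=True yields (text, metadata) pairs instead of text alone.
    for document, metadata in reader.read_jsonl(file_path, get_meta=True):
        document_count += 1
        # len() counts characters, so the size below is in characters, not encoded bytes.
        total_text_size += len(document)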

billion = math.pow(10, 9)
print(f"Total Document Count: {document_count:,}")
print(f"Total Uncompressed Text Size: {(total_text_size / billion):.2f} GB")

Alternatively, check out The-Pile, which acts as an aggregator and dataloader for multiple text datasets. It lets you configure your total data size requirement along with the desired weighting for each subset. Once configured, you get a randomized stream of documents that can be fed directly to your language model.
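The idea of a weighted, randomized stream can be sketched in a few lines. This is not The-Pile's API: the function name, the dict-based configuration, and the use of a document count in place of a total data size are simplifications assumed for illustration.

```python
import random


def weighted_stream(subsets, weights, total_documents, seed=0):
    """Yield (subset_name, document) pairs, choosing the subset by weight.

    `subsets` maps a name to an iterable of documents and `weights` maps the
    same names to positive sampling weights (hypothetical configuration).
    """
    rng = random.Random(seed)
    iterators = {name: iter(docs) for name, docs in subsets.items()}
    names = list(iterators)
    produced = 0
    while produced < total_documents and names:
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            document = next(iterators[name])
        except StopIteration:
            names.remove(name)  # this subset is exhausted, stop sampling it
            continue
        yield name, document
        produced += 1


subsets = {
    "openwebtext2": [f"owt2-{i}" for i in range(100)],
    "other": [f"other-{i}" for i in range(100)],
}
weights = {"openwebtext2": 0.8, "other": 0.2}
sample = list(weighted_stream(subsets, weights, total_documents=10))
for name, document in sample:
    print(name, document)
```

Sampling per document keeps the mix close to the requested weights over a long stream, at the cost of stopping early if every subset runs out before the requested total is reached.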

Cite as

    title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
    author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
    journal={arXiv preprint arXiv:2101.00027},