OpenAI required around 40GB of high-quality text to train GPT-2. While Common Crawl provides the scale necessary for modern language models, its quality is unreliable. Manual curation of Common Crawl is always an option, albeit an expensive one. Thankfully, Reddit provides decentralized curation by design, and this became the key innovation behind the WebText dataset.
The generation of WebText can be summarized as:
- Scrape URLs from all Reddit submissions up to December 2017 with a score of 3 or higher.
- Deduplicate scraped content based on URL.
- Exclude Wikipedia, since OpenAI already had a separate Wikipedia dataset.
- Deduplicate remaining content using undisclosed "heuristic based cleaning". This includes removal of non-English web pages.
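
The first three steps above can be sketched in a few lines of Python. This is only an illustration, not OpenAI's code: the submission dictionaries with `url` and `score` keys, the `normalize_url` rule, and the `select_urls` helper are all assumptions made for the example. The Wikipedia check runs before URL deduplication here, which gives the same result as the order listed above.

```python
from urllib.parse import urlsplit

MIN_SCORE = 3  # score threshold stated in the list above


def normalize_url(url: str) -> str:
    """Assumed normalization: ignore scheme, 'www.', fragment and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def is_wikipedia(url: str) -> bool:
    """True for wikipedia.org and any of its language subdomains."""
    host = urlsplit(url).netloc.lower()
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def select_urls(submissions):
    """Keep the first occurrence of each non-Wikipedia URL with enough score."""
    seen = set()
    selected = []
    for sub in submissions:
        if sub["score"] < MIN_SCORE:
            continue  # not enough upvotes to count as curated
        url = sub["url"]
        if is_wikipedia(url):
            continue  # Wikipedia is excluded from the corpus
        key = normalize_url(url)
        if key in seen:
            continue  # same page already selected
        seen.add(key)
        selected.append(url)
    return selected
```

Each selected URL would then be scraped and passed on to the content-based cleaning stage.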
Neither the resulting corpus nor the code used to generate it was made public, inspiring Aaron Gokaslan and Vanya Cohen to create the OpenWebTextCorpus.
OpenWebTextCorpus is an open source reproduction of WebText, reifying the "heuristic based cleaning" stage with fuzzy deduplication and a minimum token length. For content-based deduplication they used locality-sensitive hashing (LSH) with MinHash on sets of 5-grams at the document level. Documents were then tokenized and any with fewer than 128 tokens were removed. After all processing there remained 40GB of text across 8,013,769 documents.
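
To make the fuzzy deduplication step concrete, here is a minimal standard-library sketch of MinHash with LSH banding over word 5-grams, followed by the 128-token filter. It is not the OpenWebTextCorpus code. The number of hash functions, the band count, the 0.5 similarity threshold, the use of word-level 5-grams, and whitespace splitting as a stand-in for the real tokenizer are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64      # assumed number of hash functions per signature
BANDS = 16         # assumed number of LSH bands (4 rows per band)
SHINGLE_SIZE = 5   # 5-grams, as described in the text
MIN_TOKENS = 128   # minimum document length, as described in the text


def shingles(text, n=SHINGLE_SIZE):
    """Return the set of word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, shingle):
    """64-bit hash of a shingle; the seed selects one of NUM_PERM hash functions."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=NUM_PERM):
    """Signature = minimum hash value of the shingle set under each hash function."""
    grams = shingles(text)
    if not grams:
        return tuple([2**64 - 1] * num_perm)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(num_perm))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=BANDS):
    """Documents sharing at least one identical band become candidate pairs."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


def deduplicate(docs, threshold=0.5, min_tokens=MIN_TOKENS):
    """Drop near-duplicates, then drop documents shorter than min_tokens."""
    sigs = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
    dropped = set()
    for a, b in sorted(lsh_candidate_pairs(sigs)):
        if a in dropped or b in dropped:
            continue
        if estimated_jaccard(sigs[a], sigs[b]) >= threshold:
            dropped.add(b)  # keep the first document of each near-duplicate pair
    return {
        doc_id: text
        for doc_id, text in docs.items()
        if doc_id not in dropped and len(text.split()) >= min_tokens
    }
```

Banding is what keeps this tractable at corpus scale: only documents that collide in at least one band are compared, rather than every pair. A maintained library such as `datasketch` would be a more practical choice than this hand-rolled version for 8 million documents.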
The original code for OpenWebTextCorpus is unavailable at this time, but there are several popular repositories that cover the pipeline to various degrees.
Our primary goals for the corpus are:
- More data! Coverage of the original OpenWebTextCorpus ended at December 2017.
- Include all languages, providing metadata for easy filtering.
- Provide two versions of the generated corpus for differing user requirements. Both versions will be broken up by month and frozen, with future months added once PushShift submission dumps become available.
  - Raw version containing all scraped pages with associated Reddit submission metadata.
  - Plug and play version based on submissions with a minimum score of 3, with content-based fuzzy deduplication.
- Provide full source code for all stages of the pipeline including deduplication.