In the previous blog posts we played with a large multi-gigabyte dataset. This 34 GB dataset is based on stackoverflow.com data. A couple days ago I found another great large dataset. This is a two terabyte snapshot from Reddit website. This dataset is perfect for text mining and NLP experimentation.

The image was taken from this web page.

1. Two terabytes data set

The full dataset contains two terabytes of data in JSON format. Thank you for Stuck_In_the_Matrix who created this dataset! The compressed version is 250 GB. You can find this dataset here in Reddit. You should use torrent to download this compressed data.

Additionally, you might find a 32 gb subset of this data in Kaggle website in SQLite format here. Also, you can play with the data online through R or Python in the Kaggle competition.

2. Easy to use 16 gigabytes subset

To simplify the process of working with this data, I created a subset of this data in plain text TSV format (tab separated values) here in my dropbox folder (updated, old Mac OS compatable only archive is here). The file contains the copy of the Kaggle subset. File size is 16GB uncompressed (yes, it is 2 times smaller than the Kaggel file because of plain text format without indexes) and 6.6GB in archive.

SQLite code for converting the Kaggle file to a plain text:

Note that I replace all tabs (X’09') and newlines (X’0A’) to spaces for all text columns. Please let me know if you know how to combine two character replacement to one operations.

3. Read data in Spark

Conclusion

Today is not easy to find great and interesting dataset for testing, training and research. So, let's collect some interesting datasets. Please share with the community your newly found information.

P.S.

I looked into the licensing of this dataset. The dataset publisher Stuck_In_the_Matrix just published the dataset and provided description and links to the torrent directly in the Reddit website. Please note that Reddit sponsors the Kaggle competition with this dataset. It appears that we may play with the dataset for non-business related purposes.