Many folks are very excited about big data. They like to play with, explore, work on and study this frontier. Most likely these folks either work with or would like to play with large amounts of data (hundreds of gigabytes or even terabytes). But here’s the thing: it’s not easy to find a multi-gigabyte dataset. Usually, these kinds of datasets are needed for experimenting with a new data processing framework such as Apache Spark or data streaming tools like Apache Kafka. In this blog post I will describe and provide a link to a simple and powerful multi-gigabyte Stack Overflow dataset.

1. Datasets for machine learning

Lots of sources exist for machine learning problems. Kaggle is one of the best sources for these problems, and it offers lots of datasets accompanied by code examples. Most of these datasets are clean and ready to use in your machine learning experiments.

In a data scientist’s real life, you most likely do not have the luxury of clean data, and the size of the input data creates an additional big problem. University courses as well as online courses offer a limited viewpoint on data science and machine learning because they teach students to apply statistical and machine learning methods to small amounts of clean data. In reality, a data scientist spends the majority of their time getting data and cleaning it up. According to Hal Varian (Google’s chief economist), “the sexiest job of the 21st century” belongs to statisticians (and, I assume, to data scientists). However, they perform “clean up” work most of the time.

In order to experiment with new data processing or data streaming tools, you need a large (larger than your computer can hold in memory) and uncleaned dataset.

Large and uncleaned datasets will allow you to practice actual data processing and build analytical skills. It turns out that such datasets are not that easy to find.

2. Datasets for processing

KDnuggets and Quora have pretty good lists of open repositories.

Most of the datasets on these lists are very small, and for the most part you need specific knowledge of the dataset’s business domain, such as physics or healthcare. However, for learning and experimentation purposes, it would be nice to have a dataset from a well-known domain that everyone is familiar with.

Social network data is the best because people understand these datasets and have intuition about the data, which is important in the analytic process. You might use a social network API to extract your own dataset. Unfortunately, a dataset you extracted yourself is not the best for sharing your analytical results with other people, because they cannot easily get the same data. It would be great to find a common social network dataset with an open license. And I’ve found one!

3. Stack Overflow open dataset

The Stack Overflow dataset is the only open social dataset that I was able to find. Stackoverflow.com is a question-and-answer website about programming. This website is especially useful when you have to write code in a language you are not familiar with. This well-known approach is called Stack Overflow driven development, or SDD. I believe all people from the high-tech industry are familiar with Stack Overflow, and many of them have an account on the website.

Stack Exchange (the company that owns stackoverflow.com) publishes the Stack Exchange dataset under an open Creative Commons license. You can find the freshest dataset on this page: https://archive.org/details/stackexchange

The dataset contains all Stack Exchange data, including Stack Overflow, and the overall size of the archive is 27 gigabytes. The size of the uncompressed data is more than 1 terabyte.

4. How to download and extract the dataset?

However, this dataset is not easy to get. First, you need to download the archive of the entire dataset. Please note that the download speed is very slow. They recommend using a BitTorrent client to download the archive, but it often has issues. Without BitTorrent, I made 3 attempts and spent 2 days downloading this archive. Next, you need to unpack the large archive. Finally, you need to decompress the subset of data that you need (like stackoverflow-Posts or travel.stackexchange) using the 7z compressor. If you don’t have 7z, you need to find it and install it on your machine.
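Since a plain download may break several times, here is a minimal Python sketch of resuming an interrupted download with an HTTP Range request instead of starting over. It assumes the server honors Range requests, which you should verify. The function names and the `.part` file name are mine, not part of any official tool.

```python
import os
import urllib.request


def build_resume_request(url, part_path):
    """Return a Request that asks only for the bytes missing from part_path."""
    # Size of whatever earlier, interrupted attempts already saved.
    offset = os.path.getsize(part_path) if os.path.exists(part_path) else 0
    request = urllib.request.Request(url)
    if offset > 0:
        # Standard HTTP Range header: send everything from byte `offset` on.
        request.add_header("Range", f"bytes={offset}-")
    return request


def resume_download(url, part_path, chunk_size=1024 * 1024):
    """Fetch the remaining bytes of url into part_path, one chunk at a time."""
    request = build_resume_request(url, part_path)
    with urllib.request.urlopen(request) as response:
        # 206 means the server honored the Range header, so we append.
        # On a plain 200 the server is sending the whole file again,
        # so the local copy is restarted from scratch.
        mode = "ab" if response.status == 206 else "wb"
        with open(part_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

You would call `resume_download(url, "archive.7z.part")` again after every failure, using the direct file link from the download page as `url`, and rename the file once the call finishes without an error. Calling it again on an already complete file will most likely raise an HTTP error, because there are no bytes left to request.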

After you download the archive from https://archive.org/details/stackexchange, extract all Stack Overflow related archives and uncompress each of them (all archives whose names start with stackoverflow.com):

stackoverflow.com-Posts.7z

stackoverflow.com-PostHistory.7z

stackoverflow.com-Comments.7z

stackoverflow.com-Badges.7z

stackoverflow.com-PostLinks.7z

stackoverflow.com-Tags.7z

stackoverflow.com-Users.7z

stackoverflow.com-Votes.7z

As a result you will see a set of xml files with the same names.
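If you prefer to script the extraction, the step above can be sketched in Python as follows. This assumes the `7z` command-line tool is installed and on your PATH (on some systems the binary has a different name, such as `7za`). The function names and the default file pattern are my own choices for illustration.

```python
import subprocess
from pathlib import Path


def extraction_commands(archives, out_dir):
    """Build one `7z x` command per archive, all extracting into out_dir."""
    commands = []
    for archive in archives:
        # `x` extracts with full paths; `-o<dir>` sets the output directory
        # (there is no space between -o and the directory name).
        commands.append(["7z", "x", str(archive), f"-o{out_dir}"])
    return commands


def extract_all(archive_dir, out_dir, pattern="stackoverflow.com-*.7z"):
    """Run 7z on every archive in archive_dir whose name matches pattern."""
    archives = sorted(Path(archive_dir).glob(pattern))
    for command in extraction_commands(archives, out_dir):
        # check=True raises CalledProcessError if 7z reports a failure.
        subprocess.run(command, check=True)
```

Calling `extract_all("downloads", "xml")` would then unpack every matching archive from a `downloads` folder into an `xml` folder. Both folder names are placeholders.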

5. How to use the dataset?

Let’s experiment with the dataset. The most interesting file is Posts.xml. This file contains 34 GB of uncompressed data, approximately 70% of which is Body text, that is, the text of the questions from the website. This amount of data most likely does not fit in your memory, so we might use an on-disk data manipulation or machine learning technology. This is a good chance to use Apache Spark and MLlib or your own custom solution.
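As one example of a custom solution, here is a minimal Python sketch that streams an XML file element by element instead of loading it all into memory. It assumes each post is stored as a `row` element with its fields as attributes. The attribute names and values in the sample are made up for illustration, so check them against the real file.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield the attributes of each <row> element without loading the whole file."""
    # Ask for "start" events as well so the root element can be grabbed first.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            # Drop already processed rows from the root so memory use stays flat.
            root.clear()


# Tiny in-memory stand-in for Posts.xml (assumed layout, made-up values).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Title="How do I parse a large XML file?" />
  <row Id="2" PostTypeId="2" Title="" />
</posts>"""

for row in iter_rows(io.BytesIO(SAMPLE)):
    print(row["Id"], row["PostTypeId"])
```

For the real file you would pass the path to Posts.xml instead of the in-memory sample. Only one row is kept in memory at a time, so the 34 GB size stops being a problem for simple passes over the data.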

Let’s take a look at how a Stack Overflow question looks in the file.

Stack Overflow example

In the file this post is represented by a single row. Note that because the text is HTML, the opening and closing p tags (<p> and </p>) are written as &lt;p&gt; and &lt;/p&gt; respectively.
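To make the escaping concrete, here is a small Python sketch. It escapes an HTML snippet the way it would appear inside an XML attribute, then shows that an XML parser decodes it back when reading the row. The row layout and its content are assumptions for illustration only.

```python
import html
import xml.etree.ElementTree as ET

body_html = "<p>Hello, world!</p>"

# Escaping turns markup characters into entities, which is how HTML
# can live inside an XML attribute value.
escaped = html.escape(body_html)
print(escaped)

# A made-up row in the assumed <row ... /> layout.
row_xml = f'<row Id="1" Body="{escaped}" />'

# The XML parser decodes the entities again when it reads the attribute.
row = ET.fromstring(row_xml)
print(row.get("Body"))
```

In other words, if you read the file with an XML parser, the Body attribute already contains plain HTML. You only need `html.unescape` if you process the raw lines as text yourself.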

I’ll provide Apache Spark code examples with this dataset in the next blog post. My scenario will include two parts: data preparation (data manipulation) and machine learning. Both parts will use the multi-gigabyte dataset as input.

Conclusion

The Stack Overflow dataset (https://archive.org/details/stackexchange) is probably the simplest and most interesting open multi-gigabyte dataset you can find that fits machine learning, data processing and data streaming scenarios. Please share if you have any information about other simple, open big dataset resources. This should help the community a lot.