I think I have been lucky that several of the projects I have worked on have exposed me to managing large volumes of data. The largest dataset was probably at MailChannels, though Livedoor.com also had some sizeable data for their bookstore and department store. Most of the pain with Livedoor’s data was from it being in Japanese. Other than that, it was pretty static. This was similar to the data I worked with at the BBC. You would be surprised at how much data can be involved with a single episode of a TV show. With any in-house generated data, the update size and frequency are much less dramatic, even if the data is being regularly pumped in from third parties.

Those Humans

The real fun comes when the public (that’s you guys) are generating data to be pumped into the system. MailChannels works with email, which is human generated (lies! 95% is actually from spambots). Humans are unpredictable. They suddenly all get excited about the same thing at the same time, they are demanding, impatient, and smell funny. The last will not be so much of a problem to you, but the other points will be if you intend to open your doors to user generated information.

Need More Humans

Opening your doors will not necessarily bring you the large volumes of data you require. In the beginning it will be a bit draughty, and you might consider closing those doors. What you need is more of those human eyeballs looking through their glowing screens at your data collectors. Making that a reality is the hard part. If you do get to that point, then you will probably not have “getting experience with large datasets” at the forefront of your mind. Like sex, working with large datasets is most important to those not working with large datasets. So we will look elsewhere.

Data Hunting

The Internet is really just one giant bucket of data soup. Sure, we all just see the blonde, brunette or redhead, but it’s actually just green squiggly codes raining down in uniform vertical lines behind the scenes. What you need to figure out is how to get those streams of data that are flying around in the tubes of the Internet to be directed through your tube, into your MongoDB NoSQL database, Solr search engine, Hadoop distributed file-system or Cassandra cluster.

“All I see now is blonde, brunette, redhead”

– Cypher, The Matrix

Data Sources

The trouble with most sources of data is that they are owned and the data is copyrighted or proprietary. You can scrape websites and, if you fly under the radar, you will get a dataset. Although, if you want a large dataset, then it will take a lot of scraping. Instead, you should look for data that you can acquire more efficiently and, hopefully, legally. If nothing else, collecting the data in a legal manner will help you sleep better at night and you have a chance at going on to use that data to build something useful.

Here’s a list of places that have data available, provided by my good friend, Geoff Webb.

Some others I think are worth looking at are Wikipedia, Freebase and DBpedia. Freebase pulls its data from Wikipedia on a regular basis, as well as from TVRage, Metacritic, Bloomberg and CorpWatch. DBpedia also pulls data from Wikipedia, as well as YAGO, WordNet and other sources.

You can download a dump of the Freebase dataset here. More information on these dump files can be found here.

Make It Yourself

The quickest way I’ve found to get a good feed of data is to generate it. Create an algorithm that simulates mass user sign-up, 10,000 tweets per second, or vast amounts of logging data. Generating timestamps is easy enough. Use a dictionary of words for the content. Download War And Peace, Alice In Wonderland or any other book that is now out of copyright if you need real strings of words for your fake tweets.
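As a rough illustration of the idea, here is a minimal Python sketch of a fake tweet generator. Everything in it is made up for the example: the function names, the record fields, the user naming scheme and the tiny sample vocabulary. In practice you would build the vocabulary from the full text of a public-domain book rather than one sentence.

```python
import random
import time

# Stand-in text; swap in the contents of a public-domain book for variety.
SAMPLE_TEXT = (
    "Alice was beginning to get very tired of sitting by her sister "
    "on the bank and of having nothing to do"
)


def build_vocabulary(text):
    """Split the text on whitespace and return its words in lowercase."""
    return [word.lower() for word in text.split()]


def fake_tweets(vocabulary, count, start_time, rate_per_second, seed=None):
    """Yield `count` fake tweet records with evenly spaced timestamps.

    The record layout (id, user, timestamp, text) is a hypothetical one
    chosen for this sketch, not any real service's format.
    """
    rng = random.Random(seed)
    for i in range(count):
        word_count = rng.randint(3, 12)
        yield {
            "id": i,
            "user": "user%05d" % rng.randint(0, 99999),
            "timestamp": start_time + i / rate_per_second,
            "text": " ".join(rng.choice(vocabulary) for _ in range(word_count)),
        }


if __name__ == "__main__":
    vocab = build_vocabulary(SAMPLE_TEXT)
    # Five records spaced as if 10,000 tweets arrived every second.
    for tweet in fake_tweets(vocab, 5, time.time(), 10000, seed=42):
        print(tweet)
```

Because it is a generator, you can ask for millions of records without holding them all in memory, and passing a seed makes a run repeatable when you want to compare two configurations.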

Get Loopy

I have built a load-testing system in which I recorded one day’s worth of data coming into the production system, then played it back at high speed on a loop into my load-testing system. I could simulate a hundred times the throughput or more. This could also be done with a smaller amount of data. If you can randomize the data a little so it’s slightly different each time around the loop, then even better.
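The replay idea can be sketched in a few lines of Python. This is not the system described above, just a minimal illustration under some assumptions: the recorded events are a list of (timestamp, payload) pairs sorted by time, and the function name and parameters are invented for the example.

```python
import random


def replay(events, period, loops, speedup, jitter=0.0, seed=None):
    """Yield (offset_seconds, payload) pairs for a looped, sped-up replay.

    events  -- list of (timestamp_seconds, payload), sorted by timestamp
    period  -- length of the recorded window in seconds (86400 for one day)
    loops   -- how many times to play the recording
    speedup -- time compression factor (100 means 100x the throughput)
    jitter  -- maximum random shift, in seconds, applied to each offset

    Offsets are measured from the start of the replay. With jitter,
    neighbouring events may swap order slightly.
    """
    rng = random.Random(seed)
    start = events[0][0]
    for loop in range(loops):
        for timestamp, payload in events:
            offset = (timestamp - start + loop * period) / speedup
            if jitter:
                offset = max(0.0, offset + rng.uniform(-jitter, jitter))
            yield offset, payload


if __name__ == "__main__":
    recorded = [(100.0, "login"), (160.0, "search"), (190.0, "logout")]
    for offset, payload in replay(recorded, period=100.0, loops=2, speedup=100):
        print(round(offset, 3), payload)
```

A sender would then wait until each offset before pushing the payload into the system under test, or ignore the offsets entirely and push as fast as it can. The payloads themselves could be randomized in the same loop, for example by swapping in different user names on each pass.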

Play In The Clouds

I would recommend you do not download the data to your home or work machine. Fire up an Amazon EC2 machine and store your data on an EBS volume. You’ll be able to turn it on and off so that you are only paying for the machine when you have time to play with it. You will be able to download the data much more quickly, thanks to the bandwidth available there. If you want to scale the data across a small cluster of machines, it will be easy to set up and tear down. The data will be in the cloud, so if you do get an idea of what to do with it then it will not take you two weeks to upload that data from your home machine to the cloud. It will already be there. On Amazon EC2 you can play with the same data on a cluster of small machines with limited RAM, or on an “extra large” machine with up to 64 GB of RAM. You can try all these different configurations very quickly and inexpensively. I mention Amazon EC2 because that is where my experience is, but the same applies to Rackspace and other cloud infrastructure providers.

Conclusion

There are data sources out there, but which data source you choose depends on which technology you wish to get experience working with. The experience should be of the technologies you are using, rather than what the data is. Certain datasets pair better with certain technologies. Simulating the data is another approach. You just need a clever way of generating and randomizing your fake data. Finally, you can use a hybrid approach: take real data and replay it on a loop, randomizing it as it goes through. Simulating the Twitter fire-hose should not be too hard, should it?

Please Leave A Comment

If you have other links to sources of good datasets, any code for simulating datasets or any ideas on this topic, then please leave a comment on this blog post. I will answer any questions as best I can and intend to write much more on this topic in future blog posts. Your feedback will help guide the content of these posts.
