Unlike previous World Cups, where I would normally be glued to the TV screen and various sites, I decided to do things a little differently as this year’s FIFA World Cup kicked off. I wanted to run a small, controlled real-time analytics experiment with MongoDB and MapReduce using real-time Twitter data. Yes, that meant missing most of the real-time World Cup fun, but I was still excited by the opportunity to play with one of the biggest real-time data sets I could lay my hands on. This post and (hopefully) the others in the series are a chronological record of the experiment and its results, along with some instructions. It is in no way intended to be a comprehensive tutorial of any sort. This is also a good time to add a disclaimer: all the data collected and mined as part of this process is used purely for testing. There is no commercial use intended.

The End Result

Prior to this, I have worked with other big data architectures using Hadoop and HDFS. The goal of this experiment is to do real-time analytics using MapReduce on MongoDB, running on consumer-grade hardware. If you are interested, the actual hardware used in this experiment is a mid-2010 iMac with 16GB RAM, a 3.06 GHz Intel Core i3 processor, regular HDDs and multiple FireWire-connected external HDDs to store the data dumps. All the mining and analytics was supposed to run in the background while I continued with my day job on the same machine.

A performance snapshot of my desktop

What I Mined

I knew FIFA had officially endorsed the hashtags below for following the World Cup across social networks.

English: #worldcup and #Brazil2014

Spanish: #Brasil2014

French: #Brésil2014 and #CM2014

German: #wm2014

Portuguese: #Copa2014 and #SorteioCopa2014

Arabic: #البرازيل_2014 and #كأس_العالم

For the opening game, I tracked just one of them, #WorldCup. As I expected, the number of tweets was pretty high, so I narrowed my mining window to run from 30 minutes before the start of the game until 30 minutes after the end of the game. What I did not expect was that the load on the system and the disk space used would be much lower than what I have experienced in the past. So for the upcoming games, my plan is to track all the official FIFA hashtags. Of course, it is too early to draw any conclusive comparisons with HDFS, but the trend seems to be there. Also, for now I am restricting all the mining to Twitter feeds only. I may add other networks later, but I believe I have more than enough data to play with from Twitter alone.
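The 30-minute padding around the game can be expressed as a simple timestamp check. The sketch below is only an illustration: the kickoff and final whistle times are placeholder values, and in_mining_window? is a hypothetical helper name. It relies on the created_at format that Twitter puts on each tweet.

```ruby
require 'time'

# Twitter's created_at format, e.g. "Fri Jun 13 18:43:55 +0000 2014"
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'

# Placeholder match times in UTC (assumptions); substitute the real ones.
KICKOFF       = Time.utc(2014, 6, 12, 20, 0, 0)
FINAL_WHISTLE = Time.utc(2014, 6, 12, 21, 50, 0)
PADDING       = 30 * 60 # 30 minutes, in seconds

# True when the tweet was created between 30 minutes before kickoff
# and 30 minutes after the final whistle.
def in_mining_window?(tweet, from: KICKOFF - PADDING, to: FINAL_WHISTLE + PADDING)
  created = Time.strptime(tweet['created_at'], TWITTER_TIME_FORMAT)
  created >= from && created <= to
end
```

A check like this can run either while mining (to decide whether to store a tweet) or afterwards, to trim a collection that was mined for longer than needed.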

The Technology

My primary consideration while choosing the technology stack for this experiment was to use open source options that can easily run on consumer-grade hardware.

Homebrew (or your favorite package manager)

Ruby 2.1.2 (with the gems twitterstream, twitter and mongo). This is optional; you may use the language of your choice to do the mining

MongoDB 2.6

OS X 10.9 (Mavericks, with Xcode)

Of course, this is a living stack. I expect to add a few more technologies as the World Cup progresses and we start to get our hands really dirty with the analytics. So please keep in mind that this is just a starter stack.

The Setup

Assuming you already have Homebrew and Ruby set up, let’s focus on the high-level steps needed to get MongoDB up and running in your environment.

Installation — The easiest way to install MongoDB on OS X is via homebrew. Just make sure your packages and Xcode are updated before you install.

$ brew update

$ brew install mongodb

If you get any missing packages or permission errors, fix them and retry the installation.

Configure — Before you do anything further, let’s spend a moment to understand the architecture of the installation. By default homebrew will install MongoDB at the following location on OS X

/usr/local/Cellar/mongodb

and the main configuration file here

/usr/local/etc/mongod.conf

Let’s edit this to specify the locations for the database files and the log file. In my case I used a location on an external FireWire-connected HDD.

$ vi /usr/local/etc/mongod.conf

Update the following to your locations

dbpath = /Volumes/Iomega_HDD/mongodb/data

logpath = /Volumes/Iomega_HDD/mongodb/data/mongo.log

Now you are ready to start MongoDB using

$ mongod --config /usr/local/etc/mongod.conf

Feel free to run this as a background process

Next, test the installation by logging into the MongoDB shell. In a new tab or terminal window you can invoke the shell as below

$ mongo

>

You can run MongoDB shell commands directly here. For instance, to shut down the server, you could do the following:

>use admin

switched to db admin

>db.shutdownServer()

>quit()

The Mining

There are many ways to mine data from Twitter using the developer APIs that Twitter provides. My familiarity with Ruby inclined me to use it as my main programming language for the data mining. Another reason I use Ruby for tasks like this one is that you can do a lot more with a lot less code. I also see no point in reinventing the wheel when there are stable gems out there that get the job done for you. I will post a link to the actual code once I clean it up and push it to a public repository on GitHub.
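As a rough illustration only (this is not the code used in the experiment), the core of a mining loop can be sketched in plain Ruby. Twitter's streaming endpoints deliver one JSON message per line, with blank keep-alive lines in between. In this sketch both ends are stand-ins: io is any IO-like object in place of the streaming connection, store is anything that responds to << in place of a MongoDB collection wrapper, and mine is a hypothetical method name.

```ruby
require 'json'
require 'stringio'

# Reads newline-delimited JSON from `io` and appends every tweet to `store`.
def mine(io, store)
  io.each_line do |line|
    line = line.strip
    next if line.empty?            # skip keep-alive blank lines
    begin
      doc = JSON.parse(line)
    rescue JSON::ParserError
      next                         # skip partial or malformed payloads
    end
    next unless doc.is_a?(Hash) && doc.key?('text') # ignore non-tweet messages
    store << doc
  end
  store
end

# Made-up sample stream: one tweet, one keep-alive line, one non-tweet message.
sample = StringIO.new(<<~STREAM)
  {"id_str":"1","text":"#WorldCup kicks off"}

  {"limit":{"track":5}}
STREAM

tweets = mine(sample, [])
```

Keeping the stream and the store as parameters means the same loop can be exercised against a small sample like this one before it is pointed at the live feed and the database.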

Early Stage Analytics

The mined data in itself is pretty boring, so let’s start to have some fun with it. If you are coming from an RDBMS world (like me), you will need some time to grasp the differences between a NoSQL database and a relational one. The most important one: MongoDB does not enforce a schema. Everything is stored as documents in a collection, and you manipulate the collection.

Before we can manipulate the collection, we need to find out its structure. There are a couple of different ways to do this; here is an easy one.

Log in to the MongoDB shell and switch to your database.

$ mongo

>use fifatweetmine01

switched to db fifatweetmine01

You can also check your mining status by querying the count of the collection.

>db.fifatweets.count();

Get the structure of your collection by querying for the first record.

>db.fifatweets.findOne();

This should return something similar to the following (output truncated for space):

{
	"_id" : ObjectId("539b467174a14b0601000002"),
	"created_at" : "Fri Jun 13 18:43:55 +0000 2014",
	"id" : NumberLong("477521766307618816"),
	"id_str" : "477521766307618816",
	"text" : "#Brasil2014",
	"source" : "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
	"truncated" : false,
	"in_reply_to_status_id" : null,
	"in_reply_to_status_id_str" : null,
	"in_reply_to_user_id" : null,
	"in_reply_to_user_id_str" : null,
	"in_reply_to_screen_name" : null,
	"user" : {
		"id" : 621109670,
		"id_str" : "621109670",....

Now that you know the structure and keys of the documents in your collection, you can run some preliminary analytics.

For example, to find out the number of tweets from verified accounts (apparent celebrities), you could do the following:

> db.runCommand( { count: 'fifatweets',
... query: { 'user.verified': true }
... } )
{ "n" : 2824, "ok" : 1 }

2824 is the answer you were looking for.
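The dotted key 'user.verified' in that query reaches into the nested user document. To make the idea concrete, here is a plain Ruby sketch of the same kind of count over a few made-up, in-memory documents. fetch_path and count_matching are hypothetical helpers written for this illustration, not part of any gem, and they only handle nested hashes (MongoDB's dot notation also reaches into arrays).

```ruby
# Resolve a dotted key such as 'user.verified' against a nested document.
# Returns nil when any step along the path is missing.
def fetch_path(doc, dotted_key)
  dotted_key.split('.').reduce(doc) do |value, key|
    value.is_a?(Hash) ? value[key] : nil
  end
end

# Count the documents whose fields equal every value given in `query`.
def count_matching(docs, query)
  docs.count do |doc|
    query.all? { |path, expected| fetch_path(doc, path) == expected }
  end
end

# Made-up sample documents, shaped like the tweet shown earlier.
tweets = [
  { 'text' => '#WorldCup',   'user' => { 'id' => 1, 'verified' => true } },
  { 'text' => '#Brasil2014', 'user' => { 'id' => 2, 'verified' => false } },
  { 'text' => 'no user key' }
]

verified_count = count_matching(tweets, 'user.verified' => true)
```

In this sketch a document without a user key simply does not match, so a missing field never raises an error.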

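As another example, to get the number of tweets that contained “Brasil”, you can query with a regular expression. Note that /Brasil/ is case-sensitive and matches anywhere in the tweet text, so it also counts hashtags such as #Brasil2014.

> db.runCommand( { count: 'fifatweets',
... query: { "text": /Brasil/ }
... } )
{ "n" : 85870, "ok" : 1 }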

These are some basic queries, but they will get you to a level where you are confident about the data collection process. In the upcoming posts I will try to get into the next level of analytics. I also expect the data to grow significantly as the World Cup warms up. It should be a fun journey.