This post is the result of an effort to show a coworker the vast number of possibilities that Apache Spark, along with other technologies, has to offer.

I wanted to show him how to do some simple yet very interesting analytics that would help him solve real problems by analyzing specific areas of a social network. Using a subset of the endless Twitter stream looks like the perfect choice, since it has everything we need: an endless, continuous data source (a stream) ready to be explored.

Our Goal

The point of this demonstration is to help my coworker get the insights he needs while showing the power of Apache Spark through its streaming capabilities and concise API.

Spark Streaming, Minimized

Spark Streaming is very well explained here, so we are going to skip some of the details of the Streaming API and go straight to using it (we will still explain the steps required to get what we need).

Setting Up Our App

Let’s see how to prepare our app before doing anything else.

A very standard initial configuration:
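A sketch of what that configuration might look like, assuming the spark-streaming-twitter integration; the app name, master, batch interval, and credential placeholders are all illustrative choices:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// App name, master, and batch interval are illustrative choices.
val conf = new SparkConf().setAppName("twitter-analytics").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")

val ssc = new StreamingContext(sc, Seconds(5))

// Credentials obtained from the Twitter developer site (placeholders here).
System.setProperty("twitter4j.oauth.consumerKey", "<consumerKey>")
System.setProperty("twitter4j.oauth.consumerSecret", "<consumerSecret>")
System.setProperty("twitter4j.oauth.accessToken", "<accessToken>")
System.setProperty("twitter4j.oauth.accessTokenSecret", "<accessTokenSecret>")

// The DStream of incoming tweets (twitter4j Status objects).
val tweets = TwitterUtils.createStream(ssc, None)
```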

Here, we have created the Spark context sc and set the log level to WARN to eliminate the noisy logging Spark generates. We have also created a streaming context ssc using sc. Then we set up our Twitter credentials (before doing this, we need to follow these steps), which we can get from the Twitter website. Now the real fun starts.

What Is Trending Right Now?

It is easy to know what is trending on Twitter at this precise moment; it is just a matter of counting the appearances of each tag in the stream. Let's see how Spark allows us to do this.
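A sketch of the counting pipeline, assuming tweets is the DStream of twitter4j statuses created during setup (the name and the output path are illustrative):

```scala
// Extract the hashtags from each tweet via twitter4j's entities.
val tags = tweets
  .flatMap(_.getHashtagEntities)
  .map(_.getText)

tags
  .countByValue()                    // (tag, count) pairs per batch
  .foreachRDD { rdd =>
    val sorted = rdd.sortBy(-_._2)   // most frequent tags first
    // Persist each batch so Splunk (or any other tool) can pick it up;
    // a timestamped directory avoids overwriting previous batches.
    sorted.saveAsTextFile(s"/tmp/trending/${System.currentTimeMillis}")
  }
```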

First, we get the tags from the tweets, count how many times each one appears, and sort them by that count. After that, we persist the result so we can point Splunk (or any other tool, for that matter) at it. We could build some interesting dashboards with this information to track the most popular hashtags. Based on this information, my coworker could create campaigns and use these popular tags to attract a bigger audience.

Analyzing Tweets

Now, we want to add functionality to get an overall opinion of what people think about a set of topics. For the sake of this example, let’s say that we want to know the sentiment of tweets about BigData and Food, two very unrelated topics.

There are several APIs for analyzing sentiments from tweets, but we are going to use an interesting library from The Stanford Natural Language Processing Group in order to extract the corresponding sentiments.

In our build.sbt file we need to add the corresponding dependencies.
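Something along these lines; the version numbers are assumptions and should be matched to your Spark distribution:

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"              % "1.6.1",
  "org.apache.spark" %% "spark-streaming"         % "1.6.1",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.6.1",
  "org.apache.spark" %% "spark-sql"               % "1.6.1",
  "edu.stanford.nlp"  % "stanford-corenlp"        % "3.5.1",
  "edu.stanford.nlp"  % "stanford-corenlp"        % "3.5.1" classifier "models"
)
```

The models classifier pulls in the trained CoreNLP models, which the sentiment annotator needs at runtime.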

Now, we need to select only those tweets we really care about by filtering the stream on certain hashtags (#). This filtering is quite easy thanks to Spark's unified API.

Let’s see how.
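One possible sketch, assuming tweets is the incoming DStream of twitter4j statuses (the variable names are illustrative):

```scala
// Pair each tweet with the lowercased set of its hashtags.
val taggedTweets = tweets.map { status =>
  (status, status.getHashtagEntities.map(_.getText.toLowerCase).toSet)
}

// Keep only the tweets mentioning both topics.
val relevantTweets = taggedTweets
  .filter { case (_, tags) => tags.contains("bigdata") && tags.contains("food") }
  .map { case (status, _) => status }
```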

Here, we get all the tags in each tweet and check that the tweet has been tagged with both #bigdata and #food.

Once we have our tweets, extracting the corresponding sentiment is quite easy. Let's define a function that extracts the sentiment from a tweet's content so we can plug it into our pipeline.
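Its signature might look like this; the return type name is hypothetical, and the implementation (backed by Stanford CoreNLP) appears at the end of the post:

```scala
// Hypothetical signature; SENTIMENT_TYPE is a placeholder for an
// enumeration of sentiment labels. Implementation shown at the end.
def detectSentiment(message: String): SENTIMENT_TYPE
```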

We are going to use this function assuming it does what it should, and we will show its implementation at the end, since it's not the focus of this post. To give an idea of how it works, let's see some tests around it.
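They could look roughly like this (ScalaTest style; the sentiment labels and the exact classifications are assumptions, since the real scores depend on the CoreNLP model):

```scala
// Illustrative checks; NEGATIVE / NEUTRAL / POSITIVE are assumed labels.
it("should detect a negative sentiment") {
  detectSentiment("I am feeling very sad and frustrated.") should equal (NEGATIVE)
}

it("should detect a neutral sentiment") {
  detectSentiment("I'm watching a movie") should equal (NEUTRAL)
}

it("should detect a positive sentiment") {
  detectSentiment("It was a nice experience.") should equal (POSITIVE)
}
```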

These tests should be enough to show how detectSentiment works.

Let’s see an example.
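A possible shape for this step, assuming a filtered stream of tweets (here called relevantTweets, an illustrative name) and the detectSentiment function described above:

```scala
// Build a DStream of (text, sentiment, tags) triples.
val data = relevantTweets.map { status =>
  val sentiment = detectSentiment(status.getText)
  val tags = status.getHashtagEntities.map(_.getText.toLowerCase)
  (status.getText, sentiment.toString, tags)
}
```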

data represents a DStream of the tweets we want, each paired with its associated sentiment and the hashtags within the tweet (here we should find the tags we used to filter).

SQL Interoperability

Now, we want to cross-reference the sentiment data with an external dataset that we can query using SQL. For my friend, it makes a lot of sense to be able to join the Twitter stream with his other datasets.

Let’s take a look at how we could achieve this.
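A sketch of this step; it assumes data holds the (text, sentiment, tags) triples described above, and the table and column names are illustrative:

```scala
import org.apache.spark.sql.SQLContext

// Turn each micro-batch of `data` into a DataFrame and expose it as a
// SQL table so it can be joined with other sources.
data.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  rdd.map { case (text, sentiment, tags) => (text, sentiment, tags.mkString(",")) }
     .toDF("text", "sentiment", "tags")
     .registerTempTable("sentiment")
}
```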

We have transformed our stream into a different representation (a DataFrame), which keeps all the Spark guarantees (it is resilient, distributed, and very fast), and exposed it as a table so that he (my friend) can use his beloved SQL to query different sources.

The table sentiment (which we defined from our DataFrame) can be queried like any other table in his system. Another possibility is to query other data sources (Cassandra, XML files, our own binary-formatted files) using Spark SQL and cross them with the stream.

Please, find more information about this topic here and here.

An example of querying a DataFrame is shown next.
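For instance, assuming the stream has been registered as a table named sentiment with text, sentiment, and tags columns (names assumed), a simple aggregation might look like:

```scala
// Count how many tweets fall into each sentiment bucket.
val results = sqlContext.sql(
  """SELECT sentiment, COUNT(*) AS mentions
     FROM sentiment
     GROUP BY sentiment""")

results.show()
```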

Windowed Operations

Spark Streaming has the ability to look back in the stream, a capability most streaming engines either lack or find very hard to implement.

In order to use windowed operations, it is recommended to checkpoint the stream, but this is an easy task. Please find more information about this here.

A very small example of this kind of operation is demonstrated next.
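For instance, counting hashtags over the last 60 seconds every 10 seconds might look like this; the window sizes and checkpoint path are illustrative, ssc is our streaming context, and tags stands for the hashtag stream extracted earlier:

```scala
// Windowed operations require checkpointing to be enabled.
ssc.checkpoint("/tmp/checkpoint")   // placeholder directory

// Count each tag over a 60-second window, sliding every 10 seconds.
val windowedTags = tags.countByValueAndWindow(Seconds(60), Seconds(10))

windowedTags.foreachRDD { rdd =>
  rdd.sortBy(-_._2).take(10).foreach(println)   // top 10 tags in the window
}
```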

Conclusions

Even though our examples are quite simple, we have solved a real-life problem: we can identify trending topics on Twitter, which helps us target and grow an audience. At the same time, we can access different data sets using a single set of tools, such as SQL.

Solving real-life problems in a simple way is what most people actually want, and Spark once again helps us achieve this.

Quite interesting results came back from #bigdata and #food at the same time; maybe people tweet about big data at lunchtime. Who knows?