Storm is a big data framework that is similar to Hadoop but fine-tuned to handle unbounded data streams. In this installment of Open source Java projects, learn how Storm builds on the lessons and success of Hadoop to deliver massive amounts of data in realtime, then dive into Storm's API with a small demonstration app.

When I wrote my recent Open source Java projects introduction to GitHub, I noted that Storm, a project submitted by developer Nathan Marz, was among GitHub's most watched Java repositories. Curious, I decided to learn more about Storm and why it was causing such a stir among GitHub's Java developers.

What I found is that Storm is a big data processing system similar to Hadoop in its basic technology architecture, but tuned for a different set of use cases. Whereas Hadoop targets batch processing, Storm is an always-active service that receives and processes unbounded streams of data. Like Hadoop, Storm is a distributed system that offers massive scalability for applications that store and manipulate big data. Unlike Hadoop, it delivers that data instantaneously, in realtime.

Open source licensing

Storm is a free and open source project. It is hosted on GitHub and available under the Eclipse Public License for use in both open source and proprietary software.

In this installment of the Open source Java projects series I introduce Storm. We'll start with an overview of Storm's architecture and use case scenarios. Then I'll walk through a demonstration of setting up a Storm development environment and building a simple application whose goal is to process prime numbers in realtime. You'll learn a bit about Storm and get your hands into its code, and you'll also get a little taste of its speed and versatility. If Storm is applicable to your application needs, you'll be ready for next steps after reading this article.

What is Storm?

Storm is a free and open source distributed realtime computation system that can be used with any programming language. It is written primarily in Clojure and supports Java by default. In terms of use cases, Storm is especially well suited to any of the following:

Realtime analytics

Online machine learning

Continuous computation

Distributed RPC

ETL (extract, transform, load)

Perusing the Storm user group discussion forum, I found that Storm has been used in some very interesting real-world scenarios. Given a large database for an online auction site, for instance, Storm was used to surface the top trending items in any category, in realtime. It has also been connected to the Twitter firehose and used to identify posting trends. Other potential uses for Storm include the following:

Watch live traffic to a website in order to understand user behavior and feed a recommendation engine.

Execute search operations in parallel (i.e., use parallel threads to search through different areas of a data set).

Compute video analytics, perform tagging, build tracks, and so forth.

See the Storm user group discussion forum for more real-world and speculative use cases for Storm.

Storm vs Hadoop

For many developers investigating Storm, the first question will be: How does it differ from Hadoop? The simple answer is that Storm analyzes realtime data while Hadoop analyzes offline data. In truth, the two frameworks complement one another more than they compete.

Hadoop provides its own file system (HDFS) and manages both data and code/tasks. It divides data into blocks and when a "job" executes, it pushes analysis code close to the data it is analyzing. This is how Hadoop avoids the overhead of network communication in loading data -- keeping the analysis code next to the data enables Hadoop to read it faster by orders of magnitude.

Hadoop's programming paradigm is the famous MapReduce. Basically, Hadoop partitions data into chunks and passes those chunks to mappers that map keys to values, such as the hit count for a resource on your website. Reducers then assemble those mapped key/value pairs into a usable output. The MapReduce paradigm operates quite elegantly but is targeted at batch data analysis. In order to leverage the full power of Hadoop, application data must be stored in the HDFS file system.
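The map/reduce flow is easy to sketch outside of Hadoop. The plain Java sketch below (no Hadoop involved; the MapReduceSketch class and hitCounts method names are my own) maps web-log entries to key/value pairs and reduces them to a hit count per resource, mirroring the example above:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceSketch
{
    // "Map" phase: each log entry becomes a (resource, 1) pair;
    // "Reduce" phase: the pairs for each key are summed into a hit count.
    public static Map<String, Integer> hitCounts( List<String> requests )
    {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for( String resource : requests )
        {
            Integer count = counts.get( resource );
            counts.put( resource, count == null ? 1 : count + 1 );
        }
        return counts;
    }

    public static void main( String[] args )
    {
        List<String> log = Arrays.asList( "/home", "/about", "/home", "/home" );
        System.out.println( hitCounts( log ) );
    }
}
```

Hadoop runs the same two phases distributed across a cluster, with the mappers executing next to the data blocks they read.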

Storm solves a different problem altogether. Storm is interested in understanding things that are happening in realtime -- meaning right now -- and interpreting them. Storm does not have its own file system and its programming paradigm is quite different from Hadoop's. Storm is all about obtaining streams of data from sources known as spouts (such as a Twitter feed or live web traffic to your site) and passing that data through various processing components, known as bolts. Storm's data processing mechanism is extremely fast and is meant to help you identify live trends as they happen. Unlike Hadoop, Storm doesn't care what happened yesterday or last week.

Some use cases have shown that Storm and Hadoop can work beautifully together (see Resources). For instance, you might use Storm to dynamically adjust your advertising engine to respond to current user behavior, then use Hadoop to identify the long-term patterns in that behavior. The important point is that you don't have to choose between Storm and Hadoop; rather, work to understand the problem you are trying to solve and then choose the best tool for the job.

Storm spouts and bolts

Storm has its own vernacular, but if you've studied Hadoop and other distributed data processing systems you should find its basic architecture familiar.

At the highest level, Storm is composed of topologies. A topology is a graph of computation: each node contains processing logic and each edge indicates how data is passed from one node to the next.

Inside of topologies you have networks of streams, which are unbounded sequences of tuples. Storm provides a mechanism to transform streams into new streams using spouts and bolts. Spouts generate streams: a spout might pull data from a site like Twitter or Facebook and then publish it in an abstract format. Bolts consume input streams, process them, and then optionally generate new streams.

A stream can be simple enough to be consumed by a single bolt, or complex enough to require multiple streams and hence multiple bolts. A bolt can do almost anything, including run functions, filter tuples, perform stream aggregations, join streams, or even execute calls to a database.

Figure 1 shows the relationship between spouts and bolts within a topology.

Figure 1. Spouts and Bolts within a Storm topology (click to enlarge)

As Figure 1 illustrates, a Storm topology can have multiple spouts and each spout can have one or more bolts. Additionally, each bolt can stand on its own or have additional bolts listening to it. For example, you might have a spout that observes user behavior on your site and a bolt that uses a database field to categorize the types of pages users are accessing. Another bolt might handle article views and add a tick to a recommendation engine for each unique view of each type of article. You would simply configure the relationships between spouts and bolts to match your own business logic and application demands.

As previously mentioned, Storm's data model is represented by tuples. A tuple is a named list of values of any type. Storm supports all primitive types, Strings, and byte-arrays and you can build your own serializer if you want to use your own object types. Your spouts will "emit" tuples and your bolts will consume them. Your bolts may also emit tuples if their output is destined to be processed by another bolt downstream. Basically, emitting tuples is the mechanism for passing data from a spout to a bolt, or from a bolt to another bolt.
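To make the consume/emit cycle concrete, here is a minimal bolt sketch against the classic backtype.storm API (the DoublerBolt name and its behavior are my own invention for illustration): it reads the first field of each incoming tuple and emits a new tuple downstream.

```java
package com.geekcap.storm;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.Map;

// Illustrative bolt: consumes a number tuple and emits its double
public class DoublerBolt extends BaseRichBolt
{
    private OutputCollector collector;

    @Override
    public void prepare( Map conf, TopologyContext context, OutputCollector collector )
    {
        this.collector = collector;
    }

    @Override
    public void execute( Tuple tuple )
    {
        int number = tuple.getInteger( 0 );          // consume the incoming tuple
        collector.emit( new Values( number * 2 ) );  // emit a new tuple downstream
        collector.ack( tuple );                      // acknowledge successful processing
    }

    @Override
    public void declareOutputFields( OutputFieldsDeclarer declarer )
    {
        declarer.declare( new Fields( "doubled" ) ); // name the field this bolt emits
    }
}
```

Any bolt wired after this one would receive tuples with a single field named "doubled".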

A Storm cluster

I've covered the main vernacular for the structure of a Storm-based application, but deploying to a Storm cluster involves another set of terms.

A Storm cluster is somewhat similar to a Hadoop cluster, but while a Hadoop cluster runs map-reduce jobs, Storm runs topologies. As mentioned earlier, one of the primary differences between map-reduce jobs and topologies is that map-reduce jobs eventually end, while topologies run until you explicitly kill them. Storm clusters define two types of nodes:

Master node: This node runs a daemon process called Nimbus. Nimbus is responsible for distributing code across the cluster, assigning tasks to machines, and monitoring the success and failure of units of work.

Worker nodes: These nodes run a daemon process called the Supervisor. A Supervisor is responsible for listening for work assignments for its machine. It then subsequently starts and stops worker processes. Each worker process executes a subset of a topology, so that the execution of a topology is spread across a multitude of worker processes running on a multitude of machines.

Storm and ZooKeeper

Sitting between the Nimbus and the various Supervisors is the Apache open source project ZooKeeper. ZooKeeper's goal is to enable highly reliable distributed coordination, mainly by acting as a centralized service for distributed cluster functionality.

Storm topologies are deployed to the Nimbus and then the Nimbus deploys spouts and bolts to Supervisors. When it comes time to execute spouts and bolts, the Nimbus communicates with the Supervisors by passing messages through ZooKeeper. ZooKeeper maintains all state for the topology, which allows the Nimbus and Supervisors to be fail-fast and stateless: if a Nimbus or Supervisor process goes down, the state of processing is not lost; work is reassigned to another Supervisor and processing continues. Then, when the Nimbus or a Supervisor is restarted, it simply rejoins the cluster and its capacity is added back to the cluster.

The relationship between the Nimbus, ZooKeeper, and the Supervisors is shown in Figure 2.

Figure 2. Nimbus communicates with Supervisors through ZooKeeper (click to enlarge)

Setting up a Storm development environment

Storm can run in one of two modes: local mode or distributed mode. In local mode, Storm executes topologies completely in-process by simulating worker nodes using threads. In distributed mode, it runs across a cluster of machines. We'll set up a local development environment as follows:

Download the latest release of Storm and decompress it to your local hard drive. Add the Storm bin directory to your PATH environment variable.

The bin directory includes the storm command, which is the command-line interface for interacting with a Storm cluster. You can learn more about each command-line option from the Storm documentation. At a high level, the storm command allows you to deploy topologies to a Storm cluster, as well as activate, deactivate, and kill them.

Creating a Storm application in local mode is a great way to test it before deploying it to a cluster, where the processing capacity will increase by orders of magnitude.
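A local-mode run needs no cluster at all. Below is a sketch of a launcher class (the LocalTopologyRunner name and the "prime-topology" label are my own) using the classic backtype.storm API; the NumberSpout and PrimeNumberBolt it wires together are the demo classes developed in the next section:

```java
package com.geekcap.storm;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.utils.Utils;

public class LocalTopologyRunner
{
    public static void main( String[] args )
    {
        // Wire the spout to the bolt: the bolt reads the spout's stream
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout( "numberSpout", new NumberSpout() );
        builder.setBolt( "primeNumberBolt", new PrimeNumberBolt() )
               .shuffleGrouping( "numberSpout" );

        Config conf = new Config();

        // Local mode: worker nodes are simulated with threads in this JVM
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology( "prime-topology", conf, builder.createTopology() );

        Utils.sleep( 10000 );  // let the topology run for ten seconds
        cluster.killTopology( "prime-topology" );
        cluster.shutdown();
    }
}
```

Deploying the same topology to a real cluster swaps LocalCluster for the storm command-line client; the builder code is unchanged.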

Storm demo: Crunching prime numbers

While you might not feel inclined to compute prime numbers for the rest of eternity, doing it once is a great demonstration of Storm's potential that is also easy to implement.

For this demo we'll create a NumberSpout that emits numbers, starting with 1 and incrementing until we kill the topology. To this spout we'll connect a PrimeNumberBolt that evaluates the number, which is passed in a Tuple, and prints the number out to the screen if it is prime. The NumberSpout and the PrimeNumberBolt are packaged together inside a PrimeNumberTopology class.

NumberSpout

Listing 1 shows the NumberSpout, which is responsible for emitting numbers, starting with 1 and incrementing until the spout is killed.

Listing 1. NumberSpout.java

package com.geekcap.storm;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

import java.util.Map;

public class NumberSpout extends BaseRichSpout
{
    private SpoutOutputCollector collector;
    private static int currentNumber = 1;

    @Override
    public void open( Map conf, TopologyContext context, SpoutOutputCollector collector )
    {
        this.collector = collector;
    }

    @Override
    public void nextTuple()
    {
        // Emit the next number
        collector.emit( new Values( new Integer( currentNumber++ ) ) );
    }

    @Override
    public void ack( Object id )
    {
    }

    @Override
    public void fail( Object id )
    {
    }

    @Override
    public void declareOutputFields( OutputFieldsDeclarer declarer )
    {
        declarer.declare( new Fields( "number" ) );
    }
}
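The PrimeNumberBolt itself is not listed in this excerpt. A minimal sketch consistent with the NumberSpout above, assuming a simple trial-division primality test, might look like this:

```java
package com.geekcap.storm;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.Map;

public class PrimeNumberBolt extends BaseRichBolt
{
    private OutputCollector collector;

    @Override
    public void prepare( Map conf, TopologyContext context, OutputCollector collector )
    {
        this.collector = collector;
    }

    @Override
    public void execute( Tuple tuple )
    {
        // Read the "number" field emitted by the NumberSpout
        int number = tuple.getInteger( 0 );
        if( isPrime( number ) )
        {
            System.out.println( number );
            collector.emit( new Values( number ) );
        }
        collector.ack( tuple );
    }

    // Simple trial division: check divisors up to the square root of n
    static boolean isPrime( int n )
    {
        if( n < 2 ) return false;
        for( int i = 2; i * i <= n; i++ )
        {
            if( n % i == 0 ) return false;
        }
        return true;
    }

    @Override
    public void declareOutputFields( OutputFieldsDeclarer declarer )
    {
        declarer.declare( new Fields( "prime" ) );
    }
}
```

Because the bolt both prints and re-emits each prime, you could attach yet another bolt downstream (for instance, one that persists primes to a database) without touching this class.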