A long-standing joke about the Hadoop ecosystem is that if you don't like an API for a particular system, wait five minutes and two new Apache projects will spring up with shiny new APIs to learn.

It's a lot to keep up with. Worse, it leads to a lot of work migrating to different projects merely to keep current. "We've implemented our streaming solution in Storm! Now we've redone it in Spark! We're currently undergoing a rewrite of the core in Apache Flink (or Apex)! … and we've forgotten what business case we were attempting to solve in the first place."

Enter Apache Beam, a new project that attempts to unify data processing frameworks with a core API, allowing easy portability between execution engines.

Now, I know what you're thinking about the idea of throwing another API into the mix. But Beam has a strong heritage. It comes out of Google's research, published in the MillWheel and FlumeJava papers, as well as operational experience in the years following their publication. It defines a somewhat familiar directed acyclic graph model for data processing, one capable of handling unbounded streams of data where out-of-order delivery is the norm rather than the exception.

But wait, I hear some of you cry. Isn’t that Google Cloud Dataflow? Yes! And no. Google Cloud Dataflow is a fully managed service where you write applications using the Dataflow SDK and submit them to run on Google’s servers. Apache Beam, on the other hand, is simply the Dataflow SDK and a set of "runners" that map the SDK primitives to a particular execution engine. Yes, you can run Apache Beam applications on Google Cloud Dataflow, but you can also use Apache Spark or Apache Flink with little to no changes in your code.

Ride with Apache Beam

There are four principal concepts of the Apache Beam SDK:

Pipeline: If you've worked with Spark, this is somewhat analogous to the SparkContext. All your operations will begin with the pipeline object, and you'll use it to build up data streams from input sources, apply transformations, and write the results out to an output sink.

PCollection: PCollections are similar to Spark's Resilient Distributed Dataset (RDD) primitive, in that they contain a potentially unbounded stream of data. They are built by pulling data from the input sources, then applying transformations.

Transforms: A processing step that operates on a PCollection to perform data manipulation. A typical pipeline will likely have multiple transforms operating on an input source (for example, converting a set of incoming strings of log entries into key/value pairs, where the key is an IP address and the value is the log message). The Beam SDK comes with a series of standard aggregations built in, and of course, you can define your own for your own processing needs.

I/O sources and sinks: Lastly, sources and sinks provide input and output endpoints for your data.
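The log-entry transform described above boils down to per-element logic that a pipeline applies to every item in a PCollection. Here's a minimal sketch of that logic in plain Python, outside the Beam SDK; the log format shown is a hypothetical one, assumed purely for illustration:

```python
import re

# Hypothetical log format, assumed for illustration:
# "<ip> - <message>", e.g. "10.0.0.1 - GET /index.html"
def to_key_value(log_line):
    """Split one log entry into an (ip, message) pair -- the kind of
    per-element logic a transform would apply across a PCollection."""
    match = re.match(r'(\d+\.\d+\.\d+\.\d+) - (.*)', log_line)
    return (match.group(1), match.group(2))

# A transform conceptually maps this function over every element:
pairs = [to_key_value(line) for line in [
    '10.0.0.1 - GET /index.html',
    '10.0.0.2 - POST /login',
]]
```

In a real pipeline, the list comprehension is replaced by a Beam transform applying the same function to each element as it flows through.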

Let’s look at a complete Beam program. For this, we’ll use the still-quite-experimental Python SDK and the complete text of Shakespeare’s "King Lear":

import re

import google.cloud.dataflow as df

p = df.Pipeline('DirectPipelineRunner')
(p
 | df.Read('read',
           df.io.TextFileSource(
               'gs://dataflow-samples/shakespeare/kinglear.txt'))
 | df.FlatMap('split', lambda x: re.findall(r'\w+', x))
 | df.combiners.Count.PerElement('count words')
 | df.Write('write', df.io.TextFileSink('./results')))
p.run()

After importing the regular expression and Dataflow libraries, we construct a Pipeline object and pass it the runner we wish to use (in this case, DirectPipelineRunner, the local test runner).

From there, we read from a text file (with a location pointing to a Google Cloud Storage bucket) and perform two transformations. The first is FlatMap, into which we pass a regular expression that breaks each string up into words, returning a PCollection of all the separate words in "King Lear." Then we apply the built-in Count operation to do our word count.
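What those two transforms compute can be sketched in plain Python, with a list comprehension standing in for FlatMap and the standard library's Counter standing in for Count.PerElement (this is only the local logic, not the distributed execution):

```python
import re
from collections import Counter

lines = ['Blow, winds, and crack your cheeks!', 'rage! blow!']

# FlatMap: apply the regex to each line, then flatten the per-line
# word lists into a single stream of words.
words = [w for line in lines for w in re.findall(r'\w+', line)]

# Count.PerElement: tally occurrences of each distinct element.
counts = Counter(words)
```

Note that the match is case-sensitive, so "Blow" and "blow" are counted as separate elements, just as they would be by the pipeline above.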

The final part of the pipeline writes the results of the Count operation to disk. Once the pipeline is defined, it is invoked with the run() method. In this case, the pipeline is submitted to the local test runner, but by changing the runner type, we could submit to Google Cloud Dataflow, Flink, Spark, or any other runner available to Apache Beam.
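Swapping execution engines is, in principle, just a change to the argument passed when constructing the pipeline. A sketch of the idea, with the caveat that the remote runner name and its required options shown here are assumptions based on the Dataflow SDK of the era, not a verified recipe for every runner:

```python
import google.cloud.dataflow as df

# Local testing: the whole pipeline runs in-process.
p = df.Pipeline('DirectPipelineRunner')

# To target a remote engine instead, change only the runner, e.g.:
# p = df.Pipeline('DataflowPipelineRunner')  # illustrative name;
# each runner may require its own configuration options.
```

The pipeline definition itself (reads, transforms, writes) stays the same; only the runner selection changes.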

Runners dial zero

Once we have the application ready, it can be submitted to run on Google Cloud Dataflow with no trouble, as it is simply using the Dataflow SDK.

The idea is that runners will be provided for other execution engines. Beam currently includes runners supplied by data Artisans and Cloudera for Apache Flink and Apache Spark, respectively. This is where some of the current wrinkles of Beam come into play, because the Dataflow model does not always map easily onto other platforms.

A capability matrix available on the Beam website shows you which features are and are not supported by the runners. In particular, there are extra hoops you need to jump through in your code to get the application working on the Spark runner. It’s only a few lines of extra code, but it isn’t a seamless transition.

It's also interesting to note that the Spark runner is currently implemented using Spark's RDD primitive rather than DataFrames. As this bypasses Spark's Catalyst optimizer, it's almost certain right now that a Beam job running on Spark will be slower than running a DataFrame version. I imagine this will change when Spark 2.0 is released, but it's definitely a limitation of the Spark runner over and above what's presented in the capability matrix.

At the moment, Beam includes runners only for Google Cloud Dataflow, Apache Spark, Apache Flink, and a local runner for testing purposes -- but there's talk of creating runners for frameworks like Storm and MapReduce. In the case of MapReduce, any eventual runner will be able to support only a subset of what Apache Beam provides, as it can work only with what the underlying system offers. (No streaming for you, MapReduce!)

Grand ambitions

Apache Beam is an incredibly ambitious project. Its ultimate goal is to unify all the data processing engines under one API -- and make it trivially easy to migrate, say, your Beam application running on a self-hosted Flink cluster over to Google Cloud Dataflow.

As somebody who has to develop these applications, this is great. It's clear that Google has spent years refining the Beam model to cover most of the data processing patterns that many of us will need to implement. Note, however, that Beam is currently an "incubating" Apache project, so you'll want to exercise caution before putting it into production. But it's worth keeping a close eye on Beam as it incorporates more runners -- and ports the Beam SDK to more languages.