16th June, 2015

Earlier this week I posted about how the cloud can help remove the constraints of working with data productively. When it comes to big data tools or techniques, there are three variables that impact productivity, that is, the ability to get real work done efficiently:

Speed of provisioning: the faster the better. Tools which are quick to set up and access are beneficial in two ways: firstly, they are easy to evaluate to see if they are a good fit, since the investment required is minimal; and secondly, those same speed benefits pay off on every subsequent use.

Resource fit: all the speed in the world doesn’t do you much good if you don’t have sufficient resources for the task in hand. Likewise, having resources available but being unable to access them or put them to work is frustrating (and wasteful). A range of resource sizes and shapes helps to create a perfect fit between your workload and the resourcing, so you don’t end up trying to fit a square peg into a round hole.

Iterative by default: building apps, big data or otherwise, is an iterative process which benefits from low-cost experimentation. The ability to rapidly and easily build, test, refine, reject or evolve the logic and architecture is hugely valuable. A perfect fit of resources, quickly available, isn't all that useful if it's effectively frozen in time while your requirements change from day to day.

Pairing infrastructure services which aim to meet these requirements with software that is designed from the outset to support (and in many cases, accelerate) the iterative nature of building applications produces something greater than the sum of its parts in terms of actually getting real work done.

Enter Spark

Apache Spark is one such tool. If you're unfamiliar, Spark uses a mixture of in-memory data storage (so-called resilient distributed datasets, or RDDs), graph-based execution and a programming model designed to be easy to use. The result is a highly productive environment for data engineers and scientists to crunch data at scale (in some cases, 10x to 100x faster than Hadoop map/reduce).

Today, at the Spark Summit in San Francisco, it was a pleasure to announce that we're coupling the speed of provisioning and broad resource mix of Amazon EMR with the iteration-friendly programming model of Apache Spark. More on the AWS blog.
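For the provisioning side, launching a Spark cluster is a single CLI call. This is a hedged sketch, not a copy of the launch docs: the cluster name, key pair, instance type and count are all placeholders you'd swap for your own.

```shell
# Launch a three-node EMR cluster with Spark installed.
# "myKey" is a placeholder EC2 key pair name; adjust sizes to your workload.
aws emr create-cluster \
  --name "SparkCluster" \
  --applications Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=myKey
```

A few minutes later you have a running cluster, which is exactly the speed-of-provisioning point above.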

Spark has already been put into production on EMR by folks such as Yelp, the Washington Post and Hearst, and I'm excited to see how better support in the console and the EMR APIs helps bring Spark to a broader audience.