The Technology Highlight series covers the leading open source tools in the field of data engineering. In the first post of this series, we discuss how Insight Fellows have used Apache Spark — one of the most popular emerging technologies for processing large-scale data.

At the Insight Data Engineering Fellows Program, software engineers and recent graduates learn cutting-edge, open source technologies by building data pipelines and platforms that can handle vast amounts of data. This gives our Fellows the opportunity to spend three weeks learning and comparing the latest technologies in the data engineering ecosystem to discover best practices and use cases for each tool. One of the most exciting technologies they have explored is the new data processing engine, Spark.

Beyond MapReduce

Unlike Hadoop MapReduce — which was designed back when storing massive amounts of data was only feasible with relatively slow hard disks on a distributed cluster of machines — Spark is optimized to process data using a cluster’s memory, which has become faster and cheaper in recent years. Additionally, Spark uses a general computational model that goes beyond MapReduce, allowing users to string together several types of operations rather than being constrained to pairs of Map and Reduce phases on data sets. These chains of operations enable interactive queries, graph processing, and iterative machine learning algorithms that can be hundreds of times faster than corresponding approaches using MapReduce alone.
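
As a rough sketch of what this looks like in practice, the snippet below chains parsing, a join, and an aggregation into a single Spark job; the input paths and record layouts are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ChainedOps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chained-ops"))

    // Hypothetical inputs: "userId\tpageUrl" page views and "userId\tcountry" profiles.
    val views = sc.textFile("hdfs:///logs/pageviews")
      .map(_.split("\t")).collect { case Array(user, url) => (user, url) }
    val profiles = sc.textFile("hdfs:///logs/profiles")
      .map(_.split("\t")).collect { case Array(user, country) => (user, country) }

    // Several operation types chained in one job: map, join, and aggregation,
    // with no requirement to express each step as a Map/Reduce pair.
    val viewsByCountry = views
      .join(profiles)                                   // (user, (url, country))
      .map { case (_, (_, country)) => (country, 1) }
      .reduceByKey(_ + _)

    viewsByCountry.collect().foreach(println)
    sc.stop()
  }
}
```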

The distributed collections of data that these operations produce and consume are known as Resilient Distributed Datasets (RDDs). RDDs are “distributed” because their data is partitioned across the machines in the cluster, which share the processing work. They get their resiliency because Spark tracks the lineage of transformations that produced each dataset rather than copying the transformed data at every step. If a node fails or partitions are lost, the data can be efficiently re-computed by tracing back that lineage. With all these advantages, companies are quickly starting to transition away from Hadoop MapReduce, and our Fellows were eager to stay at the forefront of the field by building their projects with Spark.
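
A quick way to see this lineage is from the Spark shell. The dataset below is made up, but the pattern is general: transformations only record what to do, and the recorded chain is what Spark replays if partitions are lost.

```scala
// In the Spark shell, `sc` is already available.
val numbers = sc.parallelize(1 to 1000000, numSlices = 8)
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// No data has been copied so far; Spark has only recorded the chain of
// transformations. toDebugString prints that lineage, which is what Spark
// replays to rebuild lost partitions after a failure.
println(squares.toDebugString)

squares.cache()          // keep the result in memory once it is computed
println(squares.count()) // the action that actually runs the lineage
```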

Spark — A ‘functional’ technology

Our most recent batch of Fellows picked up Spark right at the outset of our program, beginning with an introduction from Spark contributor and Field Engineer at Databricks, Chris Fregly. They were able to quickly start using the Python and Java APIs, and many of them even learned Spark’s native language, Scala, to take advantage of the latest features. For example, one Fellow built a platform for collecting and analyzing the comments and real-time tweets on the most popular news articles using Spark and Scala’s functional programming style.
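
As an illustration of that functional style (not the Fellow's actual code), here is a small Scala sketch that extracts the top hashtags from a hypothetical file of tweet bodies.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TopHashtags {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("top-hashtags"))

    // Hypothetical input: one tweet body per line.
    val tweets = sc.textFile("hdfs:///tweets/news_articles")

    // Pure functions composed over the distributed collection,
    // exactly as they would be over a local Scala collection.
    val topTags = tweets
      .flatMap(_.split("\\s+"))
      .filter(_.startsWith("#"))
      .map(tag => (tag.toLowerCase, 1))
      .reduceByKey(_ + _)
      .top(20)(Ordering.by { case (_, count) => count })

    topTags.foreach(println)
    sc.stop()
  }
}
```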

Beyond its processing power as a standalone tool, Fellows selected Spark because it is compatible with many of the most popular technologies in the data engineering ecosystem. For example, one of the Fellows constructed a photo-sharing system that aggregates photos by popularity for different locations, topics, and time periods. Because Spark integrates with the Hadoop Distributed File System (HDFS) and databases like HBase and Elasticsearch, data engineers can continue using the tools specialized for particular use cases like geographic search and time-series data.
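
For instance, assuming the elasticsearch-hadoop (elasticsearch-spark) connector is on the classpath, a pipeline along these lines can read photo metadata from HDFS and index it into Elasticsearch; the paths, fields, and index name here are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // saveToEs comes from the elasticsearch-hadoop connector

object PhotoIndexer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("photo-indexer")
      .set("es.nodes", "localhost")      // Elasticsearch cluster to write to

    val sc = new SparkContext(conf)

    // Hypothetical input on HDFS: "photoId,location,topic,likes" per line.
    val photos = sc.textFile("hdfs:///photos/metadata.csv")
      .map(_.split(","))
      .collect { case Array(id, location, topic, likes) =>
        Map("id" -> id, "location" -> location, "topic" -> topic, "likes" -> likes.toInt)
      }

    // Index the records so they can be served by Elasticsearch's text and geo search.
    photos.saveToEs("photos/metadata")
    sc.stop()
  }
}
```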

One framework, multiple tools

The Fellows also chose Spark because it offers a unified framework across several levels of abstraction. For example, another Fellow used the built-in Spark SQL library, which provides many of the high-level features and operations of relational tools like MySQL but with the ability to easily scale to larger volumes of data distributed across a cluster. This allowed him to build a platform for easily querying millions of replay files from the real-time strategy game StarCraft II in order to investigate how different playing styles performed on different in-game maps.
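
A minimal sketch of that workflow with the Spark 1.x SQLContext API, using hypothetical JSON replay summaries and field names:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReplayQueries {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("replay-queries"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical input: one JSON document per parsed replay,
    // e.g. {"map": "...", "winnerStyle": "...", "durationSec": 1234}.
    val replays = sqlContext.read.json("hdfs:///replays/*.json")
    replays.registerTempTable("replays")

    // Familiar SQL, executed in parallel across the cluster.
    val winRates = sqlContext.sql(
      """SELECT map, winnerStyle, COUNT(*) AS wins
        |FROM replays
        |GROUP BY map, winnerStyle
        |ORDER BY wins DESC""".stripMargin)

    winRates.show(20)
    sc.stop()
  }
}
```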

Spark Streaming and Lambda Architecture

Many of the Fellows also used the Spark Streaming library to handle large amounts of real-time data. Since Spark can complete operations on data very quickly using in-memory computing, Spark Streaming can split a stream of incoming data into small micro-batches and process each of them on the fly.
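
A minimal Spark Streaming sketch, using a hypothetical socket source and a one-second batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch")
    // Each incoming second of data becomes one small RDD (a micro-batch).
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical source: a TCP socket emitting one "userId,action" event per line.
    val events = ssc.socketTextStream("localhost", 9999)

    // The same RDD-style operations are applied to every micro-batch as it arrives.
    val counts = events
      .map(line => (line.split(",")(0), 1))
      .reduceByKey(_ + _)

    counts.print()      // show a sample of each batch's results

    ssc.start()
    ssc.awaitTermination()
  }
}
```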

One Fellow used this feature to build a platform, load-tested with millions of users, for analyzing real-time messages from pet owners all over the country. This system used the pre-built Kafka and Cassandra connectors, which allow Spark to consume streaming data and seamlessly push RDDs to a database with simple functions.
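
A sketch of that kind of pipeline, assuming the spark-streaming-kafka and DataStax spark-cassandra-connector packages are available; the topic name, keyspace, and message format are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStreams
import com.datastax.spark.connector.SomeColumns

object MessagesPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("pet-messages")
      .set("spark.cassandra.connection.host", "localhost")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Consume the hypothetical "messages" topic from Kafka (receiver-based API).
    val messages = KafkaUtils
      .createStream(ssc, "localhost:2181", "pet-consumers", Map("messages" -> 1))
      .map { case (_, value) => value }

    // Hypothetical payload "userId,petType,text" mapped to rows for a Cassandra table.
    val rows = messages
      .map(_.split(",", 3))
      .filter(_.length == 3)
      .map { case Array(user, pet, text) => (user, pet, text) }

    rows.saveToCassandra("pets", "messages", SomeColumns("user_id", "pet_type", "text"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```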

Spark Streaming was used by another Fellow to create a real-time fantasy football platform that tracks and updates points for five million users. She accomplished this by joining streaming data from NFL games with a large static dataset containing information about the teams in the fantasy league. She also implemented this system using the Lambda Architecture, which provides reliable, real-time access to computations on large datasets by calculating results simultaneously in a batch layer and a speed layer. Since Spark and Spark Streaming operate on similar models, she found it easy to adapt the code that processes static data to work for the streaming data as well. She also wrote a user-defined function (UDF) to bundle together several small files from Spark Streaming for efficient storage on HDFS.
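
A rough sketch of the stream-to-static join, with hypothetical roster and scoring-event formats; the Fellow's actual schemas and point calculations were of course more involved.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FantasyPoints {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fantasy-points")
    val ssc  = new StreamingContext(conf, Seconds(5))
    val sc   = ssc.sparkContext

    // Static dataset: hypothetical "playerId,fantasyTeam" roster, loaded once and cached.
    val rosters = sc.textFile("hdfs:///fantasy/rosters.csv")
      .map(_.split(","))
      .collect { case Array(player, team) => (player, team) }
      .cache()

    // Streaming dataset: hypothetical "playerId,points" scoring events from live games.
    val scoring = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .filter(_.length == 2)
      .map { case Array(player, pts) => (player, pts.toDouble) }

    // transform() lets each micro-batch be joined against the static RDD
    // with the same code one would write for a purely batch job.
    val teamPoints = scoring.transform { batch =>
      batch.join(rosters)                                   // (player, (points, team))
        .map { case (_, (pts, team)) => (team, pts) }
        .reduceByKey(_ + _)
    }

    teamPoints.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```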

Another Fellow followed the Lambda Architecture with Spark to build a pipeline for tracking tweets about stocks and analyzing correlations between stock prices and the sentiment of the tweets. In addition, he came up with creative solutions to work around the fact that the time-to-live (TTL) feature was not yet supported in the stable release of the Spark-Cassandra connector. To account for this, he designed a data model that effectively supports TTL by ignoring older data once it expires past a given age.
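
One way such read-side expiry could look (a sketch under an assumed table layout, not the Fellow's actual data model) is to store a day column and simply skip rows older than the TTL window when reading:

```scala
import java.time.LocalDate
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds cassandraTable to SparkContext

object ReadSideTtl {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("read-side-ttl")
      .set("spark.cassandra.connection.host", "localhost")
    val sc = new SparkContext(conf)

    // Hypothetical table stocks.tweet_sentiment with a text "day" column
    // (e.g. "2015-06-01"). Rows past the TTL window are simply never read,
    // which emulates expiry without connector-level TTL support.
    val ttlDays = 30
    val cutoff  = LocalDate.now().minusDays(ttlDays).toString

    val recentSentiment = sc
      .cassandraTable("stocks", "tweet_sentiment")
      .filter(row => row.getString("day") >= cutoff)

    println(recentSentiment.count())
    sc.stop()
  }
}
```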

Incremental progress in GraphX

A particularly interesting project was an in-depth look at the Spark GraphX module. While the current algorithm in the GraphX module can efficiently find shortest paths for static graphs, it simply re-calculates the shortest paths for the entire graph whenever the graph changes, even slightly. As a motivating example, consider a member of a social network searching for a friend by name — the most relevant person is often the “nearest” in the graph, and can be identified by computing the shortest path. Once a new friendship is formed, the search should update the shortest distance without re-processing all of the unchanged parts of the graph.
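
The existing building block here is GraphX's ShortestPaths library. The sketch below runs it over a tiny, made-up friendship graph; after any edge is added, the whole computation has to be repeated from scratch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.ShortestPaths

object FriendSearch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("friend-search"))

    // A tiny hypothetical friendship graph; friendships are undirected,
    // so each pair is stored as two directed edges.
    val pairs = Seq((1L, 2L), (2L, 3L), (3L, 4L), (1L, 5L))
    val edges = sc.parallelize(pairs.flatMap { case (a, b) =>
      Seq(Edge(a, b, 1), Edge(b, a, 1))
    })
    val friends = Graph.fromEdges(edges, defaultValue = 0)

    // Hop counts from every user to the searching user (vertex 1).
    val distances = ShortestPaths.run(friends, landmarks = Seq(1L))
    distances.vertices.collect().foreach(println)

    // If a new friendship edge is added, this whole computation is repeated;
    // the Fellow's package updates only the affected paths instead.
    sc.stop()
  }
}
```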

This Fellow developed a new package on top of the existing graph framework to efficiently process incrementally changing graphs. By tracking the state of the graph and only updating the relevant shortest paths for the parts that have changed, the algorithm can save up to 90 percent of the computations for randomly changing graphs. Once further features and optimizations are added, this project will be open-sourced and submitted to the Spark packages community.

Looking forward

Overall, the Fellows had a great experience learning and using Spark in their projects. While it can be difficult to pick up an emerging tool, their experience demonstrates that the Spark framework is an approachable and promising technology that still has room for meaningful contributions. We believe it will become one of the dominant tools for processing large-scale data in the next few years, and our Fellows will definitely continue to explore it in future sessions.