ApacheCon is coming up, and within that massive conference there will be a glimmering gem: a forum dedicated to Spark.

Reynold Xin is organizing it, and he shared some valuable information with me about Apache Spark: what it is, why it's special, and what news they have to share this year.

The Spark Forum will have speakers from the Hive, Pig, and Sqoop projects, plus two talks about Spark Streaming (one introductory, the other developer-focused) and one about real-world data science using Spark. Xin says there's great synergy between many Apache Software Foundation (ASF) projects and Spark, so the forum will be an opportunity to see the progress made and share thoughts on the projects' roadmaps.

Read more in this interview. See the schedule for the Spark Forum.



What's Apache Spark?

Spark is a fast, general-purpose distributed data processing platform. With over 500 contributors, it is the most actively developed open source Big Data project, and it is also the most active project at the Apache Software Foundation.

Why is it special?

Three major things set Spark apart from previous-generation tools:

1. Ease of use: Spark's design makes distributed data programming similar to single-node programming, and its API is much easier to use than those of other tools. Just as important, developers can now unit-test their Spark applications out of the box using standard unit-testing frameworks (e.g. JUnit, ScalaTest), which substantially improves productivity.

2. Speed: Spark was initially designed to address the performance concerns with Hadoop MapReduce, and thus performance optimization is a continual focus of the project. As an example, Spark holds the current world record in 100 TB sorting. Many early users were also initially drawn to Spark because of the performance improvements over Hadoop MapReduce.

3. Versatility (or unification): Spark is incredibly versatile. You can run many different distributed computing paradigms on top of it, e.g. SQL, streaming, machine learning, and graph computation. This means developers can consolidate their IT infrastructure and reduce the number of systems they need to learn and maintain. It also enables applications that were not possible before, such as easily integrating machine learning algorithms with live, streaming data.
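The out-of-the-box unit testing mentioned in point 1 works because Spark can run in "local mode" inside a single JVM. A minimal, illustrative sketch in Scala using ScalaTest (the suite name and input data here are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class WordCountSuite extends FunSuite {
  test("counts words in a small in-memory dataset") {
    // local[2] runs Spark inside the test JVM on two threads,
    // so no cluster is needed.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("word-count-test"))
    try {
      val counts = sc.parallelize(Seq("a b", "b c"))
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("b") == 2)
    } finally {
      sc.stop()
    }
  }
}
```

The same pattern works with JUnit; the only Spark-specific detail is creating (and stopping) a local-mode SparkContext inside the test.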
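The streaming-plus-machine-learning combination mentioned in point 3 can be sketched with Spark's StreamingKMeans, which updates a clustering model continuously as data arrives. A rough, hedged sketch (the socket host, port, and parameter choices are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingClustering {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-kmeans").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Live feature vectors arriving as space-separated numbers over a
    // socket (host and port are placeholders).
    val points = ssc.socketTextStream("localhost", 9999)
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

    // A k-means model whose cluster centers update as new data arrives.
    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)
      .setRandomCenters(2, 0.0)

    model.trainOn(points)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The point is not this particular algorithm but the unification: the streaming source, the transformation, and the learning algorithm all run on the same engine in one program.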

Any news this year?

The Spark ecosystem has two major focuses this year:

1. Even greater ease of use: Spark made distributed data processing substantially easier for engineers; now we want to make it even easier for people who don't necessarily have rigorous computer science training, e.g. data scientists and statisticians. To that end, we are building high-level APIs such as DataFrames and machine learning pipelines to further simplify distributed data processing. We want users who are already familiar with single-machine tools to be able to pick up distributed data processing as quickly as possible.
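To illustrate that direction, the DataFrame API lets code that reads like single-machine data-frame manipulation execute across a cluster. A hedged sketch (the file path is a placeholder, and the exact reader methods vary slightly across Spark versions):

```scala
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext named sc.
val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json") // placeholder path

// Familiar, single-machine-style operations, executed in a distributed way:
people.filter(people("age") > 21)
  .groupBy("age")
  .count()
  .show()
```

Someone coming from a single-machine data-frame tool can read this without knowing anything about partitions, shuffles, or cluster managers.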

2. Platform APIs: Over time, more and more projects are developing on top of Spark, and we see it as a unique runtime that can support a wide range of environments (e.g. public cloud vs. private cloud, different storage systems, database systems, NoSQL stores). We are working toward standardizing the various interfaces Spark uses to interact with external systems, so other projects can build comfortably on top of Spark.

How is the Apache Big Data ecosystem building on top of Spark and standardizing on it?

Most of the Apache Software Foundation (ASF) big data projects are now building on top of Spark. For example, as you will see in the ApacheCon agenda, Hive, Pig, and Sqoop now support using Spark as their computation engine. Many other projects also provide interfaces to integrate with Spark.

Give us a bit on how Hadoop, Hive, Pig, and Sqoop relate to Spark.

Spark can run in many different environments, but it also integrates very well with Hadoop. For instance, it can read the common data formats in HDFS, and it can run directly on top of the YARN resource manager. Hadoop users are turning to Spark to replace their legacy MapReduce data pipelines and applications. Hive can compile SQL queries into Spark jobs for execution (not to be confused with Spark SQL). Similarly, Pig compiles Pig scripts into Spark jobs. Sqoop uses Spark to connect to various relational database systems for ETL.

Fun fact?

Matei Zaharia, the creator of Spark, is known in China as the "God of Horse Metal": "horse metal" is phonetically similar to "Matei."

ApacheCon 2015

Speaker Interview

This article is part of the Speaker Interview Series for ApacheCon 2015. ApacheCon North America brings together the open source community to learn about the technologies and projects driving the future of open source and more. The conference takes place in Austin, TX from April 13-16, 2015.