A brief history of Spark APIs

2014: Matei Zaharia and his colleagues at UC Berkeley published a paper introducing a new abstraction called Resilient Distributed Datasets (RDDs). It was the start of a new data processing solution that addressed some of the pain points of MapReduce. RDDs could be cached in memory, a massive win for iterative algorithms, and had many implementation strategies that allowed for better fault tolerance, data locality, and efficiency in general.

The API for RDDs is type-safe and similar to solutions introduced earlier, like Twitter's Scalding (2012). For example, this is a transformation that, given an RDD of tweets, returns their top hashtags:
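A sketch of such a transformation (the Tweet case class is an assumption for illustration; any record type with a text field would do):

```scala
import org.apache.spark.rdd.RDD

// Assumed shape of the records; only the text field matters here.
case class Tweet(id: Long, text: String)

def topHashtags(tweets: RDD[Tweet], n: Int): Array[(String, Int)] =
  tweets
    .flatMap(_.text.split("\\s+")) // split the text into words
    .filter(_.startsWith("#"))     // keep only the hashtag words
    .map(_.toLowerCase)            // normalize hashtags
    .map((_, 1))                   // pair each hashtag with a count of one
    .reduceByKey(_ + _)            // sum the counts per hashtag
    .top(n)(Ordering.by(_._2))     // action: the n most frequent hashtags
```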

Transformations return well-typed RDDs and are defined using regular Scala functions. Even though the API is reasonably usable, it has downsides:

1. The user could make the mistake of applying reduceByKey before filtering the words ( _.startsWith("#") ), making the transformation much more expensive. The inefficiency is evident in this case, but there are many more subtle scenarios where the user could introduce inefficiencies; see the sketch after this list.

2. The execution engine can't introspect Scala functions. The flatMap transformation _.text.split("\\s+") uses only the text column from the tweet object. If the tweet information comes from a columnar format like Parquet, it would be more efficient to load only that column from storage, but Spark can't do anything other than load the full object and then call the Scala function with it, because _.text.split("\\s+") is opaque.

3. The API mixes lazy transformations that return RDD instances, like flatMap and filter, with strict actions like top that trigger the RDD execution and bring the data to memory. It's common to see beginners using actions without understanding their implications.

4. Functions can capture values from the outer scope (closures), eventually failing at runtime if a captured value is not serializable. This is an issue that the spores project tries to address.
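To make the first point concrete, here is a sketch of the misordered pipeline, reusing the Tweet class from above. The shuffle triggered by reduceByKey now covers every word in the corpus instead of just the hashtags:

```scala
// Hypothetical misuse: reduceByKey runs before the hashtag filter, so the
// shuffle and the counting cover every word, not only the hashtags.
def topHashtagsSlow(tweets: RDD[Tweet], n: Int): Array[(String, Int)] =
  tweets
    .flatMap(_.text.split("\\s+"))
    .map(_.toLowerCase)
    .map((_, 1))
    .reduceByKey(_ + _)            // expensive shuffle over all words
    .filter(_._1.startsWith("#"))  // most of the counted words are discarded
    .top(n)(Ordering.by(_._2))
```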

2015: The Spark community introduced an API called Dataframe and a new execution engine based on SQL to support it. It addressed some of the optimization limitations of RDDs by basing transformations on untyped string values that represent columns and expressions:
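A sketch of the same computation with the Dataframe API (the column and alias names are assumptions, following the Tweet shape above):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("hashtags").getOrCreate()
import spark.implicits._ // enables the 'column symbol syntax

def topHashtags(tweets: DataFrame, n: Int): DataFrame =
  tweets
    .select(explode(split('text, "\\s+")) as "word") // split into words
    .filter('word.startsWith("#"))                   // keep hashtag words
    .select(lower('word) as "hashtag")               // normalize hashtags
    .groupBy('hashtag)                               // group by hashtag
    .count()                                         // count per hashtag
    .orderBy('count.desc)                            // most frequent first
    .limit(n)                                        // top n rows
```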

With this approach, the computation is less opaque to the execution engine. For instance, if the tweets are loaded from Parquet, Spark knows that only the 'text column is used and can avoid loading the rest of the fields. Also, the user's intent is more evident since the transformation uses first-class operations like group by and the count aggregation.

2016: The Spark community made a new attempt at making the API more type-safe while keeping some of the benefits of Dataframe. The result was the Dataset API, which mixes RDD-like operations with Dataframe-like operations:
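Again a sketch, reusing the Tweet class and the spark.implicits._ import from the sketches above; the comments track how the pipeline switches between the typed and untyped halves of the API:

```scala
import org.apache.spark.sql.Dataset

def topHashtags(tweets: Dataset[Tweet], n: Int): Array[(String, Long)] =
  tweets
    .select('text.as[String])  // select the text column: Dataset[String]
    .flatMap(_.split("\\s+"))  // opaque Scala function: Dataset[String]
    .filter(_.startsWith("#")) // opaque Scala function: Dataset[String]
    .map(_.toLowerCase)        // opaque Scala function: Dataset[String]
    .groupBy('value)           // untyped: RelationalGroupedDataset
    .count()                   // untyped: Dataframe
    .orderBy('count.desc)      // untyped: Dataframe
    .as[(String, Long)]        // back to typed: Dataset[(String, Long)]
    .take(n)                   // action: the top n hashtags
```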

The API switches between Dataset and Dataframe depending on how each operation is expressed. In the current Spark version, Dataframe is just an untyped Dataset (an alias for Dataset[Row]). This approach still has some problems:

1. Some of the transformations (flatMap, filter, map) use opaque Scala functions, so they don't enable more advanced optimizations.

2. To a lesser degree, Dataset brings back the possibility of inefficient usage of the API. For instance, if the user forgets to select the text column at the beginning, Spark has to load the entire tweet object to apply the flatMap(_.text.split("\\s+")) transformation.

3. The untyped operations make the code unsafe, prone to runtime type errors, and harder to understand.