Apache Spark DataFrames have existed, in one form or another, for over three years. They give Spark much more insight into the types of data it's working with, and as a result allow for significantly better optimizations than the original RDD APIs. Furthermore, the RDD API, while not deprecated, is no longer receiving feature updates, and users are encouraged to migrate to DataFrames/DataSets.

In this post we'll look in more detail at how DataFrames compare to RDDs in terms of features and performance, and see if the claims about them hold up.

What are RDDs

Resilient Distributed Datasets (RDDs) are the core abstraction behind Spark and come from the original Spark paper. RDDs are an immutable, fault-tolerant, distributed, memory-based abstraction over data that allows for complex in-memory computations along with optional disk caching. Within the constraints of Spark they allow for the most flexible data manipulation and storage paradigms. This power comes at a cost: since the stored data is arbitrary and Spark has no usable knowledge of its format, the automatic optimizations it can apply are limited. Likewise, the Python implementation can only use the JVM to store opaque binary blobs, which means all computations must be done on the slower Python side (and require serialization/deserialization to and from Python).
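To make that serialization cost concrete, here's a minimal pure-Python sketch (not actual PySpark internals; the function name and data are hypothetical) of the pickle round-trip every element takes when an RDD transformation crosses the JVM/Python boundary:

```python
import pickle

# Hypothetical sketch: each element arrives as an opaque blob from the JVM,
# gets deserialized, processed by the user's Python lambda, and serialized back.
def python_side_map(serialized_elements, fn):
    out = []
    for blob in serialized_elements:
        element = pickle.loads(blob)       # deserialize from the opaque blob
        result = fn(element)               # run the user's Python function
        out.append(pickle.dumps(result))   # serialize back for the JVM
    return out

# Simulate one partition of (pagename, views) records.
partition = [pickle.dumps(("Main_Page", 120)), pickle.dumps(("Spark", 45))]
mapped = python_side_map(partition, lambda r: (r[0][0:10], r[1]))
print([pickle.loads(b) for b in mapped])
```

This per-element round-trip is why Python RDD code can't benefit from JVM-side execution: Spark never sees inside the blobs.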

rdd = sc.textFile("/data/pagecounts-rrd") \
    .map(lambda s: s.split(" ")) \
    .map(lambda s: (s[1], int(s[2]))) \
    .keyBy(lambda r: r[0][0:10]) \
    .mapValues(lambda r: r[1])

grouped = rdd.reduceByKey(lambda a, b: a + b)

grouped.join(rdd) \
    .filter(lambda r: r[1][0] != r[1][1]) \
    .count()

What are DataFrames

DataFrames are a newer abstraction of data within Spark, and a structured one (akin to SQL tables). Unlike RDDs they are stored in memory in a column-based fashion, which allows for various optimizations (vectorization, columnar compression, off-heap storage, etc.). Their schema is fairly robust, allowing for arbitrary nested data structures (e.g. a column can be a list of structs which are themselves composed of other structs). Because Spark is aware of the structure of the underlying data, non-JVM languages (Python, R) can leverage functions written in the faster native JVM implementation. One issue with DataFrames, however, is that they are only runtime type safe, not compile-time type safe, which for a language like Scala is a severe drawback.
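A minimal illustration (plain Python, not Spark internals; the records are made up) of the row-based versus column-based layouts: storing the same records column-by-column keeps each column's values contiguous, which is what makes vectorization and columnar compression possible.

```python
# Row-based layout (RDD-style): each record is stored whole.
rows = [("Main_Page", 120), ("Spark", 45), ("Scala", 30)]

# Columnar layout (DataFrame-style): one array per column.
columns = {
    "pagename": [r[0] for r in rows],
    "pageviews": [r[1] for r in rows],
}

# Aggregating one column touches only that column's contiguous values,
# instead of walking every record and picking out one field.
total_views = sum(columns["pageviews"])
print(total_views)  # 195
```

Spark's actual columnar format is far more sophisticated (off-heap, compressed, schema-aware), but the access pattern is the same idea.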

from pyspark.sql.functions import col, substring

df = spark.read.load("/data/pagecounts-parquet") \
    .withColumn("pagename", substring(col("pagename"), 0, 10))

grouped = df \
    .groupBy("pagename") \
    .sum("pageviews")

grouped.join(df, "pagename") \
    .filter(col("sum(pageviews)") != col("pageviews")) \
    .count()

DataSets

DataSets are an even newer abstraction and can be thought of as typed DataFrames. In fact, in Scala a DataFrame is simply a DataSet of Row elements. They exist only in Java and Scala, and provide a compile-time type-safe version of DataFrames. I mention them for completeness' sake but will otherwise not delve into them.

Performance

To test some of the performance differences between RDDs and DataFrames I'm going to use a cluster of two m4.large nodes running the 4.0 Runtime (Spark 2.3) on Databricks Cloud. The reference dataset will be Wikipedia page views, and I'll be doing a mix of aggregations, joins, and UDF operations on it. I'm going to compare the results across different ways of running code on both RDDs and DataFrames. You can see the set of operations being performed below:

df = spark.read.load("/data/pagecounts-parquet") \
    .withColumn("pagename", substring(col("pagename"), 0, 10))

grouped = df \
    .groupBy("pagename") \
    .sum("pageviews")

grouped.join(df, "pagename") \
    .filter(col("sum(pageviews)") != col("pageviews")) \
    .count()

Admittedly this isn't a complete test suite; however, hopefully even a simple test like this yields some insights. I'm going to test the following dimensions: