The main reason for the performance issues is the conversion from JSON and, to a lesser extent, the fact that all columns in our DataFrame will be strings. Most of the columns are actually numerical, so strings will be converted to doubles (or longs) again and again when running queries involving averages, sums, and other numeric aggregations.

If we have enough memory available, we can use df.persist(StorageLevel.MEMORY_ONLY) (for RDDs, cache() is just an alias for persist(StorageLevel.MEMORY_ONLY)) to avoid reading the files and converting from JSON for each query.
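For example, a minimal sketch in the Zeppelin Spark (Scala) interpreter, assuming our DataFrame is bound to a variable named df (the name is a placeholder):

import org.apache.spark.storage.StorageLevel

// Mark the DataFrame for in-memory caching; persist() is lazy, so nothing is read yet.
df.persist(StorageLevel.MEMORY_ONLY)

// The first action materializes the cache; subsequent queries are served from memory.
df.count()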

The first time we do the row count it actually takes longer, since there's extra work involved in caching the data in memory, but if we run the same count again it takes less than a second. Calculating the average trip distance only takes 3 seconds.
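The average trip distance query might look like the following, assuming the DataFrame has been registered as a temporary table named trips with a trip_distance column (both names are placeholders) and using a Spark 1.x-style sqlContext:

// Served from the in-memory cache; the string column is implicitly cast to a double for AVG.
sqlContext.sql("SELECT AVG(trip_distance) FROM trips").show()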

In short, caching has improved the performance dramatically at the cost of some memory. In fact, we need to configure Zeppelin so that Spark actually has enough memory available to cache the whole RDD: it needs 1670 MB, while by default we only have 400 MB (spark.driver.memory * spark.storage.memoryFraction) available for storage. We need to add the following to conf/zeppelin-env.sh to have 2.1 GB available for storage.

export ZEPPELIN_INTP_MEM="-Xms4g -Xmx4g"

Visit http://localhost:4040/executors/ after running some code block in Zeppelin to see how many executors are running (there should be one, with executor ID "driver") and how much storage memory each executor has. Then go to http://localhost:4040/storage/ to see which RDDs are cached and what fraction of each one is in memory.

You may be wondering why there's an RDD at all if we're using DataFrames. The reason is that DataFrames are built on top of RDDs, and our table (or view) is "backed" by this RDD.
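A quick way to see this connection from the Scala interpreter is to look at the RDD the DataFrame exposes:

// RDD[Row] underlying the DataFrame; toDebugString prints its lineage.
println(df.rdd.toDebugString)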

As if the configuration hassles weren't bad enough, performance can become unpredictable when memory is short, since Spark can and will evict persisted RDDs from memory if necessary. If that happens we may end up reading all the files and parsing all the JSON again. As a compromise we could use StorageLevel.MEMORY_AND_DISK, but then we would not get the same performance improvements, and all columns would still be of type "string."
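That compromise is a one-line change (again assuming the DataFrame is called df):

// Drop the earlier MEMORY_ONLY caching, then re-persist with spill-to-disk enabled.
df.unpersist()
df.persist(StorageLevel.MEMORY_AND_DISK)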

It would be better if we could use a columnar data format with types. Parquet is one such format. We will create a Parquet file from our DataFrame. We get several benefits from using Parquet:

Since Parquet is a binary format, there will be no need to parse JSON strings. We can still use caching, but if and when the data does have to be read from disk, it will be much faster.

The data can be queried more efficiently because the format is based on columns, so there's no need to look around in a JSON structure to find a field, and columns that are irrelevant to the query can be skipped entirely.

Aggregations which need to scan all values in a column can be done much more efficiently because there is no need for disk seeks and the column data is read sequentially, which is very cache friendly for CPUs.

The data will be much smaller because Parquet supports compression. Compression is very helpful since some of our data consists of strings with a lot of repetition, like the names of the two taxi companies.
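A sketch of what converting the DataFrame to Parquet might look like, assuming a Spark 1.x-style sqlContext (in Spark 2.x you would use the spark session and createOrReplaceTempView instead); the output path and table name here are placeholders:

// Write the DataFrame out as compressed, columnar Parquet files.
df.write.parquet("/tmp/taxi-trips.parquet")

// Read it back and register it, so subsequent SQL queries run against Parquet instead of JSON.
val parquetDf = sqlContext.read.parquet("/tmp/taxi-trips.parquet")
parquetDf.registerTempTable("trips_parquet")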