Spark is the core component of Teads’s Machine Learning stack. We use it for many ML applications, from ad performance predictions to user Look-alike Modeling. We also use Spark for processing intensive jobs like cross-device segment extension or Parquet to SSTables transformation for loading data into Cassandra.

Working with Spark, we regularly reach the limits of our clusters’ resources in terms of memory, disk or CPU. Scaling out only pushes back the issue, so we have to get our hands dirty.

Here is a collection of best practices and optimization tips for Spark 2.2.0 to achieve better performance and cleaner Spark code, covering:

How to leverage Tungsten,

Execution plan analysis,

Data management (caching, broadcasting),

Cloud-related optimizations (including S3).

Update 07/12/2018, see also the second part covering troubleshooting tricks and external data source management.

1- Use the power of Tungsten

It’s common sense, but the best way to improve code performance is to embrace Spark’s strengths. One of them is Tungsten.

Standard since version 1.5, Tungsten is a Spark SQL component that provides increased performance by rewriting Spark operations into bytecode at runtime. Tungsten eliminates virtual function calls and gets close to bare-metal performance by focusing on jobs’ CPU and memory efficiency.

To make the most out of Tungsten we pay attention to the following:

Use Dataset structures rather than DataFrames

To make sure our code will benefit as much as possible from Tungsten optimizations we use the default Dataset API with Scala (instead of RDD).

Dataset brings the best of both worlds with a mix of relational (DataFrame) and functional (RDD) transformations. This API is the most up to date and adds type-safety along with better error handling and far more readable unit tests.

However, it comes with a tradeoff: map and filter functions perform worse with this API. Frameless is a promising solution to tackle this limitation.
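To illustrate the mix of relational and functional styles, here is a minimal, self-contained sketch (the Event class, column names and sample values are assumptions made for this example, not code from our production stack):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative domain class; names are assumptions for this sketch
case class Event(userId: Long, amount: Double)

object DatasetExample {
  def run(): Seq[Event] = {
    val spark = SparkSession.builder.master("local[1]").appName("dataset-example").getOrCreate()
    import spark.implicits._

    // A typed Dataset: relational operators keep Tungsten's optimized
    // binary format, while map/filter add compile-time type-safety
    val events = Seq(Event(1L, 10.0), Event(2L, 25.0)).toDS()
    val result = events.filter(_.amount > 15.0).collect().toSeq

    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Note that the typed filter above takes a Scala lambda, which is exactly where the tradeoff mentioned previously shows up: unlike a Column-based filter, it is opaque to Catalyst.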

Avoid User-Defined Functions (UDFs) as much as possible

Using a UDF implies deserializing the data to process it in plain Scala, then reserializing it. UDFs can often be replaced by built-in Spark SQL functions: there are already a lot of them and new ones are regularly added.

Avoiding UDFs might not generate instant improvements, but at least it will prevent future performance issues should the code change. Also, by using built-in Spark SQL functions we cut down our testing effort as everything is performed on Spark’s side. These functions are designed by JVM experts, so UDFs are not likely to achieve better performance.

For example the following code can be replaced by the built-in coalesce function:

def currency = udf(
  (currencySub: String, currencyParent: String) ⇒
    Option(currencyParent) match {
      case Some(curr) ⇒ curr
      case _ ⇒ currencySub
    }
)
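The same transformation expressed with the built-in function looks as follows. This is a self-contained sketch: the sample data and column names are illustrative assumptions, not our production schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col}

object CoalesceExample {
  def run(): Seq[String] = {
    val spark = SparkSession.builder.master("local[1]").appName("coalesce-example").getOrCreate()
    import spark.implicits._

    // Sample data: currencyParent is null in the first row
    val df = Seq(("USD", null: String), ("EUR", "GBP"))
      .toDF("currencySub", "currencyParent")

    // coalesce returns the first non-null value, matching the UDF above,
    // without deserializing rows into Scala objects
    val result = df
      .select(coalesce(col("currencyParent"), col("currencySub")).as("currency"))
      .collect()
      .map(_.getString(0))
      .toSeq

    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```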

When there is no built-in replacement, it is still possible to implement and extend Catalyst’s (Spark’s SQL optimizer) expression class. It will play well with code generation. For more details, Chris Fregly talked about it here (see slide 56). By doing this we directly access the Tungsten format, which solves the serialization problem and boosts performance.

Avoid User-Defined Aggregate Functions (UDAFs)

A UDAF generates SortAggregate operations, which are significantly slower than HashAggregate. For example, instead of writing a UDAF that computes a median, we use a built-in equivalent (quantile 0.5):

df.stat.approxQuantile("value", Array(0.5), 0)

The approxQuantile function uses a variation of the Greenwald-Khanna algorithm. In our case, it ended up being 10 times faster than the equivalent UDAF.

Avoid UDFs or UDAFs that perform more than one thing

Software Craftsmanship principles obviously apply when writing big data code (do one thing and do it well). Splitting UDFs lets us use built-in functions for part of the resulting code. It also greatly simplifies testing.
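As a hedged illustration of this split (the country-code mapping, column names and sample data are hypothetical), normalization stays in built-in functions while only the truly custom logic remains in a small, single-purpose UDF:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, trim, udf}

object SplitUdfExample {
  def run(): Seq[String] = {
    val spark = SparkSession.builder.master("local[1]").appName("split-udf-example").getOrCreate()
    import spark.implicits._

    // Custom part only: map a normalized name to a code (hypothetical logic)
    val toCountryCode = udf((name: String) => name match {
      case "france" => "FR"
      case "spain"  => "ES"
      case other    => other
    })

    val df = Seq("  France ", "spain", "italy").toDF("rawCountry")

    // Normalization (trim, lower) uses built-in functions, so Catalyst can
    // optimize it; only the mapping goes through the UDF
    val result = df
      .select(toCountryCode(lower(trim(col("rawCountry")))).as("country"))
      .collect()
      .map(_.getString(0))
      .toSeq

    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Each piece can now be unit-tested on its own: the UDF as a plain Scala function, the normalization through the built-in functions it relies on.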

2- Look under the hood

Analysing Spark’s execution plan is an easy way to spot potential improvements. This plan is composed of stages, which are the physical units of execution in Spark. When we refactor our code, the first thing we look for is an abnormal number of stages. A suspicious plan can be one requiring 10 stages instead of 2–3 for a basic join operation between two DataFrames.

In Spark and more generally in distributed computing, sending data over the network (a.k.a. Shuffle in Spark) is the most expensive action. Shuffles are expensive since they involve disk I/O, data serialization and network I/O. They are needed for operations like Join or groupBy and happen between stages.

Considering this, reducing the number of stages is an obvious way to optimize a job. We use the .explain(true) command to show the execution plan detailing all the steps (stages) involved in a job. Here is an example:

Simple execution plan example
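A minimal sketch of how such a plan is produced for a basic join (local session and toy DataFrames are assumptions made for this example):

```scala
import org.apache.spark.sql.SparkSession

object ExplainExample {
  def run(): String = {
    val spark = SparkSession.builder.master("local[1]").appName("explain-example").getOrCreate()
    import spark.implicits._

    val orders    = Seq((1, 10.0), (2, 20.0)).toDF("customerId", "amount")
    val customers = Seq((1, "Alice"), (2, "Bob")).toDF("customerId", "name")

    val joined = orders.join(customers, Seq("customerId"))

    // explain(true) prints the parsed, analysed, optimized and physical
    // plans; Exchange operators in the physical plan mark shuffle (stage)
    // boundaries
    joined.explain(true)

    val physicalPlan = joined.queryExecution.executedPlan.toString
    spark.stop()
    physicalPlan
  }

  def main(args: Array[String]): Unit = run()
}
```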

The Directed Acyclic Graph (DAG) available in Spark UI can also be used to visualize how tasks are distributed across each stage.

A very simple DAG example — Image credits

Optimization relies a lot on our knowledge of both the data and its processing (incl. business logic). One of the limits of Spark SQL optimization with Catalyst is that it uses “mechanical” rules to optimize the execution plan (in 2.2.0).

Like many others, we were waiting for a cost-based optimization engine beyond broadcast join selection. It now seems to be available in 2.3.0; we will have to look at it.

3- Know your data and manage it efficiently

We’ve seen how to improve job performance by looking into the execution plan but there are also plenty of possible enhancements on the data side.

Highly imbalanced datasets

To quickly check if everything is ok, we review the execution duration of each task and look for heterogeneous processing times. If one task is significantly slower than the others, it will extend the overall job duration and waste the resources of the fastest executors.

It’s fairly easy to check min, max and median duration in Spark UI. Here is a balanced example: