This blog post demonstrates how to speed up your Spark test suite by running tests from the SBT console, using an efficient method to check DataFrame equality, and managing the SparkSession intelligently.

Spark code isn’t designed to perform well locally, and test suites can quickly become painfully slow. Developer productivity plummets when developers are tempted to browse Reddit while the test suite runs.

We will demonstrate how to reduce the run time of the spark-spec open source test suite from 58 seconds to 36 seconds (38% reduction in test run time). In private production projects, these tactics have reduced the test suite run time by 50%+.

The blog post also outlines SBT tactics that make testing Spark code enjoyable.

Run a single test file

Running one test file at a time is faster than running the entire suite and allows for a nice red, green, refactor workflow.

Let’s add a Talk object that appends a green_pokemon column to a DataFrame.

```scala
package com.github.mrpowers.spark.bulba

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object Talk {

  def withGreenPokemon()(df: DataFrame): DataFrame = {
    df.withColumn(
      "green_pokemon",
      lit("bulba bulba")
    )
  }

}
```

Let’s use scalatest, spark-daria, and spark-fast-tests to test the withGreenPokemon() method.

```scala
package com.github.mrpowers.spark.bulba

import org.apache.spark.sql.types.StringType
import org.scalatest.FunSpec
import com.github.mrpowers.spark.fast.tests.DataFrameComparer
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

class TalkSpec
    extends FunSpec
    with SparkSessionTestWrapper
    with DataFrameComparer {

  describe("withGreenPokemon") {

    it("appends a green_pokemon column to a DataFrame") {

      val sourceDF = spark.createDF(
        List(
          "grass",
          "flower"
        ), List(
          ("food", StringType, true)
        )
      )

      val actualDF = sourceDF.transform(Talk.withGreenPokemon())

      val expectedDF = spark.createDF(
        List(
          ("grass", "bulba bulba"),
          ("flower", "bulba bulba")
        ), List(
          ("food", StringType, true),
          ("green_pokemon", StringType, false)
        )
      )

      assertSmallDataFrameEquality(actualDF, expectedDF)

    }

  }

}
```

The entire test suite can be run with the sbt test command.

The TalkSpec test file can be run individually with the sbt "testOnly *TalkSpec" command.

We can also right-click the test and run it directly from the IntelliJ text editor.

I find it hard to read and debug the test output from the IntelliJ console and prefer to run Spark tests from the command line.

The spark-bulba project contains these code snippets if you’d like to clone the repo and run the tests on your local machine.

Use the SBT console

The sbt "testOnly *TalkSpec" command is executed in SBT’s batch mode and you’ll get better performance if you run the tests from the SBT console. sbt "testOnly *TalkSpec" fires up the SBT console, runs the test file, and then shuts down the SBT console. If we run the tests directly from the SBT console, we don’t need to wait for the SBT console to start and shut down every test run.

Run the sbt command to start the console, then run testOnly *TalkSpec at the > prompt to execute the tests from the console.

You don’t need to restart the console when you modify the code. Keep the console running as you iterate with the red, green, refactor cycle.

After a few runs, the SBT console will error out with these messages:

```
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Spark Context Cleaner"
Could not run test com.github.mrpowers.spark.bulba.TalkSpec: java.lang.OutOfMemoryError: Metaspace
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Metaspace
```

We need to update the build.sbt file to prevent the java.lang.OutOfMemoryError exception.

Configure the memory settings in your build.sbt file

You can increase the initial memory allocation pool and the maximum memory allocation pool for a Java Virtual Machine to prevent the java.lang.OutOfMemoryError exception.

Add the following two lines to your build.sbt file:

```scala
fork in Test := true
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:+CMSClassUnloadingEnabled")
```

The -Xms flag sets the initial memory allocation pool to 512MB and the -Xmx flag sets the maximum memory allocation pool to 2,048MB. You can tweak these settings based on your machine’s RAM.
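To double-check that the forked test JVM actually picked up these flags, you can inspect Runtime.maxMemory. The MemoryCheck helper below is a hypothetical sketch (not part of the spark-bulba repo); with -Xmx2048M the printed value should be close to 2,048MB.

```scala
// Hypothetical helper to confirm the forked test JVM received the -Xmx flag.
// Runtime.maxMemory reports the maximum heap in bytes (roughly the -Xmx value).
object MemoryCheck {

  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"Max heap available to this JVM: ${maxHeapMb}MB")

}
```

If the printed value doesn’t reflect your build.sbt settings, double-check that fork in Test is set to true; without forking, the tests run inside SBT’s own JVM and ignore javaOptions.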

Use the same SparkSession across test files

Starting and stopping the SparkSession is expensive, so it’s better to only create the SparkSession once and let it expire when the test suite finishes running.

Let’s create a SparkSessionTestWrapper trait that defines the SparkSession.

```scala
package com.github.mrpowers.spark.bulba

import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  lazy val spark: SparkSession = {
    SparkSession
      .builder()
      .master("local")
      .appName("spark bulba")
      .getOrCreate()
  }

}
```

Notice that the TalkSpec class we defined earlier was extended with the SparkSessionTestWrapper trait.

```scala
class TalkSpec
    extends FunSpec
    with SparkSessionTestWrapper
    with DataFrameComparer
```

The SparkSessionTestWrapper trait is what gives our test suite access to the spark variable for creating DataFrames.
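Any other spec can mix in the same trait and share the session. Because spark is a lazy val, the first spec that touches it pays the startup cost and later specs reuse the running session. A hypothetical AnotherSpec (not in the spark-bulba repo) might look like this:

```scala
package com.github.mrpowers.spark.bulba

import org.scalatest.FunSpec

class AnotherSpec extends FunSpec with SparkSessionTestWrapper {

  it("reuses the SparkSession defined in the trait") {
    // the implicits tied to the shared session give us toDF
    import spark.implicits._
    val df = List("bulba").toDF("word")
    assert(df.count() === 1)
  }

}
```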

Configure the Spark Shuffle Partitions

As qnob pointed out in this pull request, changing the number of shuffle partitions can greatly reduce the test suite runtime.

I’ve reduced test suite runtimes by 30%, 66% and 70% using this technique.

The SparkSession can be defined as follows:

```scala
package com.github.mrpowers.spark.bulba

import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  lazy val spark: SparkSession = {
    SparkSession
      .builder()
      .master("local")
      .appName("spark session")
      .config("spark.sql.shuffle.partitions", "1")
      .getOrCreate()
  }

}
```

Notice how we’re setting the number of Spark shuffle partitions to one for the test suite. Most test DataFrames are tiny and don’t need the 200 shuffle partitions that Spark provides by default.

Make sure to only use this SparkSession when running your test suite. You’ll almost certainly want to use at least the default number of partitions when running the code on production size datasets.
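One way to enforce that separation (a sketch of a common pattern, not code from the spark-bulba repo) is to build the production session in its own object, without master() or the shuffle-partition override, so that spark-submit and cluster configuration stay in control:

```scala
import org.apache.spark.sql.SparkSession

// Production session builder: no master() and no shuffle-partition override,
// so the deployment environment (e.g. spark-submit flags) decides both.
object ProductionSession {

  lazy val spark: SparkSession = {
    SparkSession
      .builder()
      .appName("my production app") // hypothetical app name
      .getOrCreate()
  }

}
```

The test-only settings then live exclusively in SparkSessionTestWrapper, which only test code mixes in.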

Use assertSmallDataFrameEquality when possible

The spark-fast-tests library defines an assertLargeDataFrameEquality method that can be used to compare large DataFrames that are spread across nodes in a cluster. This code is copied from the assertDataFrameEquals method defined in spark-testing-base.

assertSmallDataFrameEquality is faster and should be used to compare DataFrames that are small enough to be collected on a single machine. I use assertSmallDataFrameEquality exclusively in all of my projects. I don’t write any tests with DataFrames that are too large for my local machine; those would be way too slow anyway!
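To get a feel for why the small variant is faster, here is a rough sketch of a collect-based comparison (a simplified illustration of the approach, not the actual spark-fast-tests implementation): compare the schemas first, then pull both DataFrames onto the driver and compare the rows in local memory, skipping the distributed machinery a large comparison needs.

```scala
import org.apache.spark.sql.DataFrame

// Simplified sketch of a collect-based DataFrame equality check.
// Not the real spark-fast-tests code; detailed error reporting is omitted.
object SmallComparer {

  def assertSmallEquality(actual: DataFrame, expected: DataFrame): Unit = {
    // Schema mismatches fail fast, before any Spark job runs
    require(actual.schema == expected.schema, "schemas do not match")
    // Both DataFrames are collected to the driver and compared in memory
    require(actual.collect().sameElements(expected.collect()), "rows do not match")
  }

}
```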

Combine DataFrame tests when possible

Let’s revisit our TalkSpec file and add a spec to make sure that the code works when the input value is null.

describe("withGreenPokemon") {



it("appends a green_pokemon column to a DataFrame") {



val sourceDF = spark.createDF(

List(

"grass",

"flower"

), List(

("food", StringType, true)

)

)



val actualDF = sourceDF.transform(Talk.withGreenPokemon())



val expectedDF = spark.createDF(

List(

("grass", "bulba bulba"),

("flower", "bulba bulba")

), List(

("food", StringType, true),

("green_pokemon", StringType, false)

)

)



assertSmallDataFrameEquality(actualDF, expectedDF)



}



it("works for null values") {



val sourceDF = spark.createDF(

List(

null

), List(

("food", StringType, true)

)

)



val actualDF = sourceDF.transform(Talk.withGreenPokemon())



val expectedDF = spark.createDF(

List(

(null, "bulba bulba")

), List(

("food", StringType, true),

("green_pokemon", StringType, false)

)

)



assertSmallDataFrameEquality(actualDF, expectedDF)



}



}

We’re invoking the Talk.withGreenPokemon() method in two different specs, which slows down the test suite. If the Talk.withGreenPokemon() method takes 5 seconds to run, we’re adding 10 seconds to the test suite run time instead of 5. Let’s refactor this test so the code only runs once.

describe("withGreenPokemon") {



it("appends a green_pokemon column to a DataFrame") {



val sourceDF = spark.createDF(

List(

"grass",

"flower",

// it works with the null case

null

), List(

("food", StringType, true)

)

)



val actualDF = sourceDF.transform(Talk.withGreenPokemon())



val expectedDF = spark.createDF(

List(

("grass", "bulba bulba"),

("flower", "bulba bulba"),

(null, "bulba bulba")

), List(

("food", StringType, true),

("green_pokemon", StringType, false)

)

)



assertSmallDataFrameEquality(actualDF, expectedDF)



}



}

Unit testing frameworks typically encourage splitting tests into separate it blocks to improve the readability of the test suite. We don’t have the luxury of splitting up Spark tests because they run slowly.

Notice that a comment was added to the sourceDF to clarify the intention of including null in the test.

Display run times of individual tests

Let’s update the build.sbt file to print out the run time of each test.

```scala
testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD")
```

Our tests will now display their run times.

```
[info] TalkSpec:
[info] withGreenPokemon
[info] - appends a green_pokemon column to a DataFrame (6 seconds, 555 milliseconds)
```

A slow test may be a symptom of inefficient code. You might need to refactor your code to speed up the test, which will of course make your production code run faster too!

If you’re invoking the withGreenPokemon() method in two or more places, the test run time will help quantify how much the duplicate test is slowing you down.

Suppress the log output

Add the following test/resources/log4j.properties file to suppress noisy Spark output when the test suite is run.

```
# Set everything to be logged to the console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN
```

Benchmarking a single test file

We ran the spark-spec FunctionsSpec file in SBT batch mode with the assertLargeDatasetEquality method as a baseline. The file has 88 tests and takes 35 seconds to run on average (the three test runs took 33s, 34s, and 37s).

When the code is refactored to use the assertSmallDatasetEquality method, the test suite takes 24 seconds to run (the three test runs took 22s, 27s, 22s). assertSmallDatasetEquality cuts the test suite run time by 31%.

When the tests are run from the SBT console instead of SBT’s batch mode (with assertSmallDatasetEquality), the test suite takes 20 seconds to run (the three test runs took 20s, 20s, and 20s). We cut the test run time by another 17%.

Overall, we’ve taken a test file that took 35 seconds to run down to 20 seconds for a 43% time reduction.

Benchmarking an entire test suite

Let’s see if our performance gains carry through to an entire test suite run.

The slow_test_suite branch of spark-spec uses assertLargeDatasetEquality and takes 58 seconds to run in SBT batch mode (three test runs took 60s, 57s, and 59s).

The fast_test_suite branch of spark-spec uses assertSmallDatasetEquality and takes 36 seconds to run from the SBT console (three test runs took 36s, 35s, and 36s).

The fast_test_suite branch runs 38% faster.

Conclusion

Spark runs slowly locally, the Spark testing ecosystem is still immature, and developing a smooth testing workflow is difficult.

You’ll need to follow the best practices outlined in this post and continually fight to keep your test suite running fast.

If a test suite gets unbearably slow, you’ll either need to split your code into multiple SBT projects or pick up a new hobby while you wait for your test suite to finish running. Maybe you can pick up drawing?