TL;DR: Run Spark jobs from the Clojure REPL or via spark-submit using Flambo

At FORMCEPT, we have been early adopters of Spark since its Shark days in 2013. We also adopted Clojure across all our back-end services around the same time. To work with Apache Spark, we rely on the Flambo library, a Clojure DSL for Apache Spark, to which we have contributed a couple of wrapper functions as well.

Environment

Spark refers to Spark-2.4.4 (Aug 2019)

Flambo refers to 0.8.3-SNAPSHOT-9c61467

Clojure refers to Clojure-1.9.0

Read this blog if you are facing one or more of the following issues:

Unable to run a Spark Application with Clojure-Flambo

Unable to run a Spark Application from REPL or IDE

Flambo AOT ClassNotFoundException: flambo.function.Function

java.lang.IllegalStateException: unread block data with Kryo

Cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

Exception while getting task result com.esotericsoftware.kryo.KryoException: Buffer underflow

Get AOT-ized First

The foremost issue reported in the initial days of Flambo was around the usage of flambo.api, which requires AOT compilation of every namespace that refers to it. Although this was documented in the README, it continued to haunt the project maintainers.

Let’s get this right one more time.

Use Flambo

Create a new project using lein and add Flambo dependency to your project.clj.

You will also need to include the Spark dependency for Flambo, or you will see a ClassNotFoundException such as java.lang.ClassNotFoundException: org.apache.spark.SparkConf. Next, require the Flambo namespace within your project’s spark.core namespace.
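A minimal setup might look like the following sketch. The project name is hypothetical, the Flambo coordinate shown is the Clojars release rather than the snapshot build listed under Environment, and Spark sits in the :provided profile so that spark-submit can supply it at runtime:

```clojure
;; project.clj (sketch; project name and Flambo version are assumptions)
(defproject spark-demo "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [yieldbot/flambo "0.8.2"]]
  ;; Spark is provided by the cluster, so keep it out of the uberjar
  :profiles {:provided
             {:dependencies
              [[org.apache.spark/spark-core_2.11 "2.4.4"]]}})
```

```clojure
;; src/spark/core.clj
(ns spark.core
  (:require [flambo.api :as f]
            [flambo.conf :as conf]))
```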

ClassNotFoundException without AOT

Now, if you try to use the spark.core namespace within a REPL session, you will get a ClassNotFoundException, because AOT compilation has not been enabled for the namespace yet.

To resolve this issue, enable AOT for your spark.core namespace in project.clj.
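Enabling AOT can be as simple as adding the namespace to :aot in project.clj (a sketch; the project name is an assumption):

```clojure
(defproject spark-demo "0.1.0-SNAPSHOT"
  ;; :dependencies and :profiles elided
  ;; AOT-compile spark.core so the classes generated by flambo.api's
  ;; macros (e.g. flambo.function.Function subclasses) exist on disk
  :aot [spark.core]
  :main spark.core)
```

After adding this, run lein compile (or restart the REPL) so the compiled classes are picked up.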

Now, you should be able to use your namespace.

Congratulations! You have just successfully configured your Flambo environment for Spark.

Setup Spark

You can use any existing Spark instance to try out the steps mentioned in this blog. We will refer to the Spark Standalone Cluster for the current examples. Please refer to Spark Documentation to set up a cluster. Keep the Spark Master URI ready for your reference. Also, make sure you can access the Spark Master web UI.

For the rest of the examples in this blog, the Spark Master URI that we will refer to is spark://172.17.0.1:7077

Spark Application

Let’s take the evergreen example of ‘word count’. For the sake of simplicity, we will put the data to be used for word count within our source code.
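A minimal word count might look like the following sketch. The sample sentences and application name are illustrative; the master URI is the one from this post:

```clojure
(ns spark.core
  (:require [clojure.string :as str]
            [flambo.api :as f]
            [flambo.conf :as conf]
            [flambo.tuple :as ft])
  (:gen-class))

;; Inline sample data, per the note above (illustrative content)
(def data
  ["FORMCEPT adopted Clojure and Spark early"
   "run Spark jobs from Clojure with Flambo"
   "the evergreen word count example"])

(defn -main [& _args]
  (let [c  (-> (conf/spark-conf)
               (conf/master "spark://172.17.0.1:7077")
               (conf/app-name "flambo-word-count"))
        sc (f/spark-context c)]
    (->> (-> (f/parallelize sc data)
             ;; split each line into words
             (f/flat-map (f/fn [line] (str/split line #"\s+")))
             ;; pair each word with a count of 1
             (f/map-to-pair (f/fn [word] (ft/tuple word 1)))
             ;; sum the counts per word
             (f/reduce-by-key (f/fn [a b] (+ a b)))
             f/collect)
         (run! println))))
```

Note that f/fn is Flambo’s serializable function macro; plain Clojure fns cannot be shipped to executors.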

Package and Run

To run the word count application, first, package it using lein uberjar.

Next, run it using Spark’s spark-submit utility.
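The two steps together look like this (the class and JAR names below are assumptions for illustration):

```shell
# Build the standalone JAR with Leiningen
lein uberjar

# Submit it to the standalone master
spark-submit \
  --master spark://172.17.0.1:7077 \
  --class spark.core \
  target/spark-demo-0.1.0-SNAPSHOT-standalone.jar
```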

Spark Application UI

You should be able to see the application submitted to Spark in Spark Master UI in the RUNNING state while it is computing the word count.

You should also be able to see the Jobs and Tasks running under application UI at http://172.17.0.1:4040/jobs/ during the lifetime of the application.

Also, please take note of the Environment tab for the application in the same application UI:

As you can see, the spark.kryo.registrator is set to flambo.kryo.BaseFlamboRegistrator and the spark.jars also refers to the uberjar that we created earlier. Under the Classpath Entries of the application on the Environment tab, you will also see the same uberjar added by the user.

Congratulations! You have successfully run a Word Count application in Spark using Clojure. (You are now a Spark Word Count graduate!)

Using REPL

Now, let’s try to run the same application using Clojure REPL.
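From a lein repl session, the attempt might look like this sketch (sample data and app name are illustrative):

```clojure
(require '[clojure.string :as str]
         '[flambo.api :as f]
         '[flambo.conf :as conf]
         '[flambo.tuple :as ft])

(def sc
  (f/spark-context
    (-> (conf/spark-conf)
        (conf/master "spark://172.17.0.1:7077")
        (conf/app-name "flambo-repl"))))

;; The same word-count pipeline, submitted straight from the REPL;
;; this is where the job blows up with `unread block data`
(-> (f/parallelize sc ["to be or not to be"])
    (f/flat-map (f/fn [line] (str/split line #"\s+")))
    (f/map-to-pair (f/fn [word] (ft/tuple word 1)))
    (f/reduce-by-key (f/fn [a b] (+ a b)))
    f/collect)
```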

In this case, the entire job fails due to java.lang.IllegalStateException: unread block data. As you can see, JavaDeserializationStream has failed to read the data, and it looks like Spark is no longer using the Kryo serializer as intended by Flambo.

Note

Spark uses Java serialization by default and Kryo library (version 4) for Kryo serialization. For more details, see Spark Tuning — Data Serialization.

The Curious Case of ‘unread block data’

The exception java.lang.IllegalStateException: unread block data is a bit misleading if you look only at the driver logs emitted on the REPL. To find the root cause of the exception, open the application UI and take a look at the failed stages of the job.

Click on collect at NativeMethodAccessorImpl.java:0 to open all the stages of the job.

Click on the failed stage mapToPair at NativeMethodAccessorImpl.java:0 to view the failed tasks.

Now, click on stderr to see the Executor logs.

Here, you can see the full stack trace. The root cause is Caused by: java.lang.ClassNotFoundException: flambo.kryo.BaseFlamboRegistrator, which prevents Spark from registering the required classes for Kryo, as reported by org.apache.spark.SparkException: Failed to register classes with Kryo. Further, if you check the Environment for the application, you will see that the spark.jars key is missing from the Spark Properties.

To resolve this, you can either add the JAR for the missing class or include all the dependencies referred to by the system class loader as shown below:
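The second option can be sketched as follows. This assumes Java 8, where the system classloader is a URLClassLoader; on Java 9+ you would parse the java.class.path system property instead:

```clojure
(require '[clojure.string :as str]
         '[flambo.conf :as conf])

;; Collect every JAR URL on the JVM classpath
(defn classpath-jars []
  (->> (.getURLs ^java.net.URLClassLoader
                 (ClassLoader/getSystemClassLoader))
       (map #(.getPath ^java.net.URL %))
       (filter #(str/ends-with? % ".jar"))))

(def c
  (-> (conf/spark-conf)
      (conf/master "spark://172.17.0.1:7077")
      (conf/app-name "flambo-repl")
      ;; ship every classpath JAR to the executors
      (conf/set "spark.jars" (str/join "," (classpath-jars)))))
```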

In REPL mode, if you add all the JARs by referring to the system classloader, it will include Spark, Scala and also all other libraries (including the provided profile dependencies) within spark.jars which may cause conflict with the existing Spark libraries. This might cause exceptions like:

ClassCastException cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD java.io.ObjectStreamClass$FieldReflector.setObjFieldValues (ObjectStreamClass.java:2133)

The best way to solve this problem is to explicitly include the uberjar under spark.jars, since it contains only the dependencies required by the project and excludes all the provided dependencies of Spark. Initialize your Spark session as shown below:
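For example (a sketch; the JAR path is an assumption based on a hypothetical project name):

```clojure
(require '[flambo.api :as f]
         '[flambo.conf :as conf])

;; Ship only the project uberjar to the executors
(def c
  (-> (conf/spark-conf)
      (conf/master "spark://172.17.0.1:7077")
      (conf/app-name "flambo-repl")
      (conf/set "spark.jars"
                "target/spark-demo-0.1.0-SNAPSHOT-standalone.jar")))

(def sc (f/spark-context c))
```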

Now, rebuild the uberjar and try to run a Spark job on REPL.

In this case, the uberjar gets added under Environment — Spark Properties correctly, but the job fails with an exception related to Kryo.

The ERROR log shows that Spark is using org.apache.spark.serializer.KryoSerializerInstance as intended. This was not the case in our earlier run with the REPL. If you look at the dependency tree, you will find a couple of dependencies that conflict with the Kryo version (v4.0.2) that is required by Spark at runtime.

To fix this issue, exclude Kryo and Chill libraries from Flambo dependency in the project.clj as shown below.
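The exclusions might look like the following sketch; the exact conflicting coordinates depend on your dependency tree, so run lein deps :tree to confirm them in your build:

```clojure
;; project.clj (Flambo version and exclusion coordinates are assumptions)
:dependencies [[org.clojure/clojure "1.9.0"]
               [yieldbot/flambo "0.8.2"
                :exclusions [com.esotericsoftware/kryo
                             com.esotericsoftware.kryo/kryo
                             com.twitter/chill_2.11]]]
```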

Rebuild the uberjar and run the Spark job on the REPL now.

Voila! This time it runs as expected! Moreover, you retain the Spark Context on the REPL and you can submit as many jobs as you want — all within the same running Spark application.

To stop the application, exit the REPL session. Since you can now run a Spark job from a REPL session, you can also debug your code using IDEs and code editors like Emacs.

If you want to experiment with this project, feel free to fork formcept/whiteboard on GitHub. The project is available under formcept/whiteboard/clojure/spark directory.

Sound interesting? Apache Spark is one of the many cutting-edge technologies that we extensively use at FORMCEPT. To know more about the technologies underlying our flagship product MECBot, please take a look at our product architecture here: https://www.mecbot.ai/platform/

If you would like to solve some of the most challenging data problems with our team, apply here: https://angel.co/company/formcept/jobs