By default, a Spark job loses data when it blows up. That isn't a bad choice per se, but on my current project we need higher reliability. In this article I'll talk about at-least-once delivery with the Spark write ahead log. The article focuses on Kafka, but the approach also applies to other Receivers.

Message delivery and idempotency

A Spark receiver consumes Kafka messages. It sounds simple, but there are some choices to make here. There are three types of message delivery (source):

At most once: Each record will be either processed once or not processed at all.

At least once: Each record will be processed one or more times. This is stronger than at-most-once because it ensures that no data will be lost, but there may be duplicates.

Exactly once: Each record will be processed exactly once – no data will be lost and no data will be processed multiple times. […]

For this article I assume the messages are idempotent, “messages can be applied multiple times without changing the result beyond the initial application.” This assumption holds for the project I’m currently working on and it’s a nice thing to have since it makes things a lot easier.
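To make the idea concrete, here's a minimal sketch (plain Java, no Spark; the class and key names are illustrative) of an idempotent operation: a keyed upsert, where replaying a duplicate message leaves the store unchanged.

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotencyExample {
    // A keyed upsert is idempotent: applying the same message twice
    // leaves the store in the same state as applying it once.
    public static void apply(Map<String, String> store, String key, String value) {
        store.put(key, value);
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        apply(store, "user-42", "active");
        apply(store, "user-42", "active"); // duplicate delivery, no harm done
        System.out.println(store.size()); // 1
    }
}
```

Contrast this with a non-idempotent operation like incrementing a counter per message, where a duplicate delivery would corrupt the result.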

The messages on our project are important and shouldn't get lost, so at-most-once delivery won't work.

Exactly-once might be a choice, but it makes things needlessly complex, since you never know where in a micro-batch something went wrong (short of committing every single message).

Reliability and Spark

A Spark system is divided into a Driver and Executors. When an Executor fails there's nothing to worry about: it will be restarted and the data it was processing will be digested again. Note that if you chose exactly-once delivery you're now in trouble: how do you know which data was already processed and/or delivered to the next system (i.e. your database)?

When the Driver fails you're in more trouble. When the Driver fails, the Executors fail too and won't restart, and with Kafka you have no guarantee your cursor is in the right place. A solution to this problem is checkpointing: periodically write the data to reliable storage (HDFS/S3/disk) and acknowledge it to Kafka only after the write successfully completes. When things fail you'll get some duplicate messages, but nothing is lost.
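A common pattern for surviving Driver failure is to build the streaming context through getOrCreate, so a restarted Driver recovers from the checkpoint instead of starting from scratch. A sketch, assuming Spark Streaming's Java API (the app name, path, and batch interval are illustrative):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class RecoverableJob {
    private static final String CHECKPOINT_DIR = "/user/hdfs/wal/my-app"; // illustrative path

    // Factory used only when no checkpoint exists yet
    private static JavaStreamingContext createContext() {
        SparkConf conf = new SparkConf().setAppName("my-app");
        conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));
        ssc.checkpoint(CHECKPOINT_DIR);
        // ... set up the Kafka stream and processing here ...
        return ssc;
    }

    public static void main(String[] args) {
        // On restart this rebuilds the context from the checkpoint data,
        // so the job resumes where it left off instead of starting fresh.
        JavaStreamingContext ssc =
            JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, RecoverableJob::createContext);
        ssc.start();
        ssc.awaitTermination();
    }
}
```

The important part is that the setup code lives in the factory: it only runs when there is no checkpoint to recover from.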

The first thing you have to do is enable the write ahead log (this is where your checkpointing data will be stored). The parameter is called spark.streaming.receiver.writeAheadLog.enable and you set it with:

sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")

When you enable this and start your job (which you shouldn't do yet), you'll get the following error:

“Cannot enable receiver write-ahead log without checkpoint directory set. Please use streamingContext.checkpoint() to set the checkpoint directory.”

So let's do that (I love usable error messages):

streamingContext.checkpoint("/user/hdfs/wal/" + streamingContext.sparkContext().appName());

When you don't specify a protocol, HDFS will be used on YARN, or a local file in local mode. To store on S3, check out the sources section on how to do that. Note that storing to local disk is not really a solution for a cluster, because you never know on which node a new Executor will appear; you can only use this for local testing.

Subdirectories are created automatically; just make sure you have write permissions on the directory you're writing to. I chose the application name as the name of the subdirectory, since we have a lot of jobs and they all deserve their own directory.

Now everything works, but you'll probably get the following warning:

“User defined storage level StorageLevel(true, true, false, false, 2) is changed to effective storage level StorageLevel(true, true, false, false, 1) when write ahead log is enabled”

This is because you used StorageLevel.MEMORY_AND_DISK_SER_2 (or another level ending in _2). This StorageLevel uses a replication factor of 2. Since we're already storing the checkpoints on a reliable system (i.e. HDFS) this is redundant; we don't need the replication. Changing the StorageLevel to MEMORY_AND_DISK_SER is good enough.

One final thing to keep in mind is that you need a reliable Receiver. When you use the Spark KafkaUtils class to read from Kafka, there's nothing to worry about.
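Putting the last two points together, creating the stream could look like this sketch (assuming the receiver-based KafkaUtils API; the topic, ZooKeeper quorum, and group id are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class KafkaReceiverSetup {
    static JavaPairReceiverInputDStream<String, String> createStream(JavaStreamingContext ssc) {
        Map<String, Integer> topics = new HashMap<>();
        topics.put("my-topic", 1); // illustrative topic name, 1 receiver thread

        // KafkaUtils gives you a reliable Receiver; with the WAL enabled,
        // single replication (no _2 suffix) is enough because HDFS is durable.
        return KafkaUtils.createStream(
            ssc,
            "zookeeper-host:2181",  // illustrative ZooKeeper quorum
            "my-consumer-group",
            topics,
            StorageLevel.MEMORY_AND_DISK_SER());
    }
}
```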

Testing

Of course this all sounds nice, but the final step is to see things fail and make sure they recover. If you haven't seen an application or unit test fail, you haven't tried hard enough to make it reliable.

Just do a kill -9, or let your code throw random Exceptions, and restart your jobs. When you do that, verify that messages are delivered more than once. When you break things you should see duplicate messages (this is a good thing, remember?).

When you're running Spark locally (i.e. from your IDE) you should be aware that there are no retries on task failure. You can enable them by adding a number to the spark.master parameter:

public class HarrieStreamingJobApplication {
    public static void main(String[] args) {
        System.setProperty("spark.master", "local[2,3]");
        System.setProperty("hdfs.wal.path", "/tmp/spark"); // override since there is no HDFS locally
        HarrieStreamingJob.main(new String[]{});
    }
}

The added ,3 means there are 3 retries on task failure (the first number, 2, is the number of cores used). A failed task will now result in a retry instead of a failed Executor.

Note that on a cluster the parameter is spark.task.maxFailures, with a default value of 4, so it may seem a bit strange that the default values for local and cluster mode differ.
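If you want the same retry behaviour everywhere, you can also set the cluster parameter explicitly instead of relying on the defaults (a one-line config sketch; 4 simply matches the cluster default):

```java
sparkConf.set("spark.task.maxFailures", "4");
```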

Conclusion

When messages can't be allowed to get lost in your system, you might consider making them idempotent; it makes life a lot easier. There is a small performance penalty for the write ahead log, but for us it's a small price to pay for reliability. If performance degrades too much, you can consider adding Receivers to ease the pain.

Always test your assumptions, since things can get complex and you want to be absolutely sure everything works as planned.

Sources