Tuning Clojure - Experiences from a 45x performance improvement

This is a blog about the development of Yeller, the Exception Tracker with Answers. Read more about Yeller here

Recently, I added an automated diagnosis feature to Yeller. This feature has various pieces, many of which don’t impact performance. However, there was one place where adding this feature impacted performance quite significantly (it took a piece of code down to 2% of its original throughput). The tuning process was interesting and taught me some new things, so I thought I’d write them up.

What code was impacted

Yeller uses Apache Kafka as a queueing/buffering mechanism. During periods of high throughput, it reads large batches (around 50MB) of exception data from Kafka and then “collapses” that into a set of writes to send to Riak (the backing data store). This collapsing code, therefore, has a big impact on the overall throughput of the system.

Automated Diagnosis has to run during collapsing, because at that time Yeller rolls up duplicated exceptions (if it sees enough of them). Collapsing is the only phase in the entire system that sees every single exception - everything downstream sees aggregation. Automated Diagnosis has to run on every exception report to give good results, or it might lie about something happening on every occurrence of an exception (which is the point of the feature).

Collapsing already has two performance benchmarks (which run together): a criterium benchmark, and a throughput test. Both of these tests check collapsing 25000 exceptions at once (which is roughly the maximum batch size), and both throughput and latency matter (though for this code, which is a pure function, I expect those two to be closely related).
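For reference, a criterium latency benchmark has roughly this shape. This is a minimal sketch, not Yeller’s real benchmark: `collapse` here is a trivial stand-in function, and the sample data is synthetic.

```clojure
;; A minimal sketch of a criterium latency benchmark.
;; `collapse` and `sample-exceptions` are stand-ins, not Yeller's code.
(require '[criterium.core :as crit])

;; stand-in for the real collapsing function
(defn collapse [reports]
  (frequencies (map :type reports)))

(def sample-exceptions
  ;; 25000 synthetic exception maps - roughly the maximum batch size
  (vec (repeat 25000 {:type    "java.lang.NullPointerException"
                      :message "boom"})))

;; criterium warms up the JIT, then reports the mean execution time,
;; std-deviation, and 2.5%/97.5% quantiles seen in the numbers below
(crit/bench (collapse sample-exceptions))
```

The throughput test is separate: it simply times how many batches per second the function can sustain.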

The Tuning Process

Without any tuning, the collapse performance benchmarks after adding the new feature looked like this:

latency:
Evaluation count : 60 in 60 samples of 1 calls.
Execution time mean : 1.528072 sec
Execution time std-deviation : 122.398896 ms
Execution time lower quantile : 1.391261 sec ( 2.5%)
Execution time upper quantile : 1.863047 sec (97.5%)
Overhead used : 1.765926 ns
Found 5 outliers in 60 samples (8.3333 %)
low-severe 3 (5.0000 %)
low-mild 2 (3.3333 %)
Variance from outliers : 60.1232 %
Variance is severely inflated by outliers

collapse throughput rate mean: 5441.263908459467 (exceptions per second)

Before adding the feature 97.5th percentile latency was around 105ms, and throughput was around 250k exceptions per second. Clearly I had some work to do.

Round 1: Transients and a profiling reminder

The first thing I tried (without any profiling) was to use transients inside the automated diagnosis code itself. Clojure transients are a mechanism for turning an immutable data structure into a mutable copy (so you can update it with better performance characteristics), and then turning it back into an immutable structure. This “guess” was based on previous performance work, but unlike all of the other tuning here, was not backed by a profiler. That was a mistake; performance numbers after that “tuning” looked like this:

latency:
Evaluation count : 60 in 60 samples of 1 calls.
Execution time mean : 1.431665 sec
Execution time std-deviation : 71.458972 ms
Execution time lower quantile : 1.328528 sec ( 2.5%)
Execution time upper quantile : 1.584562 sec (97.5%)
Overhead used : 1.834184 ns
Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
Variance from outliers : 35.2613 %
Variance is moderately inflated by outliers

collapse throughput rate mean: 5419.846383604233 (exceptions per second)

A minor difference in latency, but basically no throughput jump. As soon as I saw this measurement, I immediately realized my mistake:

Doing performance tuning without profiling is a waste of time

I know this, but it’s apparently still not completely automatic for me to reach straight for the profiler.
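For reference, the transient pattern follows this shape. This is a generic sketch of the technique, not the diagnosis code itself:

```clojure
;; The transient pattern: transient -> mutating-style updates -> persistent!
;; A generic counting example, standing in for the real diagnosis code.
(defn frequencies-t [coll]
  (persistent!
   (reduce (fn [acc x]
             ;; assoc! updates the transient in place instead of
             ;; allocating a new map on every step
             (assoc! acc x (inc (get acc x 0))))
           (transient {})
           coll)))

(frequencies-t [:a :b :a]) ;; => {:a 2, :b 1}
```

The pattern only helps when the structure is built up with many updates in a tight loop; as the numbers above show, that wasn’t the bottleneck here.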

Profiling, round 1: multimethods

One of the pieces of automated diagnosis has to walk a tree of data, turning it into a set of diagnoses. This tree walk has to check what type of thing it is walking - if it is a collection, it inspects deep into the structure, but if it is a “leaf” (a string, a number, etc), then it doesn’t have to inspect inside that. Originally this was done using clojure multimethods, which allow for open polymorphism - you can add new definitions for new dispatch values from any place in your program, and override existing ones as well.

YourKit (the profiler I use) revealed that a large amount of time was spent in java.lang.Class.getInterfaces . I traced this into the multimethod implementation. The easy win was to eliminate the multimethod: switch to a manual if expression using clojure’s isa? (this code doesn’t actually need open polymorphism).
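The shape of the change was roughly the following. The names are hypothetical stand-ins for the real tree-walking code, and the bodies are trivialized to show only the dispatch:

```clojure
;; Before: a multimethod dispatching on class. Open and extensible,
;; but dispatch walks class/interface relationships at runtime
;; (hence the time in java.lang.Class.getInterfaces).
(defmulti walk-node class)
(defmethod walk-node clojure.lang.IPersistentMap [m] :walk-into-collection)
(defmethod walk-node java.lang.String [s] :leaf)

;; After: a closed cond using isa?, skipping multimethod dispatch entirely
(defn walk-node* [x]
  (cond
    (isa? (class x) clojure.lang.IPersistentMap) :walk-into-collection
    (isa? (class x) java.lang.String)            :leaf
    :else                                        :leaf))
```

This trades extensibility for speed, which is fine for a closed set of types the walker will ever see.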

Results:

latency:
Evaluation count : 120 in 60 samples of 2 calls.
Execution time mean : 623.842590 ms
Execution time std-deviation : 31.523790 ms
Execution time lower quantile : 589.430999 ms ( 2.5%)
Execution time upper quantile : 690.901461 ms (97.5%)
Overhead used : 1.804767 ns
Found 2 outliers in 60 samples (3.3333 %)
low-severe 1 (1.6667 %)
low-mild 1 (1.6667 %)
Variance from outliers : 36.8347 %
Variance is moderately inflated by outliers

collapse throughput rate mean: 14114.985888266345 (exceptions per second)

3x throughput and latency improvement

A pretty good start, but not back to anything like old performance levels yet.

Profiling round 2: isa?

So, having done that, I looked at the profiler again. After the above tweak, a lot of time was being spent in clojure’s isa? function (indirectly, but it was the root cause of 74% of the time spent). Taking a peek at the source, I remembered that isa? also consults clojure’s other dispatch hierarchy mechanism. However, in this case we don’t need that - we just need a plain java instanceof check. Clojure has a function for exactly that: instance?
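The difference between the two, sketched with a hypothetical leaf check (clojure.lang.Sequential covers vectors and lists here; the real walker checks more types):

```clojure
;; isa? consults the global hierarchy (used by multimethods) as well as
;; Java class relationships - more flexible, but more work per call:
(isa? (class {}) clojure.lang.IPersistentMap) ;; => true

;; instance? is a plain Java instanceof check:
(instance? clojure.lang.IPersistentMap {})    ;; => true

;; so the tree walk's leaf test becomes (hypothetical name):
(defn leaf? [x]
  (not (or (instance? clojure.lang.IPersistentMap x)
           (instance? clojure.lang.Sequential x))))
```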

Results from that change:

latency:
Evaluation count : 300 in 60 samples of 5 calls.
Execution time mean : 189.152482 ms
Execution time std-deviation : 12.207640 ms
Execution time lower quantile : 171.564799 ms ( 2.5%)
Execution time upper quantile : 214.078669 ms (97.5%)
Overhead used : 1.558726 ns

collapse throughput rate mean: 49478.46555787279 (exceptions per second)

3x performance improvement

Doing good. Time to go find the next bottleneck.

Profiling round 3: reduce-kv and defrecord lessons

So, back to the profiler again. My performance aims here were met by this point, but performance tuning is addictive, so I wanted to eliminate any remaining easy wins. This time, YourKit pointed at two places:

Inside the diagnosis code, we spend a bunch of time calling reduce to walk each map.
We also spend quite a bit of time looking up a value in a map, earlier in the diagnosis code (before the tree walk).

Fixes:

If the type you are calling reduce on is known to be a map, you can sometimes get a reasonable performance improvement using reduce-kv, which is reduce specialized for maps (the reducing function takes the accumulated value, a key, and a value directly).

The map lookup turned out to be caused by a missing field on a clojure defrecord . Clojure’s defrecord is a macro for defining a java class that behaves (transparently) like a normal clojure map, but with better performance. Importantly, you can assoc new keys onto record instances, and they go into a backing map that holds all undeclared keys - at the cost of slower lookups. In this case, I hadn’t declared host (which is the server the exception happened on) as a field, even though all exceptions have that data, and the diagnosis phase specifically accesses that key. So, we can add that key to the defrecord :
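Both fixes sketched below. The record name and field list are hypothetical (the real record has more fields); the point is that a declared field becomes a plain Java field read, while an undeclared key falls back to the slower extension map:

```clojure
;; reduce-kv: the reducing fn receives acc, key, and value directly,
;; avoiding the per-entry MapEntry handling of plain reduce over a map
(reduce-kv (fn [acc k v] (assoc acc k (count v)))
           {}
           {:a [1 2] :b [3]})
;; => {:a 2, :b 1}

;; Hypothetical exception record, with :host promoted to a declared field
(defrecord ExceptionReport [type message host])

;; :host is now a fast field access, not a lookup in the extension map
(:host (->ExceptionReport "NPE" "boom" "web-1")) ;; => "web-1"
```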

latency:
Evaluation count : 6 in 6 samples of 1 calls.
Execution time mean : 124.182999 ms
Execution time std-deviation : 5.806552 ms
Execution time lower quantile : 119.441999 ms ( 2.5%)
Execution time upper quantile : 133.730749 ms (97.5%)
Overhead used : 1.804577 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 13.8889 %
Variance is moderately inflated by outliers

collapse throughput rate mean: 241615.37619855272 (exceptions per second)

5x throughput improvement

Not bad! Sadly, I only took one measurement covering both of these patches, but from the profiling I think the defrecord change was much more significant.

At this point I was about finished - I’d achieved only a 10k exceptions/s drop in throughput (despite a significant new feature), and latency was acceptable. Furthermore, the rest of the profiling didn’t show up any particularly large bottlenecks - the remaining ones were mostly in hasheq (a hashing function), and I couldn’t see an obvious way to optimize that (without changing the code more significantly).

Combined, these three rounds of profiling resulted in a:

45x throughput improvement

Takeaways

I think this is a pretty good illustration of the power of profiling for performance work. Collectively, we “know” that we should use a profiler, but occasionally we don’t, and performance work without it is much more difficult.

The performance exhibited by this piece of code is nothing stellar - I’m certain I can eke out a reasonable amount more if/when I need to. Right now though, it’s fast enough that it isn’t the bottleneck, so that would be a waste of time better spent on new features.

Learn how Yeller helps you deal with clojure exceptions in production
