These days I find myself telling people that benchmark numbers don't matter on their own. What matters is the models you derive from those numbers. A refined performance model is by far the noblest and greatest achievement one can get from benchmarking: it contributes to understanding how computers, runtimes, libraries, and user code work together.

For the sake of demonstration, we will start with a sample problem that does not yet involve measuring time directly. Let us ask ourselves: "What is the cost of a volatile write?" It seems a simple question to answer with benchmarking, right? Shove in some threads, do some measurements, done!

OK then, let’s run this benchmark:

```java
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5, time = 200, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 200, timeUnit = TimeUnit.MILLISECONDS)
@Fork(50)
public class VolatileWriteSucks {
    private int plainV;
    private volatile int volatileV;

    @GenerateMicroBenchmark
    public int baseline() {
        return 42;
    }

    @GenerateMicroBenchmark
    public int incrPlain() {
        return plainV++;
    }

    @GenerateMicroBenchmark
    public int incrVolatile() {
        return volatileV++;
    }
}
```
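If you want to reproduce this, one way is to launch the benchmark programmatically through the JMH runner; the options API has changed between JMH versions, so treat this as a sketch rather than the exact invocation:

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class Launcher {
    public static void main(String[] args) throws RunnerException {
        // Pick up all benchmarks from VolatileWriteSucks; the annotations
        // on the class already configure warmup, measurement, and forking.
        Options opt = new OptionsBuilder()
                .include(".*VolatileWriteSucks.*")
                .build();
        new Runner(opt).run();
    }
}
```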

Let’s measure this on some handy platform, say my laptop (2x2 i5-2520M, 2.0 GHz, Linux x86_64, JDK 8 GA) with a single worker thread:

```
Benchmark                            Mode  Samples    Mean  Mean error  Units
o.s.VolatileWriteSucks.baseline      avgt      250   2.042       0.017  ns/op
o.s.VolatileWriteSucks.incrPlain     avgt      250   3.589       0.025  ns/op
o.s.VolatileWriteSucks.incrVolatile  avgt      250  15.219       0.114  ns/op
```

Okay. Volatile writes are almost 5x slower! That means if I use volatile writes in my application, it becomes 5x slower! We should avoid volatiles at all costs, and get an immediate performance boost! Yeah, well… I don't know how to break this to people, but this experiment has a fatal flaw.

That flaw is not in the benchmark methodology; well, almost not. The benchmark truly measures what it was intended to measure: how much time we spend incrementing the volatile variable under these particular conditions. But is that what we really want to know, namely how the system performs when we bash it with heavy-weight operations back to back? Surely not: our production code is not that stupid. In real code, heavy-weight operations are mixed with relatively lightweight ops, which amortize their costs. Therefore, to gain useful data from the experiment, we need to simulate that mix.

Emulating a real workload is a painful exercise on its own. Luckily, we have faced this issue so frequently that JMH has an emulator of its own. Meet `BlackHole.consumeCPU(int tokens)`. It "consumes" CPU time linearly in the number of tokens, and hopefully does so without contention or interference with other computations. It does not sleep; it genuinely burns CPU time. This enables us to make a more complicated experiment, which will guide us towards a clearer performance model:

```java
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5, time = 200, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 200, timeUnit = TimeUnit.MILLISECONDS)
@Fork(50)
public class VolatileBackoff {

    @Param({ "0",  "1",  "2",  "3",  "4",  "5",  "6",  "7",  "8",  "9",
            "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
            "20", "21", "22", "23", "24", "25", "26", "27", "28", "29",
            "30", "31", "32", "33", "34", "35", "36", "37", "38", "39",
            "40"})
    private int tokens;

    private int plainV;
    private volatile int volatileV;

    @GenerateMicroBenchmark
    public void baseline_Plain() {
        BlackHole.consumeCPU(tokens);
    }

    @GenerateMicroBenchmark
    public int baseline_Return42() {
        BlackHole.consumeCPU(tokens);
        return 42;
    }

    @GenerateMicroBenchmark
    public int baseline_ReturnV() {
        BlackHole.consumeCPU(tokens);
        return plainV;
    }

    @GenerateMicroBenchmark
    public int incrPlain() {
        BlackHole.consumeCPU(tokens);
        return plainV++;
    }

    @GenerateMicroBenchmark
    public int incrVolatile() {
        BlackHole.consumeCPU(tokens);
        return volatileV++;
    }
}
```

There, we "back off" a little before doing the operation under test. `@Param` allows us to vary the backoff, which helps to estimate how well the amortization works. You may also notice there are two more baseline implementations; we put them there deliberately, to illustrate another point below.
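Since we are now leaning on `BlackHole.consumeCPU` quite heavily, it is worth peeking at what such a token burner does. Roughly, it looks like the sketch below; the exact code differs across JMH versions, so treat it as an illustration, not the authoritative implementation:

```java
// A sketch of a consumeCPU-style token burner.
public class ConsumeCpuSketch {
    // Volatile sink: keeps the result observable, so the JIT compiler
    // cannot prove the loop is dead code and eliminate it.
    public static volatile long consumedCPU = 42;

    public static void consumeCPU(long tokens) {
        long t = consumedCPU; // randomized start defeats memoization
        for (long i = 0; i < tokens; i++) {
            // One "token": a cheap linear-congruential step.
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        // Rarely-taken write-back guarantees the result is "used".
        if (t == 42) {
            consumedCPU += t;
        }
    }
}
```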

OK, doing the experiment:
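[Chart: average time vs. number of backoff tokens, all five benchmark methods]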

If you make graphs like these and think they are alright, you are going to Hell. If I get there earlier, I will make sure you are shown only cryptic trend lines, potentially containing the data on how to break free and get to Heaven. Those charts, obviously, will not say anything you would want them to say. Ever.

We may want to add contrast to the chart by subtracting `baseline_Plain`, ignoring the other baselines for a moment:
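[Chart: average time vs. backoff tokens, with `baseline_Plain` subtracted]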

Looks cool, and seems to prove that the amortization works. It would seem from the data that after 20 consumed tokens we may stop caring about the cost of volatile ops. Looking back at the experimental data, or even at the chart above (yes, sometimes you can extract a tiny bit of useful data from a bad chart), 20 tokens amounts to roughly 50 ns of backoff, which means a volatile op every 50 ns is only marginally slower than a plain op every 50 ns.

Okay now, we also had a few other baselines. Before we look at them, let us ask ourselves: "Was it really a good idea to subtract `baseline_Plain`?" The answer is, unfortunately, "No, it was not!" Here is why; let us subtract `baseline_Return42` instead:
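[Chart: average time vs. backoff tokens, with `baseline_Return42` subtracted]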

Double U. Tee. Ef. Increments are now faster than the baseline? This is not really surprising to seasoned performance engineers, because performance is not composable: there is no way to predict how two modules with known independent performance will perform together.

"Surely you are joking", someone would say, "`baseline_Return42` is only slower because it obviously does more operations than baseline_Plain , namely returning the integer constant". Fair enough, let us look at something that does even more work: baseline_ReturnV , which also reads the integer from memory before returning it:

See, it works faster! The point is, baseline measurements are also experimental data, and they are good as a reference for the effect under test. You cannot promote them to golden table values you can unconditionally trust. With baselines like that, we might as well compare the `incrPlain` vs `incrVolatile` difference directly:
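[Chart: difference in average time between `incrVolatile` and `incrPlain` vs. backoff tokens]

For the zero-backoff data points from the very first run, a back-of-envelope version of this comparison looks as follows. This is a hypothetical helper, not part of JMH; it assumes the two mean errors are independent and roughly Gaussian, so they combine in quadrature (a proper pairwise comparison would work on the raw per-fork samples instead):

```java
public class DiffSketch {
    /** Difference of two benchmark estimates: mean and combined error. */
    static double[] difference(double meanA, double errA,
                               double meanB, double errB) {
        double mean = meanA - meanB;
        double err  = Math.sqrt(errA * errA + errB * errB);
        return new double[] { mean, err };
    }

    public static void main(String[] args) {
        // Zero-backoff numbers from the first table:
        // incrVolatile = 15.219 +- 0.114, incrPlain = 3.589 +- 0.025 ns/op
        double[] d = difference(15.219, 0.114, 3.589, 0.025);
        System.out.printf("%.3f +- %.3f ns/op%n", d[0], d[1]); // 11.630 +- 0.117
    }
}
```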

This chimes with our earlier observation that volatile write costs are dramatically amortized when we do not choke the system with them. This exercise shows a few important points:

- We need performance models to predict the system behavior across a wide variety of conditions;

- Building performance models implicitly assumes a controlled experiment, and that control may expose questionable behaviors of the experimental setup, as we saw above with the baselines;

- And most importantly, these combinatorial experiments allow us to mix operations in different ways, and reason about their independent performance with more predictive power.