JMH offers support for running multi-threaded benchmarks, which is a new feature for such frameworks (to the best of my knowledge). I'll cover some of the available annotations and command line options I used for this benchmark. First up, let's look at the code:

-t 4 - will launch the benchmark with 4 benchmark threads hitting the code.

- will launch the benchmark with 4 benchmark threads hitting the code. -tc 1,2,4,8 - will replace both the count of iterations and the number of threads by the thread count pattern described. This example will result in the benchmark being run for 4 iterations, iteration 1 will have 1 benchmarking thread, the second will have 2, the third 4 and the last will have 8 threads.

- will replace both the count of iterations and the number of threads by the thread count pattern described. This example will result in the benchmark being run for 4 iterations, iteration 1 will have 1 benchmarking thread, the second will have 2, the third 4 and the last will have 8 threads. -s - will scale the number of threads per iteration from 1 to the maximum set by -t

And the results are?

As sanity check, during development I ran the benchmark locally. My laptop is a dual core MBA and is really not a suitable environment for benchmarking, still the results matched the theory although with massive measurement error:

Even at this point it was clear the false sharing is a significant issue. Results are consistent, if unstable in their suggestion of how bad it is.



Also as expected, volatile writes are slower than lazy/ordered writes which are slower than plain writes.

Once I was happy with the code I deployed it on a more suitable machine. A dual socket Intel Xeon CPU E5-2630 @ 2.30GHz, using OpenJDK 7u9 on CentOS 6.3(OS and JDK are 64bit). This machine has 24 logical cores and is tuned for benchmarking (what that means is a post onto itself). First up I thought I'd go wild and use all the cores, just for the hell of it. Results were interesting if somewhat flaky.

Using all the available CPU power is not a good idea if you want stable results ;-)

I decided to pin the process to a single socket and use all it's cores, this leaves plenty of head room for the OS and give me data on scaling up to 12 threads. Nice. At this point I could have a meaningful run with 1,2,4,8,12 threads to see the way false sharing hurts scaling for the different write methods. At this point things started to get more interesting:

Volatile writes suffering from false sharing degrade in overall performance as more threads are added, 12 threads suffering from false sharing achieve less than 1 thread.



Once false sharing is resolved volatile writes scale near linearly until 8 threads, the same is true for lazy and plain. 8 to 12 seem to not scale quite so cleanly.

Lazy/Plain writes do not suffer quite as much from false sharing but the more threads are at work the larger the difference (up to 2.5x) becomes between false sharing and unshared.

Scaling from 1-12 threads I focused on the 12 thread case, running longer/more iterations and observing the output, it soon became clear the results for the false shared access suffer from high run-to-run variance. The reason for that, I assumed, was the different memory layout from run to run resulting in different cache line populations. In the case of 12 threads we have 12 adjacent longs. Given the cache line is 8 longs wide they can be distributed over 2 or 3 lines in a number of ways (2-8-2,1-8-3,8-4,7-5,6-6). Going ever deeper into the rabbit hole I thought I could print out the thread specific stats (choose -otss , JMH delivers again :-) ). The thread stats did deliver a wealth of information, but at this point I hit slight disappointment... There was no (easy) way for me to correlate thread numbers from the run to thread numbers in the output, which would have made my life slightly easier in characterising the different distributions...

, JMH delivers again :-) ). The thread stats did deliver a wealth of information, but at this point I hit slight disappointment... There was no (easy) way for me to correlate thread numbers from the run to thread numbers in the output, which would have made my life slightly easier in characterising the different distributions... I had enough and played with the data until I was happy with the results. And I finally found a use for a stacked bar chart :-). Each thread score is a different color, the sum of all thread scores is the overall score.

We can put our expectations/theory to the test by running the benchmark and scaling the number of threads participating. JMH offers us several ways to control the number of threads allocated to a benchmark:I ended up either setting the number of threads directly or using the 'pattern' option. Variety is the spice of life though :-) and its nice to have the choice.False sharing as a phenomena is a bit random (in Java where memory alignment is hard to control) as it relies on the memory layout falling a particular way on top of the cache line. Consider 2 adjacent memory locations being written to by 2 different threads, should they be placed such that they are on different cache lines in a particular run then that run will not suffer false sharing. This means we'll want to get statistics from a fair amount of runs to make sure we hit the variations we expect on memory layout. JMH supports running multiple forks of your benchmark by using theoption to determine the number of forks.Left as an exercise to the reader :-). I implore you to go have a play, come on.Here's a high level review on my experimentation process for this benchmark:Here's the results for plain/volatile shared/unshared running on 12 cores (20 runs, 5 iterations per run:):