Of course, before doing just about anything about a hypothetical performance problem, you have to understand if you actually do have a performance problem ("Hi, I’m Aleksey and I have a performance problem"). First, we have to admit that A*FU classes are used on very busy hotpaths in high-performance code, and sometimes people resort to Unsafe to keep those costs at bay.
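To make that concrete, here is a hedged sketch of the kind of hot-path A*FU usage in question (the `Counter` class and names are hypothetical, not from the benchmark below): a field updater gives you atomic read-modify-write on a `volatile` field without paying for a separate `AtomicInteger` object per instance.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical example of hot-path A*FU use: a lock-free counter
// embedded directly in the owning object.
class Counter {
    private static final AtomicIntegerFieldUpdater<Counter> COUNT =
        AtomicIntegerFieldUpdater.newUpdater(Counter.class, "count");

    private volatile int count; // mutated only through COUNT

    int increment() {
        // atomic RMW on the field itself, no wrapper object needed
        return COUNT.incrementAndGet(this);
    }

    int get() {
        return count;
    }
}
```

This is why the per-access cost of A*FU matters: concurrent libraries put exactly this shape of code on their hottest paths.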

Microbenchmarking comes in very handy for research like this, as it allows us to focus on the particular code sample running under the specified conditions. Note that benchmarks seldom lie — computers are generally not equipped with a mind to even consider lying — but they do very frequently answer the wrong question, because you screwed up the environmental/benchmarking setup.

To answer "How bad are Atomic*FieldUpdaters, really?", we need to establish the baseline. We have chosen plain field accesses as the baseline, because that might be the fastest way to access memory in Java programs (as you will see later). With that in mind, we can write the benchmark like this:

@State(Scope.Benchmark)
public class AFUBench {

    A a;
    B b;

    @Setup
    public void setup() {
        a = new A();
        b = new B(); // pollute the class hierarchy
    }

    @Benchmark
    public int updater() {
        return a.updater();
    }

    @Benchmark
    public int plain() {
        return a.plain();
    }

    public static class A {
        static final AtomicIntegerFieldUpdater<A> UP =
            AtomicIntegerFieldUpdater.newUpdater(A.class, "v");

        volatile int v;

        public int updater() {
            return UP.get(this);
        }

        public int plain() {
            return v;
        }
    }

    public static class B extends A {}
}

Of course, the actual benchmarks where the issues were discovered are more thorough and cover many more cases. We use a simplified example in this post for the sake of better flow: the example above is one of the worst cases for A*FU code, as we will see later. If we run this benchmark on my development desktop (i7-4790K, Linux x86_64) with the latest JDK 8u66, then we will see this:

Benchmark         Mode  Cnt  Score   Error  Units
AFUBench.plain    avgt   25  1.965 ± 0.001  ns/op
AFUBench.updater  avgt   25  3.007 ± 0.004  ns/op

"This is just one nanosecond", one might say, but these nanoseconds add up on hot paths. The performance difference is much more visible on hardware that cannot speculate heavily. I frequently run the benchmarks on the Atom dev server from my home farm (Atom Z530, Linux i586), as that beast is very sensitive to the generated code quality.

Benchmark         Mode  Cnt   Score   Error  Units
AFUBench.plain    avgt   25  21.436 ± 0.014  ns/op
AFUBench.updater  avgt   25  34.669 ± 0.025  ns/op

Whoa. >1.6x performance difference and it is not a single nanosecond anymore. Bad.
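A quick back-of-the-envelope check of these claims, using the scores copied from the two tables above (this is just arithmetic on the reported averages, not new measurements):

```java
public class ScoreCheck {
    public static void main(String[] args) {
        // Desktop (i7-4790K) scores, ns/op
        double plainDesktop   = 1.965;
        double updaterDesktop = 3.007;
        // Atom Z530 scores, ns/op
        double plainAtom   = 21.436;
        double updaterAtom = 34.669;

        // On the Atom, the updater path is indeed >1.6x slower:
        System.out.printf("Atom ratio: %.2fx%n", updaterAtom / plainAtom);

        // On the desktop, the ~1 ns/op penalty costs ~0.1 s per 10^8 accesses:
        double extraNs = updaterDesktop - plainDesktop;
        System.out.printf("Desktop: +%.3f ns/op, %.3f s per 1e8 ops%n",
                extraNs, 1e8 * extraNs * 1e-9);
    }
}
```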

Developing fixes and conducting performance analysis on that hardware is more complicated than hacking directly on my dev desktop, since it entails cross-compiling x86_64 → x86, so we will keep bashing the tests on the very fast x86_64, while paying attention to lower-level observables, like hardware counters and the generated code.

It helps to quickly characterize the workload to see where the problem might be. JMH provides bindings to Linux’s perf_event via -prof perfnorm, which normalizes the counters per benchmark operation:

Benchmark                                 Mode  Cnt   Score    Error  Units
AFUBench.plain                            avgt   25   1.989 ±  0.034  ns/op
AFUBench.plain:·CPI                       avgt    5   0.318 ±  0.012   #/op
AFUBench.plain:·L1-dcache-load-misses     avgt    5   ≈ 10⁻³           #/op
AFUBench.plain:·L1-dcache-loads           avgt    5  17.368 ±  3.469   #/op
AFUBench.plain:·L1-dcache-store-misses    avgt    5   ≈ 10⁻⁴           #/op
AFUBench.plain:·L1-dcache-stores          avgt    5   4.345 ±  0.874   #/op
AFUBench.plain:·branch-misses             avgt    5   ≈ 10⁻⁴           #/op
AFUBench.plain:·branches                  avgt    5   5.775 ±  1.114   #/op
AFUBench.plain:·cycles                    avgt    5  11.472 ±  2.525   #/op
AFUBench.plain:·instructions              avgt    5  36.073 ±  6.926   #/op
AFUBench.updater                          avgt   25   3.009 ±  0.002  ns/op
AFUBench.updater:·CPI                     avgt    5   0.280 ±  0.002   #/op
AFUBench.updater:·L1-dcache-load-misses   avgt    5   0.001 ±  0.004   #/op
AFUBench.updater:·L1-dcache-loads         avgt    5  24.832 ±  2.255   #/op
AFUBench.updater:·L1-dcache-store-misses  avgt    5   ≈ 10⁻³           #/op
AFUBench.updater:·L1-dcache-stores        avgt    5   5.838 ±  0.514   #/op
AFUBench.updater:·branch-misses           avgt    5   ≈ 10⁻⁴           #/op
AFUBench.updater:·branches                avgt    5   8.775 ±  0.878   #/op
AFUBench.updater:·cycles                  avgt    5  17.587 ±  1.707   #/op
AFUBench.updater:·instructions            avgt    5  62.859 ±  6.344   #/op

This data is abbreviated to show where the problems lurk. There seem to be no ILP problems, as both versions run with CPI = 0.3 clk/insn, which is a good CPI for my Haswell. In fact, A*FU code has even better ILP. But the A*FU code does more loads, more stores (which includes spilling operands onto the stack), more branches, and more instructions overall. So, it does look like we need to shave the excess code off the hotpath.
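The CPI figures in the table are themselves derived counters; as a sanity check, dividing the reported cycles by the reported instructions reproduces them (simple arithmetic on the averages above, nothing more):

```java
public class CpiCheck {
    public static void main(String[] args) {
        // Per-op averages from the perfnorm output above
        double plainCpi   = 11.472 / 36.073; // plain: cycles / instructions
        double updaterCpi = 17.587 / 62.859; // updater: cycles / instructions
        System.out.printf("plain CPI ≈ %.3f, updater CPI ≈ %.3f%n",
                plainCpi, updaterCpi);
    }
}
```

Both come out around 0.3 clk/insn, matching the table: the updater path is not stalling, it is simply executing more code.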

Indeed, if we use PrintAssembly to dump the generated code, we can clearly see the difference. JMH provides handy integration here with -prof perfasm, which uses perf to locate the hot regions in the compiled code. In this example, and in the examples further on, we only show the hot benchmark loops, along with JMH benchmark scaffolding (Blackholes, operation counters, termination flags).

This is what the plain scenario looks like:

LOOP:
 ↗ mov    0x8(%rsp),%r10
 │ mov    0xc(%r10),%r10d        ; get field $a
 │ mov    0xc(%r12,%r10,8),%edx  ; get field $a.v
 │ mov    0x10(%rsp),%rsi        ; prepare and call Blackhole.consume
 │ callq  CONSUME
 │ mov    0x18(%rsp),%r10
 │ movzbl 0x94(%r10),%r10d       ; get field $isDone
 │ add    $0x1,%rbp              ; ops++
 │ test   %eax,0x16d88ff1(%rip)  ; safepoint poll
 │ test   %r10d,%r10d            ; if (!isDone), get back
 ╰ je     LOOP

This is almost the absolute minimum: the majority of the code is the benchmarking infrastructure, with only two instructions as the "business" payload. The updater scenario has much more cruft:

LOOP:
 ↗ mov    0x10(%rsp),%r10
 │ mov    0xc(%r10),%r10d        ; get field $a
 │ mov    0x8(%r12,%r10,8),%r9d  ; get $a.class
 │ movabs $0x719d45a88,%r11      ; {constant: AIFUImpl instance}
 │ mov    0xc(%r11),%r11d        ; get AIFUImpl.tclass
 │ movabs $0x0,%r8               ; <some magic: Class.isInstance>
 │ lea    (%r8,%r9,8),%r8
 │ mov    0x68(%r8),%r9
 │ mov    %r11,%r8
 │ shl    $0x3,%r8
 │ cmp    %r8,%r9
 │ jne    SLOWPATH_1
 │ movabs $0x719d45a88,%r11      ; {constant: AIFUImpl instance}
 │ mov    0x18(%r11),%r8d        ; get AIFUImpl.cclass
 │ test   %r8d,%r8d              ; null check
 │ jne    SLOWPATH_2             ; if (cclass == null), jump out
 │ mov    %rcx,(%rsp)
 │ mov    0x10(%r11),%r11        ; get AIFUImpl.offset
 │ shl    $0x3,%r10              ; unpack $a reference
 │ mov    (%r10,%r11,1),%edx     ; Unsafe: get field $a@offset
 │ mov    0x18(%rsp),%rsi        ; prepare and call Blackhole.consume
 │ nop
 │ callq  CONSUME
 │ mov    (%rsp),%rcx
 │ movzbl 0x94(%rcx),%r10d       ; get field $isDone
 │ add    $0x1,%rbp              ; ops++
 │ test   %eax,0x181c55ee(%rip)  ; safepoint poll
 │ test   %r10d,%r10d            ; if (!isDone), get back
 ╰ je     LOOP