My name is Marc Brooker. I've been writing code, reading code, and living vicariously through computers for as long as I can remember. I like to build things that work. I also dabble in brewing, cooking and skiing.I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I lead engineering on AWS Lambda and our other serverless products. Before that, I worked on EC2 and EBS.All opinions are my own.

Are volatile reads really free?

There is a somewhat common belief among Java programmers that reads from volatile variables are free. Are volatile variable 'reads' as fast as normal reads? from StackOverflow is a perfect example, because it's the top result I get from Google for most related searches. The top answer says:

On an x86, there is no additional overhead associated with volatile reads.

and goes on to say:

JSR-133 classifies four barriers "LoadLoad, LoadStore, StoreLoad, and StoreStore". Depending on the architecture, some of these barriers correspond to a "no-op", meaning no action is taken, others require a fence. There is no implicit cost associated with the Load itself, though one may be incurred if a fence is in place. In the case of the x86, only a StoreLoad barrier results in a fence.

Reading Doug Lea's The JSR-133 Cookbook for Compiler Writers gives the same impression. It says:

Issue a StoreStore barrier before each volatile store.

and

Issue a StoreLoad barrier after each volatile store. Note that you could instead issue one before each volatile load, but this would be slower for typical programs using volatiles in which reads greatly outnumber writes. Alternatively, if available, you can implement volatile store as an atomic instruction (for example XCHG on x86) and omit the barrier. This may be more efficient if atomic instructions are cheaper than StoreLoad barriers.

and

Issue LoadLoad and LoadStore barriers after each volatile load.

Doug lists the StoreStore, LoadLoad and LoadStore barriers as noops on x86. It makes some sense to read this as a validation of the idea that volatile reads are free. No extra instructions are issued for the reads, and instructions are what makes computers take time, so no more time is taken. Right?

The first step to answering that question is seeing if the Java JRE 1.6 actually behaves like Doug says it should. Possibly the easiest way to do this is to add the -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly options to the Java command line, and see the assembly that's being generated by the JIT. I wrote a very small toy program which accessed a volatile variable in a loop (to force the JIT to kick in).

Here's the code it generated for both volatile and non-volatile reads:

nop ;*synchronization entry mov 0x10(%rsi),%rax ;*getfield x

And for volatile writes:

xchg %ax,%ax movq $0xab,0x10(%rbx) lock addl $0x0,(%rsp) ;*putfield x

The JVM is behaving exactly like it should according the to the spec, with one minor difference. It's using lock addl $0x0,(%rsp) as the StoreLoad barrier. The gory details can be found in chapter 8 of volume 3A of Intel's IA-32 developer manual, but the short version is:

Locked operations are atomic with respect to all other memory operations and all externally visible events. [...] Locked instructions can be used to synchronize data written by one processor and read by another processor.

While addl $0x0,(%rsp) does nothing, lock addl $0x0,(%rsp) behaves like the StoreLoad barrier that Doug Lea talks about. The minor difference from Lea's description is the xchg %ax,%ax before the store. xchg does have some memory ordering influence, but I'm not sure of the role it plays here - it seems to me like this would be correct without it, so there must be something I am missing.

Anyway, it looks like what the JVM is doing in this case follows Doug Lea's recommendations. We can expect volatile reads to be free, right?

To test the theory, I wrote a program which does the following:

Launches 3 reader threads, which try to do 500 million reads from a shared variable and stores to a local variable

Launches 1 writer thread, which increments the shared variable 500 million times

Synchronizes their start times, and times them to completion

I ran it on my quad-core machine (Intel Q6600), and did three runs. Once, with a non-volatile shared variable, once with a volatile shared variable and once with a volatile shared variable and no writer thread. The graph below summarizes the results.

In this experiment, volatile writes are nearly 100x more expensive than normal writes. That corresponds well with the volatile writes are expensive mental model. What's most notable, though, is that volatile reads when there is a writer are about 25x more expensive than non-volatile reads. The included error bars show one standard deviation both sides of the mean run time, and clearly show that the speed of volatile read access also varies widely. If the code's the same, then what's going on?

What we are seeing is the effect of the processor trying to keep its contract with the lock instruction, and make sure that the data written by the writer is visible to the readers. Memory access and cache coherency are some of the most expensive things that modern processors do, and it's not easy to predict their performance impact from the assembly code. On the single-threaded single-core x86 processors of the past it was hard enough to read an assembly dump and predict performance. On modern multicore systems, it's extraordinarily difficult.

The third set of results tells another interesting story about volatile reads, one that's closer to the meaning of volatile in C. To illustrate the difference, I wrote a much simpler program which reads a variable, increments it, and writes it back to another variable. Without volatile, the compiler generates code like this:

nop ;*getfield isum add %r10,%r11 add %r10,%r11 ... 12 more adds ... add %r10,%r11 mov %r11,0x10(%rbx) ;*putfield isum

The loop is unrolled, and the variables simply stored in registers. With volatile, the code looks like this:

test %r11d,%r11d je 0x00007f89bd05e7b4 ;*synchronization entry mov 0x18(%r11),%r10 ;*getfield x add %r10,%r9 ;*ladd mov %r9,0x10(%rbx) ;*putfield isum mov 0xc(%rbx),%r11d ;*getfield this$0

I've only included one piece of the unrolled loop here. The compiler wrapped five sections like this in the larger outer loop, reducing the loop overhead somewhat. It does a whole lot more work per step that the non-volatile version (the details are unimportant), and isn't nearly as aggressive about unrolling (I assume because it's making a heuristic decision based on code size).

It appears as though reads to volatile variables are not free in Java on x86, or at least on the tested setup. It's true that the difference isn't so huge (especially for the read-only case) that it'll make a difference in any but the more performance sensitive case, but that's a different statement from free.

EDIT: It's worth noting that I am not criticizing Doug Lea's The JSR-133 Cookbook for Compiler Writers. He doesn't say that volatile reads are free, and doesn't even suggest it. That's just something that many people seem to have inferred from his writing.