Even though the spec talks about the actions (x = 1), (x = 2), and (r1 = x) as synchronization actions tied into the synchronization order, blah blah blah, it does not mean the actual runtime has to perform all those program operations. Most runtimes would, though, because the analysis of whether anyone could observe (x = 1) is generally rather complicated.

Similarly, when the JMM says (for example) that there are program actions tied into the synchronization order, it does not mean the actual physical implementation has to emit those loads and stores into the machine code!

In fact, this is optimizable in most languages, and it is allowed to happen because the observed result of the execution is one of the results of an abstract machine execution. As long as programs cannot call the runtime's bluff (that is, detect something the language specification disallows), the runtime is free to do whatever it wants under the covers.

…​it would be remarkably odd to require that the language runtime actually allocates storage for all three local variables, stores the values there, loads them back, adds them, etc. This whole method should be optimizable to something like this:
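Taking a hypothetical three-local method as a stand-in (the method name and values below are made up purely for illustration), the transformation might look like this:

int sum() {
    int a = 1;          // hypothetical locals, purely for illustration
    int b = 2;
    int c = 3;
    return a + b + c;   // abstract machine: store the locals, load them back, add them up
}

// ...may be compiled as if it were simply:

int sum() {
    return 6;           // same observable result, no storage, no loads, no stores
}

The observable result is identical, so no conforming program can tell the difference.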

The actual requirement is much weaker: the runtime is obliged to produce results as if there were a compatible abstract machine execution backing them. How the runtime performs the actual computation is up to the runtime. It's all smoke and mirrors.

However, that is a very misleading way of thinking about the issue. The language specification describes the behavior of the abstract machine executing the program. It is the runtime's job to emulate the behavior of that abstract machine. The point of contention here is that a conforming runtime is not obliged to compile the program exactly as it is written in the source code.

The first order of business is the confusion between the language specification and what hits the real hardware. It is easy, nay comfortable, to read the language rules and think that this is exactly what the machine will do.

The Cookbook was written to help actual compiler writers quickly come up with a conforming implementation. Are you a compiler writer looking for implementation guidance? No? Thought so. Move along, then. The bad thing that happens after you digest the JSR 133 Cookbook is that you start to believe in…​

The JSR 133 Cookbook is one of the possible, yet conservative, sets of rules for implementing the JMM. "One of the possible" means that a conforming implementation does not have to follow the Cookbook, as long as it satisfies the JMM requirements. "Conservative" means it does not go into the intricacies of the model, and instead provides a very simple, yet coarse, implementation. It may be unnecessarily strong for practical use. We can go even deeper in our conservatism and still arrive at a JMM-conforming implementation: make sure the JVM runs on a single core, or have a Global Interpreter Lock, and then concurrency is trivial.

Quite a few folks who get burned by the abstract JMM rules rest their gaze on the JSR 133 Cookbook for Compiler Writers. All those sweet, easy-to-understand barriers are much easier to grasp than the arcana of the formal model. So, many boldly suggest that the Cookbook is a brief description (or even a short equivalent) of the Java Memory Model.

2.3. Myth: Barriers Are The Sane Mental Model

…​while in fact they are not: they are merely an implementation detail. The easiest example of why barriers are not reliable as a mental model is the following simple test case with two back-to-back synchronized statements:

@JCStressTest
@State
public class SynchronizedBarriers {
    int x, y;

    @Actor
    void actor() {
        synchronized (this) {
            x = 1;
        }
        synchronized (this) {
            y = 1;
        }
    }

    @Actor
    void observer(IntResult2 r) {
        // Caveat: get_this_in_order()-s happen in program order
        r.r1 = get_this_in_order(y);
        r.r2 = get_this_in_order(x);
    }
}

Naively, you may think the 1, 0 case is prohibited, because the synchronized sections should execute in an order consistent with program order.

Of course, without keeping the reads in order, the result 1, 0 is trivially achievable. But that does not make an interesting test case. The actual test is clever about this: it uses the new VarHandles "opaque" access mode, which inhibits these optimizations and exposes the reads to hardware in the same order:

private static final VarHandle VH_X, VH_Y;

static {
    try {
        VH_X = MethodHandles.lookup().findVarHandle(Test.class, "x", int.class); (1)
        VH_Y = MethodHandles.lookup().findVarHandle(Test.class, "y", int.class); (1)
    } catch (Exception e) {
        throw new IllegalStateException(e);
    }
}

@Actor
public void observer(IntResult2 r) {
    r.r1 = (int) VH_Y.getOpaque(this); (2)
    r.r2 = (int) VH_X.getOpaque(this); (2)
}

(1) Look up the VarHandles for the fields
(2) Get the associated field value from this object, with "opaque" access mode

You may get a similar effect with non-inlined get_this_in_order() methods that are also opaque to the optimizer today. Coupled with hardware that does not reorder reads, this has the reads satisfied in program order. You can emit a full barrier between the loads if you want to be extra safe in the face of weaker hardware, although it will muddy the waters with barrier interactions. The point of this example, however, is to see what is happening on the writer side, assuming everything happens in order on the reader side. Do not overlook the writer side while chasing the technicalities on the reader side.
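For reference, the "extra safe" reader mentioned above could look like the sketch below. It assumes the VarHandle setup from the previous snippet and uses VarHandle.fullFence() (available since Java 9); as noted, the interesting action in this test is still on the writer side.

@Actor
public void observer(IntResult2 r) {
    r.r1 = (int) VH_Y.getOpaque(this);
    VarHandle.fullFence(); // pessimistic full barrier: keeps the two loads ordered even on weaker hardware
    r.r2 = (int) VH_X.getOpaque(this);
}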

Let’s see what barriers tell us about code semantics. In pseudo-code, this will do:

void actor() {
    [LoadStore]  // between monitorenter and normal store
    x = 1;
    [StoreStore] // between normal store and monitorexit
    [StoreLoad]  // between monitorexit and monitorenter
    [LoadStore]  // between monitorenter and normal store
    y = 1;
    [StoreStore] // between normal store and monitorexit
    [StoreLoad]  // between monitorexit and monitorenter
}

void observer() {
    // Caveat: get_this_in_order()-s happen in program order
    r.r1 = get_this_in_order(y);
    r.r2 = get_this_in_order(x);
}

Yup, seems fine: x = 1 cannot go past y = 1, because it would meet barriers long before that.

However, the JMM itself allows observing 1, 0, because the reads of x and y are not tied by any ordering constraints, and therefore there exists a plausible execution that justifies observing 1, 0. More formally: in whatever conforming execution you can imagine, the reads of x and y are not synchronization actions, and therefore the SO rules do not apply to the induced actions. The reads are not tied into HB, and therefore no HB rules prevent reading the racy values. There are no causality loops in observing 1, 0 either.

Allowing this behavior in the model is intentional for two reasons. First of all, the hardware should be able to perform independent operations in whatever order it wants to maximize performance. Secondly, this enables interesting and important optimizations.

For instance, in the example above, we can coarsen the back-to-back locks:

void actor() {
    synchronized (this) {
        x = 1;
    }
    synchronized (this) {
        y = 1;
    }
}

// ... becomes:

void actor() {
    synchronized (this) {
        x = 1;
        y = 1;
    }
}

…​which improves performance (because lock acquisition is costly) and allows further optimizations within the merged synchronized block. Notably, since the writes of x and y are independent, we may allow the hardware to execute them in an arbitrary order, or allow optimizers to shift them around.
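One conceivable shape of the transformed writer (a sketch only; no particular JIT is obliged to produce exactly this) is:

void actor() {
    synchronized (this) {
        // inside the merged region the two writes are independent,
        // so the optimizer or the hardware may effectively perform them in either order
        y = 1;
        x = 1;
    }
}

An observer that reads y first and then x can now legitimately see 1, 0.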

If you run the example above on an actual JVM and hardware, this is what happens on x86 with a JDK 9 "fastdebug" build (needed to gain access to instruction scheduling fuzzing):

[OK] net.shipilev.jmm.LockCoarsening
    (fork: #1, iteration #1, JVM args: [-server, -XX:+UnlockDiagnosticVMOptions, -XX:+StressLCM, -XX:+StressGCM])
  Observed state   Occurrences              Expectation  Interpretation
            0, 0    43,558,372               ACCEPTABLE  All other cases are acceptable.
            0, 1        22,512               ACCEPTABLE  All other cases are acceptable.
            1, 0         1,565   ACCEPTABLE_INTERESTING  X and Y are visible in different order
            1, 1     1,372,341               ACCEPTABLE  All other cases are acceptable.

Notice the interesting case: that is our 1, 0. Surprise!

Disabling lock optimizations with -XX:-EliminateLocks trims down the number of occurrences of this interesting case to zero:

[OK] net.shipilev.jmm.LockCoarsening
    (fork: #1, iteration #1, JVM args: [-server, -XX:+UnlockDiagnosticVMOptions, -XX:+StressLCM, -XX:+StressGCM, -XX:-EliminateLocks])
  Observed state   Occurrences              Expectation  Interpretation
            0, 0    52,892,632               ACCEPTABLE  All other cases are acceptable.
            0, 1       163,611               ACCEPTABLE  All other cases are acceptable.
            1, 0             0   ACCEPTABLE_INTERESTING  X and Y are visible in different order
            1, 1     1,825,907               ACCEPTABLE  All other cases are acceptable.

On POWER, the interesting case is present even without messing with instruction scheduling, because hardware guarantees are weaker:

[OK] net.shipilev.jmm.LockCoarsening
    (fork: #1, iteration #1, JVM args: [-server])
  Observed state   Occurrences              Expectation  Interpretation
            0, 0     7,899,607               ACCEPTABLE  All other cases are acceptable.
            0, 1         4,089               ACCEPTABLE  All other cases are acceptable.
            1, 0           162   ACCEPTABLE_INTERESTING  X and Y are visible in different order
            1, 1       240,682               ACCEPTABLE  All other cases are acceptable.

This example does not mean it is possible to enumerate all "dangerous" optimizations and disable them. Modern optimizers work as complicated graph matching-and-crunching machines, and reliably disabling a particular kind of optimization usually means disabling the optimizer completely.

There are other kinds of plausible optimizations around barriers that runtimes already make, or will choose to make in the future. Even the JSR 133 Cookbook has a "Removing Barriers" section that gives a short outline of the elision techniques readily available.

Given that, how can you trust barriers if they are routinely removable?