Consider this simple JMH benchmark:

    import org.openjdk.jmh.annotations.*;

    import java.util.concurrent.TimeUnit;

    @Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
    @Fork(3)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Benchmark)
    public class EmptyBench {

        // Named "test" to match the EmptyBench.test rows in the results below.
        @Benchmark
        public void test() {
            // This method is intentionally left blank.
        }

    }

You might think this benchmark measures the empty method, but in reality it measures the minimal infrastructure code that services the benchmark: the code that counts the iterations and waits for the iteration time to be over. Fortunately, that piece of code is rather fast, so it can be dissected in full with the help of -prof perfasm.

This is out-of-the-box OpenJDK 8u191:

      3.60%  ↗  ...a2: movzbl 0x94(%r8),%r10d        ; load "isDone" field
      0.63%  │  ...aa: add    $0x1,%rbp              ; iterations++;
     32.82%  │  ...ae: test   %eax,0x1765654c(%rip)  ; global safepoint poll
     58.14%  │  ...b4: test   %r10d,%r10d            ; if !isDone, do the cycle again
             ╰  ...b7: je     ...a2

The empty method got inlined and everything evaporated out of it, so only the infrastructure remains.
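The loop in the disassembly corresponds roughly to the shape below. This is a minimal sketch, not the actual JMH-generated stub: the class name `MeasurementLoopSketch`, the `payload()` method, and the timer thread that flips `isDone` are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the JMH-generated measurement stub; all names are illustrative.
class MeasurementLoopSketch {
    // Flipped by the timer thread once the iteration time is over (assumed volatile here).
    static volatile boolean isDone;

    static void payload() {
        // Intentionally left blank; the JIT is expected to inline this away.
    }

    // Runs the payload until isDone is set and returns the number of invocations.
    static long measure(long durationMillis) {
        isDone = false;
        Thread timer = new Thread(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(durationMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            isDone = true; // set even when interrupted, so the loop always terminates
        });
        timer.setDaemon(true);
        timer.start();

        long iterations = 0;
        do {
            payload();     // the benchmark method
            iterations++;  // corresponds to "add $0x1,%rbp"
        } while (!isDone); // corresponds to "movzbl ...; test %r10d,%r10d; je"
        return iterations;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long iterations = measure(100);
        long elapsed = System.nanoTime() - start;
        System.out.printf(Locale.ROOT, "%d iterations, %.3f ns/op%n",
                iterations, (double) elapsed / iterations);
    }
}
```

The safepoint poll does not appear anywhere in this source: the JIT compiler inserts it into the loop on its own, which is why it only shows up in the generated code.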

See that "global safepoint poll"? When a safepoint is needed, the JVM arms the "polling page", so that any attempt to read that page triggers a segmentation fault (SEGV). When the SEGV finally fires from this safepoint poll, control is passed to any existing SEGV handlers first, and the JVM has one ready! See, for example, how JVM_handle_linux_signal does it.

The goal of all those tricks is to make the safepoint polls as cheap as possible, because they need to happen in many places, and they almost never fire. For this reason, test %eax, (addr) is used: it has no effect when the safepoint poll is not triggered. It also has a very compact encoding, "only" 6 bytes on x86_64. The polling page address is fixed for a given JVM process, so the code generated by the JIT in that process can use RIP-relative addressing: the instruction says that the page is at a given offset from the current instruction pointer, which saves the precious bytes that would otherwise be spent encoding the absolute 8-byte address.
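To see where the 6 bytes come from, here is a small sketch that assembles the poll instruction by hand. It assumes the standard x86_64 encoding of `test r/m32, r32` (opcode `0x85`, then a ModRM byte, then a 32-bit displacement) and uses the displacement `0x1765654c` from the 8u191 listing above. The class and method names are made up for illustration; this is not HotSpot code.

```java
// Illustrative encoder for the RIP-relative poll instruction; names are hypothetical.
class PollEncodingSketch {
    // test %eax, disp32(%rip):
    //   0x85  opcode for TEST r/m32, r32
    //   0x05  ModRM: mod=00, reg=000 (%eax), rm=101 (RIP + disp32 in 64-bit mode)
    //   4-byte little-endian displacement
    static byte[] ripRelativePoll(int disp32) {
        return new byte[] {
            (byte) 0x85,
            (byte) 0x05,
            (byte) disp32,
            (byte) (disp32 >>> 8),
            (byte) (disp32 >>> 16),
            (byte) (disp32 >>> 24)
        };
    }

    // Formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] poll = ripRelativePoll(0x1765654c);
        // prints "85 05 4c 65 65 17 (6 bytes)"
        System.out.println(hex(poll) + " (" + poll.length + " bytes)");
    }
}
```

This matches the listing, where the poll sits at offset ...ae and the next instruction starts at ...b4, six bytes later.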

There is also normally a single polling page that handles all threads at once, so the generated code does not have to disambiguate which thread is currently running. But what if the VM wants to stop individual threads? That is the question answered by JEP 312: "Thread-Local Handshakes". It gives the VM the capability to trigger the handshake poll for an individual thread. This is currently implemented by assigning an individual polling page to each thread, with the poll instruction reading that page address from thread-local storage.
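The difference between one global poll and per-thread polls can be modelled in plain Java. The sketch below is a toy model only: it replaces the polling page with a per-worker `AtomicReference` slot that each worker checks in its loop, and every name in it (`HandshakeModel`, `Worker`, `handshake`) is hypothetical. It does not reflect how HotSpot implements handshakes internally.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of per-thread polling; it mimics the idea, not HotSpot's page-based mechanism.
class HandshakeModel {
    static final class Worker extends Thread {
        // Models the thread-local poll: each worker reads only its own slot.
        private final AtomicReference<Runnable> poll = new AtomicReference<>();
        private volatile boolean stop;
        private volatile int handshakes;

        Worker(String name) {
            super(name);
            setDaemon(true);
        }

        @Override
        public void run() {
            while (!stop) {
                Runnable op = poll.get();  // the "poll"
                if (op != null) {
                    handshakes++;          // single writer, so the increment is safe
                    op.run();              // runs on the target thread itself
                    poll.set(null);        // disarm this worker's poll
                }
            }
        }

        int handshakes() {
            return handshakes;
        }

        void shutdown() {
            stop = true;
        }
    }

    // Arms the poll of one worker only and waits until that worker has run the operation.
    static void handshake(Worker target, Runnable op) {
        CountDownLatch done = new CountDownLatch(1);
        Runnable wrapped = () -> {
            op.run();
            done.countDown();
        };
        while (!target.poll.compareAndSet(null, wrapped)) {
            Thread.yield(); // a previous request is still being serviced
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Handshakes with worker-0 three times and never touches worker-1.
    static int[] demo() {
        Worker a = new Worker("worker-0");
        Worker b = new Worker("worker-1");
        a.start();
        b.start();
        for (int i = 0; i < 3; i++) {
            handshake(a, () -> { });
        }
        int[] counts = { a.handshakes(), b.handshakes() };
        a.shutdown();
        b.shutdown();
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = demo();
        // prints "worker-0: 3, worker-1: 0"
        System.out.println("worker-0: " + counts[0] + ", worker-1: " + counts[1]);
    }
}
```

With a single shared slot, every worker would observe the request at its next poll. With one slot per worker, the requester chooses whom to disturb, at the price of each worker first having to find its own slot.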

This is out-of-the-box OpenJDK 11.0.1:

      0.31%  ↗  ...70: movzbl 0x94(%r9),%r10d    ; load "isDone" field
      0.19%  │  ...78: mov    0x108(%r15),%r11   ; reading the thread-local poll page addr
     25.62%  │  ...7f: add    $0x1,%rbp          ; iterations++;
     35.10%  │  ...83: test   %eax,(%r11)        ; thread-local handshake poll
     34.91%  │  ...86: test   %r10d,%r10d        ; if !isDone, do the cycle again
             ╰  ...89: je     ...70

This is purely a runtime consideration, so it can be disabled with -XX:-ThreadLocalHandshakes, and the generated code would then be the same as in 8u191. This explains why this benchmark performs differently on 8 and 11 (let us run it under -prof perfnorm right away):

    Benchmark                              Mode  Cnt  Score    Error  Units

    # 8u191
    EmptyBench.test                        avgt   15  0.383 ±  0.007  ns/op
    EmptyBench.test:CPI                    avgt    3  0.203 ±  0.014   #/op
    EmptyBench.test:L1-dcache-load-misses  avgt    3  ≈ 10⁻⁴           #/op
    EmptyBench.test:L1-dcache-loads        avgt    3  2.009 ±  0.291   #/op
    EmptyBench.test:cycles                 avgt    3  1.021 ±  0.193   #/op
    EmptyBench.test:instructions           avgt    3  5.024 ±  0.229   #/op

    # 11.0.1
    EmptyBench.test                        avgt   15  0.590 ±  0.023  ns/op  ; +0.2 ns
    EmptyBench.test:CPI                    avgt    3  0.260 ±  0.173   #/op
    EmptyBench.test:L1-dcache-loads        avgt    3  3.015 ±  0.120   #/op  ; +1 load
    EmptyBench.test:L1-dcache-load-misses  avgt    3  ≈ 10⁻⁴           #/op
    EmptyBench.test:cycles                 avgt    3  1.570 ±  0.248   #/op  ; +0.5 cycles
    EmptyBench.test:instructions           avgt    3  6.032 ±  0.197   #/op  ; +1 instruction

    # 11.0.1, -XX:-ThreadLocalHandshakes
    EmptyBench.test                        avgt   15  0.385 ±  0.007  ns/op
    EmptyBench.test:CPI                    avgt    3  0.205 ±  0.027   #/op
    EmptyBench.test:L1-dcache-loads        avgt    3  2.012 ±  0.122   #/op
    EmptyBench.test:L1-dcache-load-misses  avgt    3  ≈ 10⁻⁴           #/op
    EmptyBench.test:cycles                 avgt    3  1.030 ±  0.079   #/op
    EmptyBench.test:instructions           avgt    3  5.031 ±  0.299   #/op
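As a sanity check on these numbers, the sketch below recomputes the derived quantities from the reported scores: CPI as cycles divided by instructions, and the implied clock rate as cycles per operation divided by nanoseconds per operation. The input values are copied from the table; the class and method names are hypothetical, and the error margins are ignored for simplicity.

```java
import java.util.Locale;

// Recomputes derived metrics from the perfnorm scores above; names are illustrative.
class PerfnormDeltas {
    // Cycles per instruction.
    static double cpi(double cycles, double instructions) {
        return cycles / instructions;
    }

    // Implied clock rate in GHz: (cycles/op) / (ns/op) = cycles/ns.
    static double ghz(double cyclesPerOp, double nsPerOp) {
        return cyclesPerOp / nsPerOp;
    }

    public static void main(String[] args) {
        // Scores copied from the table: 8u191 vs 11.0.1 with default settings.
        double ns8 = 0.383, cycles8 = 1.021, insns8 = 5.024, loads8 = 2.009;
        double ns11 = 0.590, cycles11 = 1.570, insns11 = 6.032, loads11 = 3.015;

        System.out.printf(Locale.ROOT, "CPI:   8u191 = %.3f, 11.0.1 = %.3f%n",
                cpi(cycles8, insns8), cpi(cycles11, insns11));
        System.out.printf(Locale.ROOT, "Clock: 8u191 = %.2f GHz, 11.0.1 = %.2f GHz%n",
                ghz(cycles8, ns8), ghz(cycles11, ns11));
        System.out.printf(Locale.ROOT,
                "Delta: %+.3f ns, %+.3f cycles, %+.3f instructions, %+.3f loads per op%n",
                ns11 - ns8, cycles11 - cycles8, insns11 - insns8, loads11 - loads8);
    }
}
```

Both runs imply roughly the same clock rate, so the extra time on 11.0.1 is consistent with the one additional load instruction per iteration rather than with a frequency difference between the runs.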