Consider this simple JMH benchmark. We construct it in a very special way (assume, for simplicity, that Java has pre-processing capabilities):

import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;

@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class FPUSpills {

    int s00, s01, s02, s03, s04, s05, s06, s07, s08, s09;
    int s10, s11, s12, s13, s14, s15, s16, s17, s18, s19;
    int s20, s21, s22, s23, s24;

    int d00, d01, d02, d03, d04, d05, d06, d07, d08, d09;
    int d10, d11, d12, d13, d14, d15, d16, d17, d18, d19;
    int d20, d21, d22, d23, d24;

    int sg;
    volatile int vsg;

    int dg;

    @Benchmark
#ifdef ORDERED
    public void ordered() {
#else
    public void unordered() {
#endif
        int v00 = s00; int v01 = s01; int v02 = s02; int v03 = s03; int v04 = s04;
        int v05 = s05; int v06 = s06; int v07 = s07; int v08 = s08; int v09 = s09;
        int v10 = s10; int v11 = s11; int v12 = s12; int v13 = s13; int v14 = s14;
        int v15 = s15; int v16 = s16; int v17 = s17; int v18 = s18; int v19 = s19;
        int v20 = s20; int v21 = s21; int v22 = s22; int v23 = s23; int v24 = s24;

#ifdef ORDERED
        dg = vsg; // Confuse optimizer a little
#else
        dg = sg;  // Just a plain store...
#endif

        d00 = v00; d01 = v01; d02 = v02; d03 = v03; d04 = v04;
        d05 = v05; d06 = v06; d07 = v07; d08 = v08; d09 = v09;
        d10 = v10; d11 = v11; d12 = v12; d13 = v13; d14 = v14;
        d15 = v15; d16 = v16; d17 = v17; d18 = v18; d19 = v19;
        d20 = v20; d21 = v21; d22 = v22; d23 = v23; d24 = v24;
    }
}
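Since real Java has no preprocessor, the same experiment boils down to two separate benchmark methods that differ in a single statement. Here is a minimal, self-contained sketch with just three field pairs instead of 25; the class name FPUSpillsPlain is mine, and note that three pairs are far too few to exhaust the GPRs, so this only illustrates the structure:

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class FPUSpillsPlain {
    int s0, s1, s2;      // source fields
    int d0, d1, d2;      // destination fields
    int sg, dg;
    volatile int vsg;

    @Benchmark
    public void unordered() {
        int v0 = s0; int v1 = s1; int v2 = s2;
        dg = sg;         // just a plain store: the optimizer may interleave freely
        d0 = v0; d1 = v1; d2 = v2;
    }

    @Benchmark
    public void ordered() {
        int v0 = s0; int v1 = s1; int v2 = s2;
        dg = vsg;        // volatile load: confuses the optimizer enough to keep
                         // the loads before the stores
        d0 = v0; d1 = v1; d2 = v2;
    }
}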

The benchmark reads and writes multiple pairs of fields at once. Optimizers are not actually tied to the particular program order, and indeed, that is what we observe in the unordered test:

Benchmark                             Mode  Cnt   Score    Error  Units
FPUSpills.unordered                   avgt   15   6.961 ±  0.002  ns/op
FPUSpills.unordered:CPI               avgt    3   0.458 ±  0.024   #/op
FPUSpills.unordered:L1-dcache-loads   avgt    3  28.057 ±  0.730   #/op
FPUSpills.unordered:L1-dcache-stores  avgt    3  26.082 ±  1.235   #/op
FPUSpills.unordered:cycles            avgt    3  26.165 ±  1.575   #/op
FPUSpills.unordered:instructions      avgt    3  57.099 ±  0.971   #/op

This gives us around 26 load-store pairs, which corresponds roughly to the 25 pairs we have in the test. But we don't have 25 general-purpose registers! Perfasm reveals that the optimizer has merged the load-store pairs close to each other, so that register pressure is much lower:

0.38%    0.28%   ↗  movzbl 0x94(%rcx),%r9d
                 │  ...
0.25%    0.20%   │  mov    0xc(%r11),%r10d    ; getfield s00
0.04%    0.02%   │  mov    %r10d,0x70(%r8)    ; putfield d00
                 │  ...
                 │  ... (transfer repeats for multiple vars) ...
                 │  ...
                 ╰  je     BACK
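In source-level terms, the optimizer has effectively rewritten the method body into interleaved pairs. A hypothetical sketch of the equivalent shape (the JIT performs this on its intermediate representation, not on Java source):

// Hypothetical source-level equivalent of the merged schedule above:
// each value is stored right after it is loaded, so at most one
// temporary is live at any given time, however many pairs there are.
int v00 = s00; d00 = v00;
int v01 = s01; d01 = v01;
int v02 = s02; d02 = v02;
// ...and so on for the remaining pairs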

At this point, we want to cheat the optimizer a little and introduce a point of confusion, so that all loads are performed well before the stores. This is what the ordered test does, and there we can see the loads and stores happening in bulk: first all the loads, then all the stores. Register pressure peaks at the point where all the loads have completed but none of the stores have started yet. Even then, the difference against unordered is not drastic:

Benchmark                             Mode  Cnt   Score    Error  Units
FPUSpills.unordered                   avgt   15   6.961 ±  0.002  ns/op
FPUSpills.unordered:CPI               avgt    3   0.458 ±  0.024   #/op
FPUSpills.unordered:L1-dcache-loads   avgt    3  28.057 ±  0.730   #/op
FPUSpills.unordered:L1-dcache-stores  avgt    3  26.082 ±  1.235   #/op
FPUSpills.unordered:cycles            avgt    3  26.165 ±  1.575   #/op
FPUSpills.unordered:instructions      avgt    3  57.099 ±  0.971   #/op

FPUSpills.ordered                     avgt   15   7.961 ±  0.008  ns/op
FPUSpills.ordered:CPI                 avgt    3   0.329 ±  0.026   #/op
FPUSpills.ordered:L1-dcache-loads     avgt    3  29.070 ±  1.361   #/op
FPUSpills.ordered:L1-dcache-stores    avgt    3  26.131 ±  2.243   #/op
FPUSpills.ordered:cycles              avgt    3  30.065 ±  0.821   #/op
FPUSpills.ordered:instructions        avgt    3  91.449 ±  4.839   #/op

…and that is because we have managed to spill the operands to XMM registers, not onto the stack:

3.08%    3.79%   ↗  vmovq  %xmm0,%r11
                 │  ...
0.25%    0.20%   │  mov    0xc(%r11),%r10d    ; getfield s00
0.02%            │  vmovd  %r10d,%xmm4        ; <--- FPU SPILL
0.25%    0.20%   │  mov    0x10(%r11),%r10d   ; getfield s01
0.02%            │  vmovd  %r10d,%xmm5        ; <--- FPU SPILL
                 │  ...
                 │  ... (more reads and spills to XMM registers) ...
                 │  ...
0.12%    0.02%   │  mov    0x60(%r10),%r13d   ; getfield s21
                 │  ...
                 │  ... (more reads into registers) ...
                 │  ...
                 │  ------- READS ARE FINISHED, WRITES START ------
0.18%    0.16%   │  mov    %r13d,0xc4(%rdi)   ; putfield d21
                 │  ...
                 │  ... (more reads from registers and putfields) ...
                 │  ...
2.77%    3.10%   │  vmovd  %xmm5,%r11d        ; <--- FPU UNSPILL
0.02%            │  mov    %r11d,0x78(%rdi)   ; putfield d01
2.13%    2.34%   │  vmovd  %xmm4,%r11d        ; <--- FPU UNSPILL
0.02%            │  mov    %r11d,0x70(%rdi)   ; putfield d00
                 │  ...
                 │  ... (more unspills and putfields) ...
                 │  ...
                 ╰  je     BACK

Notice that we do use general-purpose registers (GPRs) for some operands, but when they are depleted, then we spill. That "then" is ill-defined: in the listing we appear to spill first and use GPRs afterwards, but this is a false appearance, because register allocators may operate over the complete graph rather than follow program order.

The latency of XMM spills seems minimal: even though we claim additional instructions for the spills, they execute very efficiently and fill the pipeline gaps. Compared to unordered, ordered executes 91.449 − 57.099 ≈ 34 additional instructions, or around 17 spill pairs, yet claims only 30.065 − 26.165 ≈ 4 additional cycles. Note that it would be incorrect to compute the CPI of these spills as 4/34 ≈ 0.11 clk/insn: that would imply a per-cycle throughput larger than current CPUs are capable of. But the improvement is real, because we are using execution slots that were idle before.

Claims of efficiency mean nothing if we have nothing to compare against. But here, we do! We can instruct HotSpot to avoid FPU spills with -XX:-UseFPUForSpilling, which gives us an idea of how much we win with XMM spills:

Benchmark                             Mode  Cnt   Score    Error  Units

# Default
FPUSpills.ordered                     avgt   15   7.961 ±  0.008  ns/op
FPUSpills.ordered:CPI                 avgt    3   0.329 ±  0.026   #/op
FPUSpills.ordered:L1-dcache-loads     avgt    3  29.070 ±  1.361   #/op
FPUSpills.ordered:L1-dcache-stores    avgt    3  26.131 ±  2.243   #/op
FPUSpills.ordered:cycles              avgt    3  30.065 ±  0.821   #/op
FPUSpills.ordered:instructions        avgt    3  91.449 ±  4.839   #/op

# -XX:-UseFPUForSpilling
FPUSpills.ordered                     avgt   15  10.976 ±  0.003  ns/op
FPUSpills.ordered:CPI                 avgt    3   0.455 ±  0.053   #/op
FPUSpills.ordered:L1-dcache-loads     avgt    3  47.327 ±  5.113   #/op
FPUSpills.ordered:L1-dcache-stores    avgt    3  41.078 ±  1.887   #/op
FPUSpills.ordered:cycles              avgt    3  41.553 ±  2.641   #/op
FPUSpills.ordered:instructions        avgt    3  91.264 ±  7.312   #/op

Oh, see the increased load/store counters per operation? These are stack spills: the stack itself, while fast, still resides in memory, so accesses to it land in the L1 cache. It is roughly the same 17 additional spill pairs, but now they take 41.553 − 30.065 ≈ 11 additional cycles. The throughput of the L1 cache is the limiting factor here.
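As a side note, the same A/B comparison can be scripted with the JMH runner API instead of passing JVM flags by hand. A minimal sketch, assuming the FPUSpills benchmark from above is on the classpath (the class name RunComparison is mine):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunComparison {
    public static void main(String[] args) throws RunnerException {
        // Baseline: default flags, XMM spills enabled.
        Options withFpuSpills = new OptionsBuilder()
                .include("FPUSpills.ordered")
                .build();
        new Runner(withFpuSpills).run();

        // Comparison: force stack spills by disabling FPU spills.
        Options withoutFpuSpills = new OptionsBuilder()
                .include("FPUSpills.ordered")
                .jvmArgsAppend("-XX:-UseFPUForSpilling")
                .build();
        new Runner(withoutFpuSpills).run();
    }
}

Both configurations then run back to back in the same session, with only the spill policy differing.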

Finally, we can eyeball the perfasm output for -XX:-UseFPUForSpilling:

2.45%    1.21%   ↗  mov    0x70(%rsp),%r11
                 │  ...
0.50%    0.31%   │  mov    0xc(%r11),%r10d    ; getfield s00
0.02%            │  mov    %r10d,0x10(%rsp)   ; <--- stack spill!
2.04%    1.29%   │  mov    0x10(%r11),%r10d   ; getfield s01
                 │  mov    %r10d,0x14(%rsp)   ; <--- stack spill!
                 │  ...
                 │  ... (more reads and spills to stack) ...
                 │  ...
0.12%    0.19%   │  mov    0x64(%r10),%ebp    ; getfield s22
                 │  ...
                 │  ... (more reads into registers) ...
                 │  ...
                 │  ------- READS ARE FINISHED, WRITES START ------
3.47%    4.45%   │  mov    %ebp,0xc8(%rdi)    ; putfield d22
                 │  ...
                 │  ... (more reads from registers and putfields) ...
                 │  ...
1.81%    2.68%   │  mov    0x14(%rsp),%r10d   ; <--- stack unspill
0.29%    0.13%   │  mov    %r10d,0x78(%rdi)   ; putfield d01
2.10%    2.12%   │  mov    0x10(%rsp),%r10d   ; <--- stack unspill
                 │  mov    %r10d,0x70(%rdi)   ; putfield d00
                 │  ...
                 │  ... (more unspills and putfields) ...
                 │  ...
                 ╰  je     BACK