Background:

While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.

To my surprise, removing the un-necessary instruction caused my program to slow down.

I found that adding arbitrary, useless MOV instructions increased performance even further.

The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.

I understand that the CPU does all kinds of optimizations and streamlining, but, this seems more like black magic.

The data:

A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).

The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):

avg time (ms) with -dJUNKOPS: 1822.84 ms avg time (ms) without: 1836.44 ms

The programs were run 25 times in a loop, with the run order changing randomly each time.

Excerpt:

{$asmmode intel} procedure example_junkop_in_sha256; var s1, t2 : uint32; begin // Here are parts of the SHA-256 algorithm, in Pascal: // s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22) // s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25) // Here is how I translated them (side by side to show symmetry): asm MOV r8d, a ; MOV r9d, e ROR r8d, 2 ; ROR r9d, 6 MOV r10d, r8d ; MOV r11d, r9d ROR r8d, 11 {13 total} ; ROR r9d, 5 {11 total} XOR r10d, r8d ; XOR r11d, r9d ROR r8d, 9 {22 total} ; ROR r9d, 14 {25 total} XOR r10d, r8d ; XOR r11d, r9d // Here is the extraneous operation that I removed, causing a speedup // s1 is the uint32 variable declared at the start of the Pascal code. // // I had cleaned up the code, so I no longer needed this variable, and // could just leave the value sitting in the r11d register until I needed // it again later. // // Since copying to RAM seemed like a waste, I removed the instruction, // only to discover that the code ran slower without it. {$IFDEF JUNKOPS} MOV s1, r11d {$ENDIF} // The next part of the code just moves on to another part of SHA-256, // maj { r12d } := (a and b) xor (a and c) xor (b and c) mov r8d, a mov r9d, b mov r13d, r9d // Set aside a copy of b and r9d, r8d mov r12d, c and r8d, r12d { a and c } xor r9d, r8d and r12d, r13d { c and b } xor r12d, r9d // Copying the calculated value to the same s1 variable is another speedup. // As far as I can tell, it doesn't actually matter what register is copied, // but moving this line up or down makes a huge difference. {$IFDEF JUNKOPS} MOV s1, r9d // after mov r12d, c {$ENDIF} // And here is where the two calculated values above are actually used: // T2 {r12d} := S0 {r10d} + Maj {r12d}; ADD r12d, r10d MOV T2, r12d end end;

Try it yourself:

The code is online at GitHub if you want to try it out yourself.

My questions: