Recently I've been doing some benchmarking and came upon a very surprising behavior from a number of different Intel i7 CPUs (it manifests on Sandy Bridge and Haswell desktop-class CPUs as well as Sandy Bridge-EP Xeon CPUs).

The benchmark is very simple and the result is... bizarre. Perhaps one of the readers of my blog knows what's going on here. Here's the C code for the benchmark (full code with a makefile is available in this Gist):

const unsigned N = 400 * 1000 * 1000 ; volatile unsigned long long counter = 0 ; // Don't inline the benchmarking code into main void __attribute__((noinline)) tightloop(); void __attribute__((noinline)) loop_with_extra_call(); void tightloop () { unsigned j; for (j = 0 ; j < N; ++j) { counter += j; } } void foo () { } void loop_with_extra_call () { unsigned j; for (j = 0 ; j < N; ++j) { __asm__( "call foo" ); counter += j; } }

We're benchmarking tightloop vs. loop_with_extra_call , which does exactly the same thing (increment a volatile counter) but has a dummy call to a do-nothing function in the middle. I don't think anyone has doubts about how this should behave, right? How much slower do you think the extra call will make this loop? Twice as slow? 10% slower?

Here's the driving main function:

int main ( int argc, char ** argv) { if (argc <= 1 ) { return 1 ; } if (argv[ 1 ][ 0 ] == 't' ) { tightloop(); } else if (argv[ 1 ][ 0 ] == 'c' ) { loop_with_extra_call(); } return 0 ; }

Building the code with gcc version 4.8 (same output code is produced by 4.6, as well as when replacing -O2 by -O3 ):

$ gcc -O2 loop-call-weirdness.c -o build/loop-call-weirdness

Now I'll run it on my Intel i7-4771 (Haswell) CPU. First run the version with tightloop :

$ perf stat -r 10 -e cycles,instructions build/loop-call-weirdness t Performance counter stats for 'build/loop-call-weirdness t' (10 runs): 2,659,506,002 cycles # 0.000 GHz ( +- 0.19% ) 2,401,144,539 instructions # 0.90 insns per cycle ( +- 0.00% ) 0.685642994 seconds time elapsed ( +- 0.24% )

... and with the extra call:

$ perf stat -r 10 -e cycles,instructions build/loop-call-weirdness c Performance counter stats for 'build/loop-call-weirdness c' (10 runs): 2,336,765,798 cycles # 0.000 GHz ( +- 0.34% ) 3,201,055,823 instructions # 1.37 insns per cycle ( +- 0.00% ) 0.602387097 seconds time elapsed ( +- 0.39% )

Yes, the extra call makes the code faster! You didn't expect that, did you.

Looking at the disassembly, the compiler is doing fine here, producing quite expected code:

0000000000400530 <tightloop>: 400530: xor %eax,%eax 400532: nopw 0x0(%rax,%rax,1) 400538: mov 0x200b01(%rip),%rdx # 601040 <counter> 40053f: add %rax,%rdx 400542: add $0x1,%rax 400546: cmp $0x17d78400,%rax 40054c: mov %rdx,0x200aed(%rip) # 601040 <counter> 400553: jne 400538 <tightloop+0x8> 400555: repz retq 400557: nopw 0x0(%rax,%rax,1) 0000000000400560 <foo>: 400560: repz retq 0000000000400570 <loop_with_extra_call>: 400570: xor %eax,%eax 400572: nopw 0x0(%rax,%rax,1) 400578: callq 400560 <foo> 40057d: mov 0x200abc(%rip),%rdx # 601040 <counter> 400584: add %rax,%rdx 400587: add $0x1,%rax 40058b: cmp $0x17d78400,%rax 400591: mov %rdx,0x200aa8(%rip) # 601040 <counter> 400598: jne 400578 <loop_with_extra_call+0x8> 40059a: repz retq 40059c: nopl 0x0(%rax)

Note that the volatile is key here, since it forces the compiler to produce a load and store from the global on each iteration. Without volatile , the benchmark behaves normally (the extra call makes it significantly slower).

It's easy to see that tightloop runs 6 instructions per iteration, which computes with the numbers reported by perf (400 million iterations, times 6 instructions, is 2.4 billion instructions). loop_with_extra_call adds two more instructions per iteration (the call to foo and the ret from it), and that also corresponds to the performance numbers.

That's right, even though the version with the extra call executes 33% more instructions, it manages to do it quicker.

Unfortunately, my fast Haswell CPU (or the Linux kernel coming with Ubuntu 13.10) doesn't support the whole range of perf stat counters, but running on an older CPU (where the anomaly also exists though the performance difference in smaller), I see that the tightloop benchmark has a lot of frontend and backend stalls (mostly frontend), for a total of 0.92 stalled cycles per instruction. The version with the extra call has just 0.25 stalled cycles per instruction.

So would it be right to assume that the tight loop stalls on loading from counter because the rest of the instructions in the loop depend on its value? So how does the call and ret help here? By providing non-data-dependent instructions that can be run in parallel while the others are stalled? Still, whatever that is, I find this result astonishing.

Let me know if you have any insights.