TL;DR

Reverse loops are not faster, and using the bytecode as an indication of performance is a really bad idea. Benchmark!

Intro

On the 13th of June, @TravCav published an article on Medium arguing that reverse loops are faster than regular loops; that

```java
for (int i = 0; i < 10; i++) {
    System.out.println(i);
}
```

was slower than

```java
for (int i = 10; i >= 0; i--) {
    System.out.println(i);
}
```

The code is from @TravCav's post; the loop should start at 9, not 10, as pointed out on Reddit.
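For completeness, here is the reverse loop with the corrected bounds (the class and the counting helper are mine, added only so the iteration count can be checked): starting at 9 covers the same ten values as the forward loop.

```java
public class ReverseLoop {
    // Reverse loop with corrected bounds: starting at 10 would print
    // eleven values (10 down to 0); starting at 9 matches 0..9.
    static int printCountdown() {
        int count = 0;
        for (int i = 9; i >= 0; i--) {
            System.out.println(i);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(printCountdown()); // 10 iterations, same as the forward loop
    }
}
```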

His analysis is based on looking at the generated bytecode, but unfortunately he does not provide any benchmark to confirm his hypothesis.

This is not the first time I have heard that idea; it generally comes from the belief that a comparison to 0 is faster. That might be true, or might have been true at some point, who knows.

Is it true now? Is it even relevant?

Most CPUs can execute multiple instructions in parallel; the main cost of a jump is when the branch prediction is wrong. Even if a cmp with 0 were faster, the author assumes that the bytecode reflects what the CPU will execute. Big news: it does not. The JIT does a lot of transformations, and making assumptions about what gets executed from the bytecode is ludicrous. For example:

After looping back to line 17, the array object has to get loaded back onto the stack at line 18, and the length has to be retrieved again on line 19. And it does this for every loop. Over and over and over…

No it does not: the JIT is smart enough to figure out that the length does not vary, so it will load it only once. It will also remove the bounds check, for that matter.
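To illustrate, here is a sketch (class and method names are mine, and this is conceptual, not literal JIT output) of what the loop effectively becomes once the length load is hoisted:

```java
public class HoistSketch {
    // Conceptually, the JIT turns `for (int i = 0; i < numbers.length; i++)`
    // into this: the length is read once into a local/register, and the
    // per-element bounds check disappears from the loop body.
    static long sum(int[] numbers) {
        long l = 0;
        int len = numbers.length; // loaded once, not on every iteration
        for (int i = 0; i < len; i++) {
            l += numbers[i];      // no redundant bounds check either
        }
        return l;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3})); // prints 6
    }
}
```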

The JMH benchmark

Anyway, let's get some benchmarks in there. The JMH code is as follows:

```java
@State(Scope.Benchmark)
public class LoopBenchmark {

    private int[] numbers;

    @Param(value = {"1", "10", "1000", "1000000"})
    private int size;

    @Setup
    public void setUp() {
        numbers = new int[size];
        Random r = new Random();
        for (int i = 0; i < size; i++) {
            numbers[i] = r.nextInt();
        }
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long testForwardLoop() {
        long l = 0;
        for (int i = 0; i < numbers.length; i++) {
            l += numbers[i];
        }
        return l;
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long testBackwardLoop() {
        long l = 0;
        for (int i = numbers.length - 1; i >= 0; i--) {
            l += numbers[i];
        }
        return l;
    }
}
```

We initialise an array of ints with random numbers. Each benchmark sums the array: one with a regular (forward) loop, the other with a reverse (backward) loop.

Compile, run

```
mvn clean install
java -jar target/benchmarks.jar
```

The results

And here are the results (the higher the better):

| Size | 1 | 10 | 1000 | 1000000 |
| --- | --- | --- | --- | --- |
| Backward | 240665052 | 146681879 | 3686316 | 3761 |
| Forward | 249774234 | 154017527 | 3722753 | 3763 |
| (F - B)/F | 3.65% | 4.76% | 0.98% | 0.06% |

The reverse loop is 3 to 4% slower on small arrays; the difference shrinks as the size of the array increases.
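The (F - B)/F row can be reproduced from the throughput numbers above (class and method names are mine):

```java
public class RelativeDiff {
    // Relative throughput difference between forward (F) and backward (B).
    static double diffPercent(long forward, long backward) {
        return 100.0 * (forward - backward) / forward;
    }

    public static void main(String[] args) {
        // Numbers from the size=1 and size=10 columns of the results table.
        System.out.printf("size=1:  %.2f%%%n", diffPercent(249774234L, 240665052L)); // ~3.65%
        System.out.printf("size=10: %.2f%%%n", diffPercent(154017527L, 146681879L)); // ~4.76%
    }
}
```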

Contrary to the original post's claim, the regular loop is faster.

the asm

Let's have a deeper look at what happens with an array of size 10, starting with the asm generated by the JIT for the forward loop:

```
java -jar target/benchmarks.jar -f 1 -p size=10 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintAssembly" testForwardLoop
```

You will need the hsdis library. Load the log through JITWatch.

forward asm

We can see the core loop from line 63 to line 69:

```
L0002: movslq 0x10(%r11,%r8,4),%r10
0x00007f7cad1df4d9: add %r10,%rax   ;*ladd
                                    ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@21 (line 64)
0x00007f7cad1df4dc: inc %r8d        ;*iinc
                                    ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@23 (line 63)
0x00007f7cad1df4df: cmp %ebx,%r8d
0x00007f7cad1df4e2: jl L0002        ;*if_icmpge
                                    ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@10 (line 63)
```

The exit of the method, line 71 to line 75:

```
L0003: add $0x10,%rsp
0x00007f7cad1df4e8: pop %rbp
0x00007f7cad1df4e9: test %eax,0x17b81b11(%rip)  # 0x00007f7cc4d61000
                                    ; {poll_return} *** SAFEPOINT POLL ***
0x00007f7cad1df4ef: retq
```

But what is happening from line 44 to line 58?

```
L0000: movslq 0x10(%r11,%r8,4),%r9
0x00007f7cad1df4a5: add %rax,%r9
0x00007f7cad1df4a8: movslq %r8d,%rcx
0x00007f7cad1df4ab: movslq 0x1c(%r11,%rcx,4),%rax
0x00007f7cad1df4b0: movslq 0x14(%r11,%rcx,4),%rdi
0x00007f7cad1df4b5: movslq 0x18(%r11,%rcx,4),%rcx
0x00007f7cad1df4ba: add %r9,%rdi
0x00007f7cad1df4bd: add %rdi,%rcx
0x00007f7cad1df4c0: add %rcx,%rax   ;*ladd
                                    ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@21 (line 64)
0x00007f7cad1df4c3: add $0x4,%r8d   ;*iinc
                                    ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@23 (line 63)
0x00007f7cad1df4c7: cmp %r10d,%r8d
0x00007f7cad1df4ca: jl L0000        ;*if_icmpge
                                    ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@10 (line 63)
```

It's looping over 4 elements at a time, adding them together in one pass. The loop has been unrolled; when fewer than 4 elements remain, it falls back on the one-by-one loop on line 63.
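In Java terms, the unrolled code roughly corresponds to this sketch (names are mine; the real machine code keeps everything in registers, but the shape is the same):

```java
public class UnrollSketch {
    // 4-way unrolled sum, roughly what the JIT generates: a main loop handling
    // 4 elements per iteration, then a one-by-one loop for the remainder.
    static long sum(int[] a) {
        long l = 0;
        int i = 0;
        for (; i < a.length - 3; i += 4) {
            l += (long) a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        }
        for (; i < a.length; i++) { // at most 3 leftover elements
            l += a[i];
        }
        return l;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10})); // prints 55
    }
}
```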

backward asm

```
java -jar target/benchmarks.jar -f 1 -p size=10 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintAssembly" testBackwardLoop
```

And the backward asm.

We can see the core loop from line 68 to line 74:

```
L0004: movslq 0x10(%r10,%rcx,4),%r11
0x00007fa34c559b15: add %r11,%rax   ;*ladd
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
0x00007fa34c559b18: dec %ecx        ;*iinc
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
0x00007fa34c559b1a: cmp $0xffffffff,%ecx
0x00007fa34c559b1d: jg L0004        ;*iflt
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
```

Very similar to the forward version, except for the cmp, and a dec instead of an inc.

The exit of the method, line 66 to line 80:

```
L0005: add $0x20,%rsp
0x00007fa34c559b23: pop %rbp
0x00007fa34c559b24: test %eax,0x15e7a4d6(%rip)  # 0x00007fa3623d4000
                                    ; {poll_return} *** SAFEPOINT POLL ***
0x00007fa34c559b2a: retq
```

And from line 49 to line 64?

```
L0001: mov %rax,%r11                ;*lload_1
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
L0002: movslq 0x10(%r10,%rcx,4),%r8
0x00007fa34c559ae8: movslq 0xc(%r10,%rcx,4),%r9
0x00007fa34c559aed: movslq 0x8(%r10,%rcx,4),%rbx
0x00007fa34c559af2: movslq 0x4(%r10,%rcx,4),%rax
0x00007fa34c559af7: add %r11,%r8
0x00007fa34c559afa: add %r8,%r9
0x00007fa34c559afd: add %r9,%rbx
0x00007fa34c559b00: add %rbx,%rax   ;*ladd
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
0x00007fa34c559b03: add $0xfffffffc,%ecx  ;*iinc
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
0x00007fa34c559b06: cmp $0x2,%ecx
0x00007fa34c559b09: jg L0001        ;*iflt
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
```

Same as for forward: adding 4 elements at a time.

But the logic to get there is slightly more complicated in the backward loop.

For forward, line 22 to line 43, after getting the array length:

```
0x00007f7cad1df458: test %ebx,%ebx
0x00007f7cad1df45a: jle L0004       ;*if_icmpge
                                    ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@10 (line 63)
0x00007f7cad1df460: test %ebx,%ebx
0x00007f7cad1df462: jbe L0005
0x00007f7cad1df468: mov %ebx,%r9d
0x00007f7cad1df46b: dec %r9d
0x00007f7cad1df46e: cmp %ebx,%r9d
0x00007f7cad1df471: jae L0005       ;*lload_1
                                    ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@13 (line 64)
0x00007f7cad1df477: movslq 0x10(%r12,%r11,8),%rax  ;*i2l
                                    ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@20 (line 64)
0x00007f7cad1df47c: mov %ebx,%r10d
0x00007f7cad1df47f: add $0xfffffffd,%r10d
0x00007f7cad1df483: shl $0x3,%r11
0x00007f7cad1df487: mov $0x80000000,%r8d
0x00007f7cad1df48d: cmp %r10d,%r9d
0x00007f7cad1df490: cmovl %r8d,%r10d
0x00007f7cad1df494: mov $0x1,%r8d
0x00007f7cad1df49a: cmp $0x1,%r10d
0x00007f7cad1df49e: jle L0001       ;*lload_1
                                    ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@13 (line 64)
```

For backward, line 22 to line 49:

```
0x00007fa34c559a98: mov %r8d,%ecx
0x00007fa34c559a9b: dec %ecx        ;*isub
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@8 (line 73)
0x00007fa34c559a9d: xor %r11d,%r11d
0x00007fa34c559aa0: test %ecx,%ecx
0x00007fa34c559aa2: jl L0006        ;*iflt
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
0x00007fa34c559aa8: test %r8d,%r8d
0x00007fa34c559aab: jbe L0007
0x00007fa34c559ab1: cmp %r8d,%ecx
0x00007fa34c559ab4: jae L0007
0x00007fa34c559ab6: add $0xfffffffe,%r8d
0x00007fa34c559aba: shl $0x3,%r10   ;*lload_1
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
L0000: movslq 0x10(%r10,%rcx,4),%r9
0x00007fa34c559ac3: add %r9,%r11    ;*ladd
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
0x00007fa34c559ac6: dec %ecx        ;*iinc
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
0x00007fa34c559ac8: cmp %r8d,%ecx
0x00007fa34c559acb: jg L0000        ;*iflt
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@11 (line 73)
0x00007fa34c559acd: cmp $0x2,%ecx
0x00007fa34c559ad0: jle L0008
0x00007fa34c559ad2: jmp L0002
0x00007fa34c559ad4: nopl 0x0(%rax,%rax,1)
0x00007fa34c559adc: data16 data16 xchg %ax,%ax
L0001: mov %rax,%r11                ;*lload_1
                                    ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
```

perf analysis

Could that be the source of the performance difference?

```
java -jar target/benchmarks.jar -f 1 -p size=10 -prof perf testForwardLoop
java -jar target/benchmarks.jar -f 1 -p size=10 -prof perf testBackwardLoop
```

Looking at the result of perfasm for forward and backward, it does not look like it.

There is, though, some percentage difference in the unrolled part of the loop.

For forward:

```
 3.71%  3.16%  ││││↗  0x00007fa8d0bcd920: movslq 0x10(%r11,%r8,4),%r9
 1.68%  1.46%  │││││  0x00007fa8d0bcd925: add %rax,%r9
 0.43%  0.38%  │││││  0x00007fa8d0bcd928: movslq %r8d,%rcx
 2.12%  2.11%  │││││  0x00007fa8d0bcd92b: movslq 0x1c(%r11,%rcx,4),%rax
 3.52%  3.02%  │││││  0x00007fa8d0bcd930: movslq 0x14(%r11,%rcx,4),%rdi
 1.74%  1.14%  │││││  0x00007fa8d0bcd935: movslq 0x18(%r11,%rcx,4),%rcx
 0.34%  0.52%  │││││  0x00007fa8d0bcd93a: add %r9,%rdi
 2.26%  1.86%  │││││  0x00007fa8d0bcd93d: add %rdi,%rcx
 3.85%  3.53%  │││││  0x00007fa8d0bcd940: add %rcx,%rax   ;*ladd
               │││││                      ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@21 (line 64)
 1.99%  1.70%  │││││  0x00007fa8d0bcd943: add $0x4,%r8d   ;*iinc
               │││││                      ; - io.github.arnaudroger.LoopBenchmark::testForwardLoop@23 (line 63)
 0.21%  0.32%  │││││  0x00007fa8d0bcd947: cmp %r10d,%r8d
               ││││╰  0x00007fa8d0bcd94a: jl 0x00007fa8d0bcd920  ;*if_icmpge
```

For backward:

```
 0.11%         │││ │↗  0x00007fbab51f4be0: mov %rax,%r11  ;*lload_1
               │││ ││                      ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@14 (line 74)
 3.41%  3.05%  │││ ↘│  0x00007fbab51f4be3: movslq 0x10(%r10,%rcx,4),%r8
 4.46%  4.11%  │││ │   0x00007fbab51f4be8: movslq 0xc(%r10,%rcx,4),%r9
 0.38%  0.43%  │││ │   0x00007fbab51f4bed: movslq 0x8(%r10,%rcx,4),%rbx
 0.90%  0.96%  │││ │   0x00007fbab51f4bf2: movslq 0x4(%r10,%rcx,4),%rax
 2.13%  2.54%  │││ │   0x00007fbab51f4bf7: add %r11,%r8
 5.53%  5.61%  │││ │   0x00007fbab51f4bfa: add %r8,%r9
 3.49%  4.52%  │││ │   0x00007fbab51f4bfd: add %r9,%rbx
 7.58%  7.83%  │││ │   0x00007fbab51f4c00: add %rbx,%rax   ;*ladd
               │││ │                       ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@22 (line 74)
 7.76%  8.00%  │││ │   0x00007fbab51f4c03: add $0xfffffffc,%ecx  ;*iinc
               │││ │                       ; - io.github.arnaudroger.LoopBenchmark::testBackwardLoop@24 (line 73)
 0.03%  0.01%  │││ │   0x00007fbab51f4c06: cmp $0x2,%ecx
```

Look at those 7% entries in the backward results.

why?

It's unlikely that the adds themselves cost that much; they have either been stalled or invalidated following a branch misprediction.

Hard to guess by looking at the perf numbers for forward and backward: forward has slightly fewer branch prediction misses. Is that enough to explain the difference? To be honest, I do not really know.

Time to give up on that one. Is the branching more complicated? Is going in reverse not playing nice with the CPU? Something I have missed?

Conclusion

Never believe a performance assertion without a good rationale.

Never believe a rationale without a reproducible experiment that supports it.

Never believe a benchmark without looking in depth at what is actually benchmarked.

Do not deduce performance implications from looking at the bytecode.

And no, reverse loops are not faster than regular loops (caveat: the result is only applicable to a loop that adds numbers from an array; any generalisation of the results is at your own risk :)).

If you have an idea as to what explains the difference, please leave a comment. Thanks.