Challenge your performance intuition with C++ operators

I made a mistake one day. More of a typo really. I used “bitwise and” instead of “logical and” in the code. A colleague of mine noticed that during the review and said something like this. “You shouldn't use bitwise operators instead of logical ones, the latter do short-circuiting.”

Yes. Both parts of that sentence are valid. You shouldn't use bitwise operators instead of logical ones. Logical operators in C++ perform short-circuit evaluation. That's true. But the invisible implication between the two parts is wrong.

Short-circuiting may be or may be not beneficial for the performance. It may be or may be not performed on bitwise operators too. And sometimes you're better off not doing logic the most logical way to begin with.

I want to propose a game. Let's measure things and see if we can predict the measurements before they come. We did a similar thing last year, and that was fun, so why not?

So here is the benchmark example.

#include <chrono> #include <iostream> #include <random> #include <array> int main() { using TheType = int; constexpr auto TheSize = 16 * 10000000; std::mt19937 rng(0); std::uniform_int_distribution<TheType> distribution(0, 1); std::vector<TheType> xs(TheSize); for (auto& digit : xs) { digit = distribution(rng); } volatile auto four_1_in_a_row = 0u; auto start = std::chrono::system_clock::now(); for (auto i = 0u; i < TheSize - 3; ++i) if(xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1) ++four_1_in_a_row; auto end = std::chrono::system_clock::now(); std::cout << "time: " << (end-start).count() * 1e-9 << " 1111s: " << four_1_in_a_row << "

"; }

I've made several variants of this code, compiled it with clang 3.8.0-2ubuntu4 -std=c++14 -O2, and measured their run time on Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz.

Using your intuition and best judgment, please estimate their relative performance. Please use the slider below the code samples.

Round 1. && vs &

The main question. Is && faster than &?

for (auto i = 0u; i < TheSize - 3; ++i) if(xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1) ++four_1_in_a_row; for (auto i = 0u; i < TheSize - 3; ++i) if(xs[i] == 1 & xs[i+1] == 1 & xs[i+2] == 1 & xs[i+3] == 1) ++four_1_in_a_row;

Reveal the truth

They are almost the same. In fact, they generate almost the same code. movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 cmpl $1, -12(%r12,%rbx,4) jne .LBB0_11 # BB#7: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -8(%r12,%rbx,4) jne .LBB0_11 # BB#8: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -4(%r12,%rbx,4) jne .LBB0_11 # BB#9: # in Loop: Header=BB0_6 Depth=1 cmpl $1, (%r12,%rbx,4) jne .LBB0_11 # BB#10: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .align 16, 0x90 .LBB0_11: # %._crit_edge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 cmpl $1, -12(%r12,%rbx,4) jne .LBB0_10 # BB#7: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -8(%r12,%rbx,4) jne .LBB0_10 # BB#8: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -4(%r12,%rbx,4) jne .LBB0_10 # BB#9: # in Loop: Header=BB0_6 Depth=1 cmpl $1, (%r12,%rbx,4) jne .LBB0_10 # BB#20: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .LBB0_10: # %.backedge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 The trick is in the context. Since == generates only 0 or 1, the short-circuiting for 0 and & is still possible. 0 & anything is still 0. Although, the same trick doesn't work with | anymore.

Round 2. & vs *

“Logical and” can also be substituted with a multiplication operator. It has the same truth table on 0 and 1.

for (auto i = 0u; i < TheSize - 3; ++i) if(xs[i] == 1 & xs[i+1] == 1 & xs[i+2] == 1 & xs[i+3] == 1) ++four_1_in_a_row; for (auto i = 0u; i < TheSize - 3; ++i) if((xs[i] == 1) * (xs[i+1] == 1) * (xs[i+2] == 1) * (xs[i+3] == 1)) ++four_1_in_a_row;

Reveal the truth

Surprisingly, they perform almost the same while their code is completely different. movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 cmpl $1, -12(%r12,%rbx,4) jne .LBB0_10 # BB#7: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -8(%r12,%rbx,4) jne .LBB0_10 # BB#8: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -4(%r12,%rbx,4) jne .LBB0_10 # BB#9: # in Loop: Header=BB0_6 Depth=1 cmpl $1, (%r12,%rbx,4) jne .LBB0_10 # BB#20: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .LBB0_10: # %.backedge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 movq %rax, %r14 movl $-1, %eax .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 cmpl $1, -4(%r12,%rbx,4) sete %cl cmpl $1, -8(%r12,%rbx,4) movl $0, %edx cmovel %eax, %edx cmpl $1, -12(%r12,%rbx,4) movzbl %cl, %ecx je .LBB0_8 # BB#7: # %select.false # in Loop: Header=BB0_6 Depth=1 xorl %edx, %edx .LBB0_8: # %select.end # in Loop: Header=BB0_6 Depth=1 andl %edx, %ecx cmpl $1, (%r12,%rbx,4) sete %dl movzbl %dl, %edx negl %ecx testl %ecx, %edx je .LBB0_9 # BB#19: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .LBB0_9: # %.backedge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 There are 4 comparisons and the rest is just about turning flags into registers. I think we can get rid of that.

Round 3. ==, && vs *, +, -

In this simple example, we can substitute almost all the logic with computation. Instead of comparing each value separately we can calculate the squared error for the whole 4-tuple.

for (auto i = 0u; i < TheSize - 3; ++i) if(xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1) ++four_1_in_a_row; ... inline int sq(int x) { return x*x; } ... for (auto i = 0u; i < TheSize - 3; ++i) if(sq(xs[i] - 1) + sq(xs[i+1] - 1) + sq(xs[i+2] - 1) + sq(xs[i+3] - 1) == 0) ++four_1_in_a_row;

Reveal the truth

And this works perfectly. The code is a few lines longer, but there are no comparisons until the very end. Everything is blunt number crunching. Computers love this kind of code. movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 cmpl $1, -12(%r12,%rbx,4) jne .LBB0_11 # BB#7: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -8(%r12,%rbx,4) jne .LBB0_11 # BB#8: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -4(%r12,%rbx,4) jne .LBB0_11 # BB#9: # in Loop: Header=BB0_6 Depth=1 cmpl $1, (%r12,%rbx,4) jne .LBB0_11 # BB#10: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .align 16, 0x90 .LBB0_11: # %._crit_edge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 movl -12(%r12,%rbx,4), %eax addl $-1, %eax imull %eax, %eax movl -8(%r12,%rbx,4), %ecx addl $-1, %ecx imull %ecx, %ecx addl %eax, %ecx movl -4(%r12,%rbx,4), %eax addl $-1, %eax imull %eax, %eax addl %ecx, %eax movl (%r12,%rbx,4), %ecx movl $1, %edx subl %ecx, %edx addl $-1, %ecx imull %edx, %ecx cmpl %ecx, %eax jne .LBB0_7 # BB#17: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .LBB0_7: # %.backedge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6

Round 4. * vs abs

We don't really need multiplications. We don't really need squared error. Why don't we calculate an absolute error instead?

... inline int sq(int x) { return x*x; } ... for (auto i = 0u; i < TheSize - 3; ++i) if(sq(xs[i] - 1) + sq(xs[i+1] - 1) + sq(xs[i+2] - 1) + sq(xs[i+3] - 1) == 0) ++four_1_in_a_row; for (auto i = 0u; i < TheSize - 3; ++i) if(std::abs(xs[i] - 1) + std::abs(xs[i+1] - 1) + std::abs(xs[i+2] - 1) + std::abs(xs[i+3] - 1) == 0) ++four_1_in_a_row;

Reveal the truth

That's why. Getting an absolute value of an int is not a trivial job. It's not about clipping a bit off; it is a fair comparison and a conditional move. This pair is heavier than a simple multiplication. movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 movl -12(%r12,%rbx,4), %eax addl $-1, %eax imull %eax, %eax movl -8(%r12,%rbx,4), %ecx addl $-1, %ecx imull %ecx, %ecx addl %eax, %ecx movl -4(%r12,%rbx,4), %eax addl $-1, %eax imull %eax, %eax addl %ecx, %eax movl (%r12,%rbx,4), %ecx movl $1, %edx subl %ecx, %edx addl $-1, %ecx imull %edx, %ecx cmpl %ecx, %eax jne .LBB0_7 # BB#17: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .LBB0_7: # %.backedge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 movl -12(%r12,%rbx,4), %eax leal -1(%rax), %ecx movl $1, %edx subl %eax, %edx testl %eax, %eax cmovgl %ecx, %edx movl -8(%r12,%rbx,4), %eax leal -1(%rax), %ecx movl $1, %esi subl %eax, %esi testl %eax, %eax cmovgl %ecx, %esi addl %edx, %esi movl -4(%r12,%rbx,4), %eax leal -1(%rax), %ecx movl $1, %edx subl %eax, %edx testl %eax, %eax cmovgl %ecx, %edx addl %esi, %edx movl (%r12,%rbx,4), %eax leal -1(%rax), %ecx movl $1, %esi subl %eax, %esi testl %eax, %eax cmovgl %ecx, %esi addl %edx, %esi jne .LBB0_7 # BB#17: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .LBB0_7: # %.backedge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 What's interesting, in floating point numbers, when getting an absolute value is about clipping a bit off, there are no tests or conditions.

Round 5. int vs double

The question is, will there be any gain if we simply translate our code to work with doubles?

... using TheType = int; ... for (auto i = 0u; i < TheSize - 3; ++i) if(xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1) ++four_1_in_a_row; ... using TheType = double; ... for (auto i = 0u; i < TheSize - 3; ++i) if(xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1) ++four_1_in_a_row;

Reveal the truth

Not really, no. The operations are different, but essentially the code is the same. It's still about comparing and jumping. movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 cmpl $1, -12(%r12,%rbx,4) jne .LBB0_11 # BB#7: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -8(%r12,%rbx,4) jne .LBB0_11 # BB#8: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -4(%r12,%rbx,4) jne .LBB0_11 # BB#9: # in Loop: Header=BB0_6 Depth=1 cmpl $1, (%r12,%rbx,4) jne .LBB0_11 # BB#10: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .align 16, 0x90 .LBB0_11: # %._crit_edge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 vmovsd -24(%r15,%rbx,8), %xmm0 vucomisd .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#7: # in Loop: Header=BB0_6 Depth=1 vmovsd -16(%r15,%rbx,8), %xmm0 vucomisd .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#8: # in Loop: Header=BB0_6 Depth=1 vmovsd -8(%r15,%rbx,8), %xmm0 vucomisd .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#9: # in Loop: Header=BB0_6 Depth=1 vmovsd (%r15,%rbx,8), %xmm0 vucomisd .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#10: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .align 16, 0x90 .LBB0_11: # %._crit_edge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6

Round 6. *, +, - int vs double

But if we change the code to avoid logic, will there be any gain?

... inline int sq(int x) { return x*x; } ... using TheType = int; ... for (auto i = 0u; i < TheSize - 3; ++i) if(sq(xs[i] - 1) + sq(xs[i+1] - 1) + sq(xs[i+2] - 1) + sq(xs[i+3] - 1) == 0) ++four_1_in_a_row; ... inline double sq(int x) { return x*x; } ... using TheType = double; ... for (auto i = 0u; i < TheSize - 3; ++i) if(sq(xs[i] - 1) + sq(xs[i+1] - 1) + sq(xs[i+2] - 1) + sq(xs[i+3] - 1) == 0) ++four_1_in_a_row;

Reveal the truth

No. The code is conceptually the same though, so the loss is also not that drastic. movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 movl -12(%r12,%rbx,4), %eax addl $-1, %eax imull %eax, %eax movl -8(%r12,%rbx,4), %ecx addl $-1, %ecx imull %ecx, %ecx addl %eax, %ecx movl -4(%r12,%rbx,4), %eax addl $-1, %eax imull %eax, %eax addl %ecx, %eax movl (%r12,%rbx,4), %ecx movl $1, %edx subl %ecx, %edx addl $-1, %ecx imull %edx, %ecx cmpl %ecx, %eax jne .LBB0_7 # BB#17: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .LBB0_7: # %.backedge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 movq %rax, %r14 vmovsd .LCPI1_0(%rip), %xmm0 vxorps %xmm1, %xmm1, %xmm1 .align 16, 0x90 .LBB1_6: # =>This Inner Loop Header: Depth=1 vaddsd -24(%r12,%rbx,8), %xmm0, %xmm2 vmulsd %xmm2, %xmm2, %xmm2 vaddsd -16(%r12,%rbx,8), %xmm0, %xmm3 vmulsd %xmm3, %xmm3, %xmm3 vaddsd %xmm3, %xmm2, %xmm2 vaddsd -8(%r12,%rbx,8), %xmm0, %xmm3 vmulsd %xmm3, %xmm3, %xmm3 vaddsd %xmm3, %xmm2, %xmm2 vaddsd (%r12,%rbx,8), %xmm0, %xmm3 vmulsd %xmm3, %xmm3, %xmm3 vaddsd %xmm3, %xmm2, %xmm2 vucomisd %xmm1, %xmm2 jne .LBB1_9 jnp .LBB1_8 jmp .LBB1_9 .LBB1_8: # in Loop: Header=BB1_6 Depth=1 incl 4(%rsp) .LBB1_9: # %.backedge # in Loop: Header=BB1_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB1_6

Round 7. double vs float

That sound a bit like cheating because with single precision types we would have to read two times less from the memory but let's give it a go.

... using TheType = double; ... for (auto i = 0u; i < TheSize - 3; ++i) if(xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1) ++four_1_in_a_row; ... using TheType = float; ... for (auto i = 0u; i < TheSize - 3; ++i) if(xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1) ++four_1_in_a_row;

Reveal the truth

And that's the most peculiar thing. Although the code now reads much less from the memory, performance wise it's almost identical to its original form. movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 vmovsd -24(%r15,%rbx,8), %xmm0 vucomisd .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#7: # in Loop: Header=BB0_6 Depth=1 vmovsd -16(%r15,%rbx,8), %xmm0 vucomisd .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#8: # in Loop: Header=BB0_6 Depth=1 vmovsd -8(%r15,%rbx,8), %xmm0 vucomisd .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#9: # in Loop: Header=BB0_6 Depth=1 vmovsd (%r15,%rbx,8), %xmm0 vucomisd .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#10: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .align 16, 0x90 .LBB0_11: # %._crit_edge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 vmovss -12(%r15,%rbx,4), %xmm0 vucomiss .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#7: # in Loop: Header=BB0_6 Depth=1 vmovss -8(%r15,%rbx,4), %xmm0 vucomiss .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#8: # in Loop: Header=BB0_6 Depth=1 vmovss -4(%r15,%rbx,4), %xmm0 vucomiss .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#9: # in Loop: Header=BB0_6 Depth=1 vmovss (%r15,%rbx,4), %xmm0 vucomiss .LCPI0_0(%rip), %xmm0 jne .LBB0_11 jp .LBB0_11 # BB#10: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .align 16, 0x90 .LBB0_11: # %._crit_edge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 This has probably something to do with the bus and the CPU working asynchronously. If the CPU is busy enough, the bus load doesn't matter.

Round 8. 1 vs 0

This is the bonus round. There is a notion, right along with the short-circuiting being good for performance, that the comparison with zero is cheaper than any other comparison. Let's try it out.

for (auto i = 0u; i < TheSize - 3; ++i) if(xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1) ++four_1_in_a_row; for (auto i = 0u; i < TheSize - 3; ++i) if(xs[i] == 0 && xs[i+1] == 0 && xs[i+2] == 0 && xs[i+3] == 0) four_0_in_a_row++;

Reveal the truth

Of course, it isn't. It may be contextually beneficial to check against zero because with Intel architecture there is a dedicated zero flag, and most of the time you can read it from the previous operation skipping the comparison part altogether. But in our example, there is no operation to load the zero flag so the code on both sides is almost identical. movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 cmpl $1, -12(%r12,%rbx,4) jne .LBB0_11 # BB#7: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -8(%r12,%rbx,4) jne .LBB0_11 # BB#8: # in Loop: Header=BB0_6 Depth=1 cmpl $1, -4(%r12,%rbx,4) jne .LBB0_11 # BB#9: # in Loop: Header=BB0_6 Depth=1 cmpl $1, (%r12,%rbx,4) jne .LBB0_11 # BB#10: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .align 16, 0x90 .LBB0_11: # %._crit_edge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6 movq %rax, %r14 .align 16, 0x90 .LBB0_6: # =>This Inner Loop Header: Depth=1 cmpl $0, -12(%r12,%rbx,4) jne .LBB0_11 # BB#7: # in Loop: Header=BB0_6 Depth=1 cmpl $0, -8(%r12,%rbx,4) jne .LBB0_11 # BB#8: # in Loop: Header=BB0_6 Depth=1 cmpl $0, -4(%r12,%rbx,4) jne .LBB0_11 # BB#9: # in Loop: Header=BB0_6 Depth=1 cmpl $0, (%r12,%rbx,4) jne .LBB0_11 # BB#10: # in Loop: Header=BB0_6 Depth=1 incl 4(%rsp) .align 16, 0x90 .LBB0_11: # %._crit_edge # in Loop: Header=BB0_6 Depth=1 addq $1, %rbx cmpq $160000000, %rbx jne .LBB0_6