Using logical operators for logical operations is good

My last piece, Challenge your performance intuition with C++ operators, was about how context matters more than tricks. Unfortunately, I didn't make my point clear enough, for which I'm truly sorry.

It might look like I advocate using math instead of logic because of the performance gain. Well, I do. But only if the context is well known and not going to change. This is the whole point. The context matters.

I find trickery appropriate when you know your hardware and your compiler, and you are ready to redo your code from scratch when something changes. This might be the case if you run some computationally heavy algorithm in the cloud. Your environment is predetermined, and you pay per minute, so it makes sense to squeeze every penny out of what you've got.

But in the general case, you should use logical operations to do logic. Not because of short-circuiting, which is also a context-dependent trick, but because in the general case compilers do the trickery better than we humans do.

Let's redo a few rounds. The benchmark is the same, the questions are the same, and the compiler is the same. The only thing that changes is the platform. This time it's a CHIP single-board computer with an ARMv7 processor.

This is the original benchmark. I only reduced the number of operations tenfold because the machine itself is much slower.

```cpp
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main() {
    using TheType = int;
    constexpr auto TheSize = 16 * 1000000;

    std::mt19937 rng(0);
    std::uniform_int_distribution<TheType> distribution(0, 1);
    std::vector<TheType> xs(TheSize);
    for (auto& digit : xs) {
        digit = distribution(rng);
    }

    volatile auto four_1_in_a_row = 0u;
    auto start = std::chrono::system_clock::now();
    for (auto i = 0u; i < TheSize - 3; ++i)
        if (xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1)
            ++four_1_in_a_row;
    auto end = std::chrono::system_clock::now();

    std::cout << "time: " << (end - start).count() * 1e-9
              << " 1111s: " << four_1_in_a_row << "\n";
}
```

Just like the last time, using your intuition and best judgment, please estimate the relative performance of the code snippets from below.

Round 1. && vs &

The same question: is && faster than &?

```cpp
for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1)
        ++four_1_in_a_row;
```

```cpp
for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1 & xs[i+1] == 1 & xs[i+2] == 1 & xs[i+3] == 1)
        ++four_1_in_a_row;
```

Reveal the truth

Of course not. We didn't change the compiler, only the processor. Clang can still do the short-circuiting with & just as with &&.

The && version:

```
        movw r0, #36852
        movt r0, #976
.LBB0_6:
        ldr r1, [r4, r5]
        cmp r1, #1
        bne .LBB0_9
@ BB#7:
        add r1, r4, r5
        ldr r2, [r1, #4]
        cmp r2, #1
        ldreq r2, [r1, #8]
        cmpeq r2, #1
        bne .LBB0_9
@ BB#8:
        ldr r1, [r1, #12]
        cmp r1, #1
        ldreq r1, [sp, #20]
        addeq r1, r1, #1
        streq r1, [sp, #20]
.LBB0_9:
        add r5, r5, #4
        cmp r5, r0
        bne .LBB0_6
@ BB#10:
        mov r0, sp
```

The & version:

```
        movw r0, #36852
        movt r0, #976
.LBB0_6:
        mov r1, r4
        ldr r2, [r1, r5]!
        cmp r2, #1
        bne .LBB0_9
@ BB#7:
        ldr r2, [r1, #4]
        cmp r2, #1
        ldreq r2, [r1, #8]
        cmpeq r2, #1
        bne .LBB0_9
@ BB#8:
        ldr r1, [r1, #12]
        cmp r1, #1
        ldreq r1, [sp, #20]
        addeq r1, r1, #1
        streq r1, [sp, #20]
.LBB0_9:
        add r5, r5, #4
        cmp r5, r0
        bne .LBB0_6
@ BB#10:
        mov r0, sp
```

The code is still almost the same on both sides. Interestingly though, not all of the comparisons result in short-circuiting: the middle of the chain is done with conditionally executed loads and compares (ldreq, cmpeq) rather than branches.

Round 2. ==, && vs *, +, -

On Intel, substituting logic with arithmetic gave a noticeable gain. Will the trick work on ARMv7?

```cpp
for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1)
        ++four_1_in_a_row;
```

```cpp
...
inline int sq(int x) { return x * x; }
...
for (auto i = 0u; i < TheSize - 3; ++i)
    if (sq(xs[i] - 1) + sq(xs[i+1] - 1) + sq(xs[i+2] - 1) + sq(xs[i+3] - 1) == 0)
        ++four_1_in_a_row;
```

Reveal the truth

No. We even lost a bit.

The logical version:

```
        movw r0, #36852
        movt r0, #976
.LBB0_6:
        ldr r1, [r4, r5]
        cmp r1, #1
        bne .LBB0_9
@ BB#7:
        add r1, r4, r5
        ldr r2, [r1, #4]
        cmp r2, #1
        ldreq r2, [r1, #8]
        cmpeq r2, #1
        bne .LBB0_9
@ BB#8:
        ldr r1, [r1, #12]
        cmp r1, #1
        ldreq r1, [sp, #20]
        addeq r1, r1, #1
        streq r1, [sp, #20]
.LBB0_9:
        add r5, r5, #4
        cmp r5, r0
        bne .LBB0_6
@ BB#10:
        mov r0, sp
```

The arithmetic version:

```
        movw r0, #36852
        movt r0, #976
.LBB0_6:
        mov r1, r4
        ldr r2, [r1, r5]!
        add r5, r5, #4
        sub r2, r2, #1
        ldmib r1, {r3, r7}
        mul r2, r2, r2
        sub r3, r3, #1
        ldr r1, [r1, #12]
        mla r2, r3, r3, r2
        sub r6, r1, #1
        rsb r1, r1, #1
        sub r3, r7, #1
        mul r1, r1, r6
        mla r2, r3, r3, r2
        cmp r2, r1
        ldreq r1, [sp, #20]
        addeq r1, r1, #1
        streq r1, [sp, #20]
        cmp r5, r0
        bne .LBB0_6
@ BB#7:
        mov r0, sp
```

The code on the right is still plain computations, just like with the Intel i7. But ARMv7 doesn't have as many pipelines to keep busy as Intel does, so it pays a smaller penalty for branching, and removing the branches buys us less.

Round 3. * vs abs

With Intel, switching multiplication to absolute value resulted in a noticeable loss. How will ARM do?

```cpp
...
inline int sq(int x) { return x * x; }
...
for (auto i = 0u; i < TheSize - 3; ++i)
    if (sq(xs[i] - 1) + sq(xs[i+1] - 1) + sq(xs[i+2] - 1) + sq(xs[i+3] - 1) == 0)
        ++four_1_in_a_row;
```

```cpp
for (auto i = 0u; i < TheSize - 3; ++i)
    if (std::abs(xs[i] - 1) + std::abs(xs[i+1] - 1) + std::abs(xs[i+2] - 1)
        + std::abs(xs[i+3] - 1) == 0)
        ++four_1_in_a_row;
```

Reveal the truth

There is a slight gain instead.

The multiplication version:

```
        movw r0, #36852
        movt r0, #976
.LBB0_6:
        mov r1, r4
        ldr r2, [r1, r5]!
        add r5, r5, #4
        sub r2, r2, #1
        ldmib r1, {r3, r7}
        mul r2, r2, r2
        sub r3, r3, #1
        ldr r1, [r1, #12]
        mla r2, r3, r3, r2
        sub r6, r1, #1
        rsb r1, r1, #1
        sub r3, r7, #1
        mul r1, r1, r6
        mla r2, r3, r3, r2
        cmp r2, r1
        ldreq r1, [sp, #20]
        addeq r1, r1, #1
        streq r1, [sp, #20]
        cmp r5, r0
        bne .LBB0_6
@ BB#7:
        mov r0, sp
```

The absolute value version:

```
        movw r0, #36852
        movt r0, #976
.LBB0_6:
        mov r1, r4
        ldr r2, [r1, r5]!
        add r5, r5, #4
        ldmib r1, {r3, r7}
        rsb r6, r2, #1
        cmp r2, #0
        subgt r6, r2, #1
        rsb r2, r3, #1
        cmp r3, #0
        ldr r1, [r1, #12]
        subgt r2, r3, #1
        rsb r3, r7, #1
        cmp r7, #0
        add r2, r2, r6
        subgt r3, r7, #1
        cmp r1, #0
        add r2, r2, r3
        rsb r3, r1, #1
        subgt r3, r1, #1
        cmn r2, r3
        ldreq r1, [sp, #20]
        addeq r1, r1, #1
        streq r1, [sp, #20]
        cmp r5, r0
        bne .LBB0_6
@ BB#7:
        mov r0, sp
```

Again, the code for ARM is not conceptually different from what we saw for Intel. But conditional instructions are not that much of a burden here, so the absolute value version wins.

Round 4. int vs float

On Intel, the double and int versions work almost the same. Since ARMv7 is a 32-bit processor, it seems only fair to compare the int version with a float one. Let's do that.

```cpp
...
using TheType = int;
...
for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1)
        ++four_1_in_a_row;
```

```cpp
...
using TheType = float;
...
for (auto i = 0u; i < TheSize - 3; ++i)
    if (xs[i] == 1 && xs[i+1] == 1 && xs[i+2] == 1 && xs[i+3] == 1)
        ++four_1_in_a_row;
```

Reveal the truth

And the float version is significantly slower.

The int version:

```
        movw r0, #36852
        movt r0, #976
.LBB0_6:
        ldr r1, [r4, r5]
        cmp r1, #1
        bne .LBB0_9
@ BB#7:
        add r1, r4, r5
        ldr r2, [r1, #4]
        cmp r2, #1
        ldreq r2, [r1, #8]
        cmpeq r2, #1
        bne .LBB0_9
@ BB#8:
        ldr r1, [r1, #12]
        cmp r1, #1
        ldreq r1, [sp, #20]
        addeq r1, r1, #1
        streq r1, [sp, #20]
.LBB0_9:
        add r5, r5, #4
        cmp r5, r0
        bne .LBB0_6
@ BB#10:
        mov r0, sp
```

The float version:

```
        vmov.f32 s0, #1.000000e+00
        movw r0, #36852
        movt r0, #976
.LBB0_6:
        add r1, r4, r5
        vldr s2, [r1]
        vcmpe.f32 s2, s0
        vmrs APSR_nzcv, fpscr
        bne .LBB0_10
@ BB#7:
        vldr s2, [r1, #4]
        vcmpe.f32 s2, s0
        vmrs APSR_nzcv, fpscr
        bne .LBB0_10
@ BB#8:
        vldr s2, [r1, #8]
        vcmpe.f32 s2, s0
        vmrs APSR_nzcv, fpscr
        bne .LBB0_10
@ BB#9:
        vldr s2, [r1, #12]
        vcmpe.f32 s2, s0
        vmrs APSR_nzcv, fpscr
        ldreq r1, [sp, #20]
        addeq r1, r1, #1
        streq r1, [sp, #20]
.LBB0_10:
        add r5, r5, #4
        cmp r5, r0
        bne .LBB0_6
@ BB#11:
        mov r0, sp
```

The code is fine. It's still about comparing and jumping. But the floating-point operations themselves are a bit pricey on ARM.

Conclusion

The context matters. Unless you are willing to optimize for one specific platform, paying greatly in maintenance and portability, writing simple code is the best strategy.