Every once in a while, I hear how intrinsics have improved enough that it's safe to use them for high performance code. That would be nice. The promise of intrinsics is that you can write optimized code by calling out to functions (intrinsics) that correspond to particular assembly instructions. Since intrinsics act like normal functions, they can be cross platform. And since your compiler has access to more computational power than your brain, as well as a detailed model of every CPU, the compiler should be able to do a better job of micro-optimizations. Despite decade old claims that intrinsics can make your life easier, it never seems to work out.

The last time I tried intrinsics was around 2007; for more on why they were hopeless then (see this exploration by the author of VirtualDub). I gave them another shot recently, and while they've improved, they're still not worth the effort. The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand. If you don't check the results by hand, it's easy to get bad results.

For example, as of this writing, the first two Google hits for popcnt benchmark (and 2 out of the top 3 bing hits) claim that Intel's hardware popcnt instruction is slower than a software implementation that counts the number of bits set in a buffer, via a table lookup using the SSSE3 pshufb instruction. This turns out to be untrue, but it must not be obvious, or this claim wouldn't be so persistent. Let's see why someone might have come to the conclusion that the popcnt instruction is slow if they coded up a solution using intrinsics.

One of the top search hits has sample code and benchmarks for both native popcnt as well as the software version using pshufb . Their code requires MSVC, which I don't have access to, but their first popcnt implementation just calls the popcnt intrinsic in a loop, which is fairly easy to reproduce in a form that gcc and clang will accept. Timing it is also pretty simple, since we're just timing a function (that happens to count the number of bits set in some fixed sized buffer).

uint32_t builtin_popcnt(const uint64_t* buf, int len) { int cnt = 0; for (int i = 0; i < len; ++i) { cnt += __builtin_popcountll(buf[i]); } return cnt; }

This is slightly different from the code I linked to above, since they use the dword (32-bit) version of popcnt , and we're using the qword (64-bit) version. Since our version gets twice as much done per loop iteration, I'd expect our version to be faster than their version.

Running clang -O3 -mpopcnt -funroll-loops produces a binary that we can examine. On macs, we can use otool -tv to get the disassembly. On linux, there's objdump -d .

_builtin_popcnt: ; address instruction 0000000100000b30 pushq %rbp 0000000100000b31 movq %rsp, %rbp 0000000100000b34 movq %rdi, -0x8(%rbp) 0000000100000b38 movl %esi, -0xc(%rbp) 0000000100000b3b movl $0x0, -0x10(%rbp) 0000000100000b42 movl $0x0, -0x14(%rbp) 0000000100000b49 movl -0x14(%rbp), %eax 0000000100000b4c cmpl -0xc(%rbp), %eax 0000000100000b4f jge 0x100000bd4 0000000100000b55 movslq -0x14(%rbp), %rax 0000000100000b59 movq -0x8(%rbp), %rcx 0000000100000b5d movq (%rcx,%rax,8), %rax 0000000100000b61 movq %rax, %rcx 0000000100000b64 shrq %rcx 0000000100000b67 movabsq $0x5555555555555555, %rdx 0000000100000b71 andq %rdx, %rcx 0000000100000b74 subq %rcx, %rax 0000000100000b77 movabsq $0x3333333333333333, %rcx 0000000100000b81 movq %rax, %rdx 0000000100000b84 andq %rcx, %rdx 0000000100000b87 shrq $0x2, %rax 0000000100000b8b andq %rcx, %rax 0000000100000b8e addq %rax, %rdx 0000000100000b91 movq %rdx, %rax 0000000100000b94 shrq $0x4, %rax 0000000100000b98 addq %rax, %rdx 0000000100000b9b movabsq $0xf0f0f0f0f0f0f0f, %rax 0000000100000ba5 andq %rax, %rdx 0000000100000ba8 movabsq $0x101010101010101, %rax 0000000100000bb2 imulq %rax, %rdx 0000000100000bb6 shrq $0x38, %rdx 0000000100000bba movl %edx, %esi 0000000100000bbc movl -0x10(%rbp), %edi 0000000100000bbf addl %esi, %edi 0000000100000bc1 movl %edi, -0x10(%rbp) 0000000100000bc4 movl -0x14(%rbp), %eax 0000000100000bc7 addl $0x1, %eax 0000000100000bcc movl %eax, -0x14(%rbp) 0000000100000bcf jmpq 0x100000b49 0000000100000bd4 movl -0x10(%rbp), %eax 0000000100000bd7 popq %rbp 0000000100000bd8 ret

Well, that's interesting. Clang seems to be calculating things manually rather than using popcnt . It seems to be using the approach described here, which is something like

x = x - ((x >> 0x1) & 0x5555555555555555); x = (x & 0x3333333333333333) + ((x >> 0x2) & 0x3333333333333333); x = (x + (x >> 0x4)) & 0xF0F0F0F0F0F0F0F; ans = (x * 0x101010101010101) >> 0x38;

That's not bad for a simple implementation that doesn't rely on any kind of specialized hardware, but that's going to take a lot longer than a single popcnt instruction.

I've got a pretty old version of clang (3.0), so let me try this again after upgrading to 3.4, in case they added hardware popcnt support “recently”.

0000000100001340 pushq %rbp ; save frame pointer 0000000100001341 movq %rsp, %rbp ; new frame pointer 0000000100001344 xorl %ecx, %ecx ; cnt = 0 0000000100001346 testl %esi, %esi 0000000100001348 jle 0x100001363 000000010000134a nopw (%rax,%rax) 0000000100001350 popcntq (%rdi), %rax ; “eax” = popcnt[rdi] 0000000100001355 addl %ecx, %eax ; eax += cnt 0000000100001357 addq $0x8, %rdi ; increment address by 64-bits (8 bytes) 000000010000135b decl %esi ; decrement loop counter; sets flags 000000010000135d movl %eax, %ecx ; cnt = eax; does not set flags 000000010000135f jne 0x100001350 ; examine flags. if esi != 0, goto popcnt 0000000100001361 jmp 0x100001365 ; goto “restore frame pointer” 0000000100001363 movl %ecx, %eax 0000000100001365 popq %rbp ; restore frame pointer 0000000100001366 ret

That's better! We get a hardware popcnt ! Let's compare this to the SSSE3 pshufb implementation presented here as the fastest way to do a popcnt. We'll use a table like the one in the link to show speed, except that we're going to show a rate, instead of the raw cycle count, so that the relative speed between different sizes is clear. The rate is GB/s, i.e., how many gigs of buffer we can process per second. We give the function data in chunks (varying from 1kb to 16Mb); each column is the rate for a different chunk-size. If we look at how fast each algorithm is for various buffer sizes, we get the following.

Algorithm 1k 4k 16k 65k 256k 1M 4M 16M Intrinsic 6.9 7.3 7.4 7.5 7.5 7.5 7.5 7.5 PSHUFB 11.5 13.0 13.3 13.4 13.1 13.4 13.0 12.6

That's not so great. Relative to the the benchmark linked above, we're doing better because we're using 64-bit popcnt instead of 32-bit popcnt , but the PSHUFB version is still almost twice as fast .

One odd thing is the way cnt gets accumulated. cnt is stored in ecx . But, instead of adding the result of the popcnt to ecx , clang has decided to add ecx to the result of the popcnt . To fix that, clang then has to move that sum into ecx at the end of each loop iteration.

The other noticeable problem is that we only get one popcnt per iteration of the loop, which means the loop isn't getting unrolled, and we're paying the entire cost of the loop overhead for each popcnt . Unrolling the loop can also let the CPU extract more instruction level parallelism from the code, although that's a bit beyond the scope of this blog post.

Using clang, that happens even with -O3 -funroll-loops . Using gcc, we get a properly unrolled loop, but gcc has other problems, as we'll see later. For now, let's try unrolling the loop ourselves by calling __builtin_popcountll multiple times during each iteration of the loop. For simplicity, let's try doing four popcnt operations on each iteration. I don't claim that's optimal, but it should be an improvement.

uint32_t builtin_popcnt_unrolled(const uint64_t* buf, int len) { assert(len % 4 == 0); int cnt = 0; for (int i = 0; i < len; i+=4) { cnt += __builtin_popcountll(buf[i]); cnt += __builtin_popcountll(buf[i+1]); cnt += __builtin_popcountll(buf[i+2]); cnt += __builtin_popcountll(buf[i+3]); } return cnt; }

The core of our loop now has

0000000100001390 popcntq (%rdi,%rcx,8), %rdx 0000000100001396 addl %eax, %edx 0000000100001398 popcntq 0x8(%rdi,%rcx,8), %rax 000000010000139f addl %edx, %eax 00000001000013a1 popcntq 0x10(%rdi,%rcx,8), %rdx 00000001000013a8 addl %eax, %edx 00000001000013aa popcntq 0x18(%rdi,%rcx,8), %rax 00000001000013b1 addl %edx, %eax

with pretty much the same code surrounding the loop body. We're doing four popcnt operations every time through the loop, which results in the following performance:

Algorithm 1k 4k 16k 65k 256k 1M 4M 16M Intrinsic 6.9 7.3 7.4 7.5 7.5 7.5 7.5 7.5 PSHUFB 11.5 13.0 13.3 13.4 13.1 13.4 13.0 12.6 Unrolled 12.5 14.4 15.0 15.1 15.2 15.2 15.2 15.2

Between using 64-bit popcnt and unrolling the loop, we've already beaten the allegedly faster pshufb code! But it's close enough that we might get different results with another compiler or some other chip. Let's see if we can do better.

So, what's the deal with this popcnt false dependency bug that's been getting a lot of publicity lately? Turns out, popcnt has a false dependency on its destination register, which means that even though the result of popcnt doesn't depend on its destination register, the CPU thinks that it does and will wait until the destination register is ready before starting the popcnt instruction.

x86 typically has two operand operations, e.g., addl %eax, %edx adds eax and edx , and then places the result in edx , so it's common for an operation to have a dependency on its output register. In this case, there shouldn't be a dependency, since the result doesn't depend on the contents of the output register, but that's an easy bug to introduce, and a hard one to catch .

In this particular case, popcnt has a 3 cycle latency, but it's pipelined such that a popcnt operation can execute each cycle. If we ignore other overhead, that means that a single popcnt will take 3 cycles, 2 will take 4 cycles, 3 will take 5 cycles, and n will take n+2 cycles, as long as the operations are independent. But, if the CPU incorrectly thinks there's a dependency between them, we effectively lose the ability to pipeline the instructions, and that n+2 turns into 3n.

We can work around this by buying a CPU from AMD or VIA, or by putting the popcnt results in different registers. Let's making an array of destinations, which will let us put the result from each popcnt into a different place.

uint32_t builtin_popcnt_unrolled_errata(const uint64_t* buf, int len) { assert(len % 4 == 0); int cnt[4]; for (int i = 0; i < 4; ++i) { cnt[i] = 0; } for (int i = 0; i < len; i+=4) { cnt[0] += __builtin_popcountll(buf[i]); cnt[1] += __builtin_popcountll(buf[i+1]); cnt[2] += __builtin_popcountll(buf[i+2]); cnt[3] += __builtin_popcountll(buf[i+3]); } return cnt[0] + cnt[1] + cnt[2] + cnt[3]; }

And now we get

0000000100001420 popcntq (%rdi,%r9,8), %r8 0000000100001426 addl %ebx, %r8d 0000000100001429 popcntq 0x8(%rdi,%r9,8), %rax 0000000100001430 addl %r14d, %eax 0000000100001433 popcntq 0x10(%rdi,%r9,8), %rdx 000000010000143a addl %r11d, %edx 000000010000143d popcntq 0x18(%rdi,%r9,8), %rcx

That's better -- we can see that the first popcnt outputs into r8 , the second into rax , the third into rdx , and the fourth into rcx . However, this does the same odd accumulation as the original, where instead of adding the result of the popcnt to cnt[i] , it does the opposite, which necessitates moving the results back to cnt[i] afterwards.

000000010000133e movl %ecx, %r10d 0000000100001341 movl %edx, %r11d 0000000100001344 movl %eax, %r14d 0000000100001347 movl %r8d, %ebx

Well, at least in clang (3.4). Gcc (4.8.2) is too smart to fall for this separate destination thing and “optimizes” the code back to something like our original version.

Algorithm 1k 4k 16k 65k 256k 1M 4M 16M Intrinsic 6.9 7.3 7.4 7.5 7.5 7.5 7.5 7.5 PSHUFB 11.5 13.0 13.3 13.4 13.1 13.4 13.0 12.6 Unrolled 12.5 14.4 15.0 15.1 15.2 15.2 15.2 15.2 Unrolled 2 14.3 16.3 17.0 17.2 17.2 17.0 16.8 16.7

To get a version that works with both gcc and clang, and doesn't have these extra mov s, we'll have to write the assembly by hand :

uint32_t builtin_popcnt_unrolled_errata_manual(const uint64_t* buf, int len) { assert(len % 4 == 0); uint64_t cnt[4]; for (int i = 0; i < 4; ++i) { cnt[i] = 0; } for (int i = 0; i < len; i+=4) { __asm__( "popcnt %4, %4

\ "add %4, %0

\t" "popcnt %5, %5

\t" "add %5, %1

\t" "popcnt %6, %6

\t" "add %6, %2

\t" "popcnt %7, %7

\t" "add %7, %3

\t" // +r means input/output, r means intput : "+r" (cnt[0]), "+r" (cnt[1]), "+r" (cnt[2]), "+r" (cnt[3]) : "r" (buf[i]), "r" (buf[i+1]), "r" (buf[i+2]), "r" (buf[i+3])); } return cnt[0] + cnt[1] + cnt[2] + cnt[3]; }

This directly translates the assembly into the loop:

00000001000013c3 popcntq %r10, %r10 00000001000013c8 addq %r10, %rcx 00000001000013cb popcntq %r11, %r11 00000001000013d0 addq %r11, %r9 00000001000013d3 popcntq %r14, %r14 00000001000013d8 addq %r14, %r8 00000001000013db popcntq %rbx, %rbx

Great! The add s are now going the right direction, because we specified exactly what they should do.

Algorithm 1k 4k 16k 65k 256k 1M 4M 16M Intrinsic 6.9 7.3 7.4 7.5 7.5 7.5 7.5 7.5 PSHUFB 11.5 13.0 13.3 13.4 13.1 13.4 13.0 12.6 Unrolled 12.5 14.4 15.0 15.1 15.2 15.2 15.2 15.2 Unrolled 2 14.3 16.3 17.0 17.2 17.2 17.0 16.8 16.7 Assembly 17.5 23.7 25.3 25.3 26.3 26.3 25.3 24.3

Finally! A version that blows away the PSHUFB implementation. How do we know this should be the final version? We can see from Agner's instruction tables that we can execute, at most, one popcnt per cycle. I happen to have run this on a 3.4Ghz Sandy Bridge, so we've got an upper bound of 8 bytes / cycle * 3.4 G cycles / sec = 27.2 GB/s . That's pretty close to the 26.3 GB/s we're actually getting, which is a sign that we can't make this much faster .

In this case, the hand coded assembly version is about 3x faster than the original intrinsic loop (not counting the version from a version of clang that didn't emit a popcnt ). It happens that, for the compiler we used, the unrolled loop using the popcnt intrinsic is a bit faster than the pshufb version, but that wasn't true of one of the two unrolled versions when I tried this with gcc .

It's easy to see why someone might have benchmarked the same code and decided that popcnt isn't very fast. It's also easy to see why using intrinsics for performance critical code can be a huge time sink .

Thanks to Scott for some comments on the organization of this post, and to Leah for extensive comments on just about everything

If you liked this, you'll probably enjoy this post about how CPUs have changed since the 80s.