C compilers are quite intelligent when it comes to performing optimization tricks. When compiling code with something like GCC, you get man-centuries worth of optimization passes applied before seeing the output. Unfortunately, the compiler isn't always smart enough in the way we might want and it can't optimize code in a way we might expect, forcing the programmer to perform some "obvious" tricks by hand. Sometimes, these hand tricks result in a less desirable end result. In this post, we will explore one instance where worse-looking code (to a human) results in better-looking code to a CPU.

To assist us in this exploration, we are going to be discussing my favorite error-detection code: CRC32. This wonderful little algorithm shows up in all sorts of places, from ethernet, to PNG, to iSCSI, to btrfs, and beyond. It also has a number of properties which make it ideal for a blog post.

- It is easy to understand the requirements.
- There are SSE instructions to optimize it.

All functions will have a signature which looks like this (sans documentation):

/**
 * Compute the CRC-32C (Castagnoli polynomial) for data in range [first, last).
 *
 * \param first Beginning of the data range.
 * \param last One past the end of the data.
 * \returns The computed checksum.
**/
std::uint32_t crc32(const char* first, const char* last);

There are two things to take note of here. First is the use of half-open range [first, last) to behave like a proper C++ function. Second is the word "Castagnoli," which is both a fried Italian dough ball and the name of a guy who invented a better transformation for 32-bit CRC (CRC is a family of algorithms, all of which use a polynomial transformation; for example, CRC-1 with polynomial \(x + 1\) is commonly known as a "parity bit"). CRC-32C has better error-detection capabilities than the "classic" CRC-32 polynomial, so you will find it used more frequently in "modern" systems like ext4 and Ceph.
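To make the polynomial concrete, here is a minimal bit-at-a-time reference implementation of CRC-32C. This is a sketch for illustration, not what any production library ships; 0x82f63b78 is the bit-reversed ("reflected") form of the Castagnoli polynomial 0x1edc6f41, matching the reflected-input/reflected-output convention used throughout this post.

```cpp
#include <cstdint>

// Bit-at-a-time CRC-32C: fold each input byte into the remainder LSB-first,
// then run eight rounds of the reflected Castagnoli polynomial 0x82f63b78.
std::uint32_t crc32_bitwise(const char* first, const char* last)
{
    std::uint32_t rem = ~0U;
    for (const char* p = first; p != last; ++p)
    {
        rem ^= static_cast<std::uint8_t>(*p);
        for (int bit = 0; bit < 8; ++bit)
            rem = (rem >> 1) ^ ((rem & 1U) ? 0x82f63b78U : 0U);
    }
    return ~rem;
}
```

Every faster implementation in this post should produce exactly the same answer as this loop; for the standard check input "123456789", CRC-32C yields 0xe3069283.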

Implementations

Implementation performance will be measured through microbenchmarking. The input ranges will be of sizes \(\{2^i : 8 \leq i \leq 20\}\) (aka \(\{256, 512, 1024, \ldots, 1048576\}\)). All tests were run on my Intel Core i7 6700K, a 4.0 GHz processor with a Skylake core. Unless otherwise specified, the compiler used was g++ 6.3.1 with the flags -O3 and -fwhole-program . The source code is available on Gist as crc32.cpp .
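For reference, a nanoseconds-per-byte measurement boils down to something like the following sketch. ns_per_byte is a hypothetical helper, not the exact harness behind the numbers in this post; a real harness would also warm up, take many samples, and report a robust statistic.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Time `fn` over a `size`-byte buffer for `iterations` passes and report
// the average cost in nanoseconds per processed byte.
template <typename Fn>
double ns_per_byte(Fn fn, std::size_t size, int iterations = 1000)
{
    std::vector<char> buffer(size, 'x');
    volatile std::uint32_t sink = 0;  // keep the result live so the call is not optimized away
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        sink = fn(buffer.data(), buffer.data() + buffer.size());
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / (double(iterations) * double(size));
}
```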

Boost

The simplest algorithm is one that has already been written for you. To that end, let's implement our CRC in terms of boost::crc_optimal .

#include <boost/crc.hpp>

std::uint32_t crc32_boost(const char* first, const char* last)
{
    boost::crc_optimal<32, 0x1edc6f41, 0xffffffff, 0xffffffff, true, true> machine;
    machine.process_block(first, last);
    return machine.checksum();
}

Internally, boost::crc_optimal<...>::process_block uses a simple lookup table constructed on the first use of boost::crc_optimal for the provided parameters. Removing some of the generic-ness and compiler workarounds, crc32_boost ends up looking something like this:

#include <array>
#include <cstdint>

std::uint32_t crc32_boost(const char* first, const char* last)
{
    // compute_crc32_lookups builds the 256-entry byte table on first use
    static const std::array<std::uint32_t, 256> table =
        compute_crc32_lookups(32, 0x1edc6f41, 0xffffffff);
    std::uint32_t rem = ~0U;
    for (const char* p = first; p < last; ++p)
    {
        int lookup_index = std::uint8_t(rem ^ *p);
        rem >>= 8;
        rem ^= table[lookup_index];
    }
    return ~rem;
}
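The compute_crc32_lookups call is a stand-in for Boost's internal table generation. For a reflected CRC like CRC-32C, building such a table generally looks like the following sketch (compute_crc32c_table is a hypothetical name, not Boost's actual internals):

```cpp
#include <array>
#include <cstdint>

// Each table entry is the CRC remainder of a single byte value, computed
// bit by bit with the reflected Castagnoli polynomial 0x82f63b78.
std::array<std::uint32_t, 256> compute_crc32c_table()
{
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t byte = 0; byte < 256; ++byte)
    {
        std::uint32_t rem = byte;
        for (int bit = 0; bit < 8; ++bit)
            rem = (rem >> 1) ^ ((rem & 1U) ? 0x82f63b78U : 0U);
        table[byte] = rem;
    }
    return table;
}
```

Precomputing the per-byte remainders is what lets the main loop retire a whole byte per table lookup instead of one bit per iteration.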

Alright, let's look at the results in terms of nanoseconds per byte of processed data.

The first takeaway is CRC32 is pretty fast -- only around 2 nanoseconds per processed byte. As usual with these sorts of things, the per-byte cost drops steeply as the input size grows and fixed overheads get amortized. There is an odd spike at 32 KiB, which likely comes from my CPU having a 32 KiB L1 cache.

A: Naïve crc32b

Let's start improving the performance of the algorithm! To do this, we are going to use the CPU instructions specific to CRC-32C. SSE 4.2 adds a crc32 instruction, which AT&T assembly syntax exposes as four mnemonics: crc32b (8-bit bytes), crc32w (16-bit words), crc32l (32-bit longs), and crc32q (64-bit quads). To start, let's just use the byte-sized version.

#include <cstdint>
#include <smmintrin.h>

__attribute__((target("sse4.2")))
std::uint32_t crc32a(const char* first, const char* last)
{
    std::uint32_t code = ~0U;
    for ( ; first != last; ++first)
        code = _mm_crc32_u8(code, *first);
    return ~code;
}

The use of __attribute__((target("sse4.2"))) allows the crc32a function to be compiled with support for the sse4.2 target (similar to specifying -msse4.2 on the command line). This can be useful for providing fallback mechanisms when running on systems which do not support SSE 4.2.
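One way to sketch such a fallback is to pick an implementation at startup with GCC's __builtin_cpu_supports. The function names below are illustrative, and the fallback shown is a simple bit-at-a-time loop rather than a tuned table-driven one:

```cpp
#include <cstdint>
#include <smmintrin.h>

// Hypothetical portable fallback: bit-at-a-time CRC-32C using the
// reflected Castagnoli polynomial 0x82f63b78.
static std::uint32_t crc32_fallback(const char* first, const char* last)
{
    std::uint32_t rem = ~0U;
    for ( ; first != last; ++first)
    {
        rem ^= static_cast<std::uint8_t>(*first);
        for (int bit = 0; bit < 8; ++bit)
            rem = (rem >> 1) ^ ((rem & 1U) ? 0x82f63b78U : 0U);
    }
    return ~rem;
}

// Hardware-accelerated version, only safe to call on SSE 4.2 machines.
__attribute__((target("sse4.2")))
static std::uint32_t crc32_sse42(const char* first, const char* last)
{
    std::uint32_t code = ~0U;
    for ( ; first != last; ++first)
        code = _mm_crc32_u8(code, *first);
    return ~code;
}

// Check CPU support once, then route every call to the right implementation.
std::uint32_t crc32_dispatch(const char* first, const char* last)
{
    static const bool have_sse42 = __builtin_cpu_supports("sse4.2");
    return have_sse42 ? crc32_sse42(first, last) : crc32_fallback(first, last);
}
```

Both paths compute the same checksum, so callers never need to know which one ran.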

So...it's approximately 2.5 times faster than the boost::crc_optimal implementation. It has the same shape as the Boost implementation, with the small hump at 32 KiB.

B: std::for_each with crc32b

Let's talk templates!

#include <algorithm>
#include <cstdint>
#include <smmintrin.h>

__attribute__((target("sse4.2")))
std::uint32_t crc32b(const char* first, const char* last)
{
    std::uint32_t code = ~0U;
    std::for_each(first, last,
                  [&] (const char& x) { code = _mm_crc32_u8(code, x); }
                 );
    return ~code;
}

If you're following along at home with GCC 6.3.1 (and many other versions), you will not be able to compile that code. It will fail with a semi-aggressive compilation error that looks a bit like this:

In file included from crc32.cpp:26:0:
/usr/lib64/gcc/x86_64-suse-linux/6/include/smmintrin.h: In lambda function:
/usr/lib64/gcc/x86_64-suse-linux/6/include/smmintrin.h:827:1: error: inlining failed in call to always_inline ‘unsigned int _mm_crc32_u8(unsigned int, unsigned char)’: target specific option mismatch
 _mm_crc32_u8 (unsigned int __C, unsigned char __V)
crc32.cpp:65:60: note: called from here
     [&] (const char& x) { code = _mm_crc32_u8(code, x); }

The issue here is that while the crc32b function is compiled with __attribute__((target("sse4.2"))) , the lambda function crc32b(char const*, char const*)::{lambda(char const&)#1}::operator()(char const&) const is not. Unfortunately, in GCC, lambda functions do not inherit the function-specific compilation options of their containing function (this works properly in Clang). The fix is to add the attribute to the lambda function as well:

[&] (const char& x) __attribute__((target("sse4.2"))) { code = _mm_crc32_u8(code, x); }

They're basically the same shape...I'm willing to attribute any differences to sampling error. It looks like the use of std::for_each has no effect on performance!

C: Opportunistic Strides

Going byte-by-byte might be easy, but it is not the most efficient way to use the SSE instructions. The entire point of SSE is to perform single instructions that process multiple data. Here, the "multiple data" refers to multiple bytes; the "single instructions" to perform this processing are the crc32 variants which accept operands larger than a byte. If we want to make our CRC implementation faster, we need to process larger chunks of input at a time. Let's build on crc32a and advance in 2-, 4-, and 8-byte increments whenever possible.

#include <cstdint>
#include <smmintrin.h>

__attribute__((target("sse4.2")))
std::uint32_t crc32c(const char* first, const char* last)
{
    std::uint32_t code = ~0U;
    for ( ; first < last; /* inline */)
    {
        if (reinterpret_cast<std::uintptr_t>(first) % 8 == 0 && first + 8 <= last)
        {
            code = _mm_crc32_u64(code, *reinterpret_cast<const std::uint64_t*>(first));
            first += 8;
        }
        else if (reinterpret_cast<std::uintptr_t>(first) % 4 == 0 && first + 4 <= last)
        {
            code = _mm_crc32_u32(code, *reinterpret_cast<const std::uint32_t*>(first));
            first += 4;
        }
        else if (reinterpret_cast<std::uintptr_t>(first) % 2 == 0 && first + 2 <= last)
        {
            code = _mm_crc32_u16(code, *reinterpret_cast<const std::uint16_t*>(first));
            first += 2;
        }
        else // if (reinterpret_cast<std::uintptr_t>(first) % 1 == 0 && first + 1 <= last)
        {
            code = _mm_crc32_u8(code, *reinterpret_cast<const std::uint8_t*>(first));
            first += 1;
        }
    }
    return ~code;
}

The algorithm is simple: Opportunistically stride by the largest legal increment or fall through to a smaller increment. The if after the final else is commented out because it will always be true, as \(x \in \mathbb{N} \rightarrow x \equiv 0 \mod 1\) and \(n + 1 \leq m \rightarrow n < m\).
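To see the strides concretely, here is a small simulation of algorithm C's decisions for a hypothetical starting address and length (it only mimics the control flow; no CRC is computed):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Replay the stride choices crc32c would make for a buffer starting at
// `addr` with `len` bytes; returns the sequence of increments taken.
std::vector<int> stride_pattern(std::uintptr_t addr, std::size_t len)
{
    std::vector<int> strides;
    std::uintptr_t end = addr + len;
    while (addr < end)
    {
        int step;
        if      (addr % 8 == 0 && addr + 8 <= end) step = 8;
        else if (addr % 4 == 0 && addr + 4 <= end) step = 4;
        else if (addr % 2 == 0 && addr + 2 <= end) step = 2;
        else                                       step = 1;
        strides.push_back(step);
        addr += step;
    }
    return strides;
}
```

For a 13-byte buffer starting one byte past an 8-byte boundary, the increments come out as 1, 2, 4, 4, 2: the prologue works up to 8-byte alignment, and the epilogue steps back down for the tail.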

This is around 6 times faster than the byte-by-byte implementation. The 32 KiB spike is imperceptible at this point. The speedup is in the realm of reasonable expectations, as we are executing approximately \(\frac{1}{8}\) the instructions. Unfortunately, it is not 8 times faster...why?

D: Explicit Strides

It seems like algorithm C is wasting a lot of cycles reprocessing the same information. Adding 8 to a first which matched reinterpret_cast<std::uintptr_t>(first) % 8 == 0 will continue to match the predicate. Things get worse if we are not 8-byte aligned -- if the input starts at an odd offset, we have to repeat the same failing alignment checks on every iteration. A Sufficiently Smart Compiler would figure this out and use that information for optimization purposes. Is GCC sufficiently smart? Let's assume GCC isn't and be explicit about the logic that we as Sufficiently Smart Computer Programmers know.

#include <cstdint>
#include <smmintrin.h>

__attribute__((target("sse4.2")))
std::uint32_t crc32d(const char* first, const char* last)
{
    std::uint32_t code = ~0U;
    while (   first + 1 <= last
           && (reinterpret_cast<std::uintptr_t>(first) % 2 != 0 || first + 2 > last)
          )
    {
        code = _mm_crc32_u8(code, *reinterpret_cast<const std::uint8_t*>(first));
        first += 1;
    }
    while (   first + 2 <= last
           && (reinterpret_cast<std::uintptr_t>(first) % 4 != 0 || first + 4 > last)
          )
    {
        code = _mm_crc32_u16(code, *reinterpret_cast<const std::uint16_t*>(first));
        first += 2;
    }
    while (   first + 4 <= last
           && (reinterpret_cast<std::uintptr_t>(first) % 8 != 0 || first + 8 > last)
          )
    {
        code = _mm_crc32_u32(code, *reinterpret_cast<const std::uint32_t*>(first));
        first += 4;
    }
    while (   reinterpret_cast<std::uintptr_t>(first) % 8 == 0
           && first + 8 <= last
          )
    {
        code = _mm_crc32_u64(code, *reinterpret_cast<const std::uint64_t*>(first));
        first += 8;
    }
    while (first + 4 <= last)
    {
        code = _mm_crc32_u32(code, *reinterpret_cast<const std::uint32_t*>(first));
        first += 4;
    }
    while (first + 2 <= last)
    {
        code = _mm_crc32_u16(code, *reinterpret_cast<const std::uint16_t*>(first));
        first += 2;
    }
    while (first + 1 <= last)
    {
        code = _mm_crc32_u8(code, *reinterpret_cast<const std::uint8_t*>(first));
        first += 1;
    }
    return ~code;
}

Why all the whiles? Clearly, entering a block under the condition first + 1 <= last && (reinterpret_cast<std::uintptr_t>(first) % 2 != 0 || first + 2 > last) , then unconditionally performing first += 1 will always make it fail the condition check immediately afterward. Okay, maybe that's only clear to the compiler. It's definitely true, though...so why not just say if ? The answer will come when we genericify this function (saved for part 2). While CRC offers instructions for every power of 2 along the way, there are a number of SSE instructions which only come in 8-byte varieties. In these cases, we would actually need a while loop. Don't worry -- the compiler is smart enough to figure this out (if you're curious or suspicious, you can replace the non-_mm_crc32_u64 whiles with ifs and benchmark the result).

It looks like they are about the same speed. If anything, algorithm C is faster, although only by an infinitesimal amount and only for very small buffers. To figure out why, let's look at the disassembly of crc32c .

_Z6crc32cPKcS0_:
  cmp    %rsi,%rdi
  mov    $0xffffffff,%eax
  jne    401799 <_Z6crc32cPKcS0_+0x29>
  jmp    4017eb <_Z6crc32cPKcS0_+0x7b>
  nopl   0x0(%rax)
  lea    0x8(%rdi),%rdx
  cmp    %rdx,%rsi
  jb     40179f <_Z6crc32cPKcS0_+0x2f>
  mov    %eax,%eax
  crc32q (%rdi),%rax
  mov    %rdx,%rdi
  cmp    %rdi,%rsi
  je     4017bb <_Z6crc32cPKcS0_+0x4b>
  test   $0x7,%dil
  je     401780 <_Z6crc32cPKcS0_+0x10>
  test   $0x3,%dil
  jne    4017c0 <_Z6crc32cPKcS0_+0x50>
  lea    0x4(%rdi),%rdx
  cmp    %rdx,%rsi
  jb     4017c0 <_Z6crc32cPKcS0_+0x50>
  crc32l (%rdi),%eax
  mov    %rdx,%rdi
  cmp    %rdi,%rsi
  jne    401799 <_Z6crc32cPKcS0_+0x29>
  not    %eax
  retq
  xchg   %ax,%ax
  test   $0x1,%dil
  jne    4017e0 <_Z6crc32cPKcS0_+0x70>
  lea    0x2(%rdi),%rdx
  cmp    %rdx,%rsi
  jb     4017e0 <_Z6crc32cPKcS0_+0x70>
  crc32w (%rdi),%eax
  mov    %rdx,%rdi
  jmp    401794 <_Z6crc32cPKcS0_+0x24>
  nopw   0x0(%rax,%rax,1)
  crc32b (%rdi),%eax
  add    $0x1,%rdi
  jmp    401794 <_Z6crc32cPKcS0_+0x24>
  xor    %eax,%eax
  retq
  xchg   %ax,%ax

The key part of the assembly is between lea 0x8(%rdi),%rdx ( %rdx = %rdi + 8 ) and the je after test $0x7,%dil (which computes 0x7 & %dil , i.e. %dil % 8 ), inside of which you will find the crc32q instruction. This is where the vast majority of processing time is spent -- a 31-byte block of code. The usages of crc32l , crc32w , and crc32b all jump back to the same test $0x7,%dil . In other words, the compiler is not smart enough to unroll the one-time uses for us. Or perhaps it is smart enough to do so but is so smart that it knows there would be a performance penalty. In the worst case, these blocks will only be entered 6 times (at most 7 misaligned bytes split between the front and the end of the range), so being a little more expensive in these cases doesn't hurt.

On the other hand, algorithm D always checks for alignment of 1, 2, and 4 before getting to the core crc32q loop. This pessimism comes with a price, even for those who provide correctly-aligned ranges; algorithm C does not penalize you unless your ranges are unaligned. The resulting assembly for crc32c is a meager 128 bytes, while crc32d weighs in at a massive 560 bytes. This difference means algorithm C puts less of a penalty on your instruction cache in the real world. Furthermore, C is shorter and more readable (to me). Basically, C is better than D in every aspect.