December 09, 2019

nullprogram.com/blog/2019/12/09/

There’s an old saying, let the knife do the work. Whether preparing food in the kitchen or whittling a piece of wood, don’t push your weight into the knife. Not only is it tiring, you’re much more likely to hurt yourself. Use the tool properly and little force will be required.

The same advice also often applies to compilers.

Suppose you need to XOR two, non-overlapping 64-byte (512-bit) blocks of data. The simplest approach would be to do it a byte at a time:

/* XOR src into dst */ void xor512a ( void * dst , void * src ) { unsigned char * pd = dst ; unsigned char * ps = src ; for ( int i = 0 ; i < 64 ; i ++ ) { pd [ i ] ^= ps [ i ]; } }

Maybe you benchmark it or you look at the assembly output, and the results are disappointing. Your compiler did exactly what you asked of it and produced code that performs 64 single-byte XOR operations (GCC 9.2.0, x86-64, -Os ):

xor512a: xor eax , eax .L0: mov cl , [ rsi + rax ] xor [ rdi + rax ], cl inc rax cmp rax , 64 jne .L0 ret

The target architecture has wide registers so it could be doing at least 8 bytes at a time. Since your compiler isn’t doing it, you decide to chunk the work into 8 byte blocks yourself in an attempt to manually implement a chunking operation. Here’s some real world code that does so:

/* WARNING: Broken, do not use! */ void xor512b ( void * dst , void * src ) { uint64_t * pd = dst ; uint64_t * ps = src ; for ( int i = 0 ; i < 8 ; i ++ ) { pd [ i ] ^= ps [ i ]; } }

You check the assembly output of this function, and it looks much better. It’s now processing 8 bytes at a time, so it should be about 8 times faster than before.

xor512b: xor eax , eax .L0: mov rcx , [ rsi + rax * 8 ] xor [ rdi + rax * 8 ], rcx inc rax cmp rax , 8 jne .L0 ret

Still, this machine has 16-byte wide registers (SSE2 xmm ), so there could be another doubling in speed. Oh well, this is good enough, so you plug it into your program. But something strange happens: The output is now wrong!

int main ( void ) { uint32_t dst [ 32 ] = { 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 }; uint32_t src [ 32 ] = { 1 , 4 , 9 , 16 , 25 , 36 , 49 , 64 , 81 , 100 , 121 , 144 , 169 , 196 , 225 , 256 , }; xor512b ( dst , src ); for ( int i = 0 ; i < 16 ; i ++ ) { printf ( "%d

" , ( int ) dst [ i ]); } }

Your program prints 1..16 as if xor512b() was never called. You check over everything a dozen times, and you can’t find anything wrong. Even crazier, if you disable optimizations then the bug goes away. It must be some kind of compiler bug!

Investigating a bit more, you learn that the -fno-strict-aliasing option also fixes the bug. That’s because this program violates C strict aliasing rules. An array of uint32_t was accessed as a uint64_t . As an important optimization, compilers are allowed to assume such variables do not alias and generate code accordingly. Otherwise every memory store could potentially modify any variable, which limits the compiler’s ability to produce decent code.

The original version is fine because char * , including both signed and unsigned , has a special exemption and may alias with anything. For the same reason, using char * unnecessarily can also make your programs slower.

What could you do to keep the chunking operation while not running afoul of strict aliasing? Counter-intuitively, you could use memcpy() . Copy the chunks into legitimate, local uint64_t variables, do the work, and copy the result back out.

void xor512c ( void * dst , void * src ) { for ( int i = 0 ; i < 8 ; i ++ ) { uint64_t buf [ 2 ]; memcpy ( buf + 0 , ( char * ) dst + i * 8 , 8 ); memcpy ( buf + 1 , ( char * ) src + i * 8 , 8 ); buf [ 0 ] ^= buf [ 1 ]; memcpy (( char * ) dst + i * 8 , buf , 8 ); } }

Since memcopy() is a built-in function, your compiler knows its semantics and can ultimately elide all that copying. The assembly listing for xor512c is identical to xor512b , but it won’t go haywire when integrated into a real program.

It works and it’s correct, but you can still do much better than this!

Letting your compiler do the work

The problem is you’re forcing the knife and not letting it do the work. There’s a constraint on your compiler that hasn’t been considered: It must work correctly for overlapping inputs.

char buf [ 74 ] = {...}; xor512a ( buf , buf + 10 );

In this situation, the byte-by-byte and chunked versions of the function will have different results. That’s exactly why your compiler can’t do the chunking operation itself. However, you don’t care about this situation because the inputs never overlap.

Let’s revisit the first, simple implementation, but this time being smarter about it. The restrict keyword indicates that the inputs will not overlap, freeing your compiler of this unwanted concern.

void xor512d ( void * restrict dst , void * restrict src ) { unsigned char * pd = dst ; unsigned char * ps = src ; for ( int i = 0 ; i < 64 ; i ++ ) { pd [ i ] ^= ps [ i ]; } }

(Side note: Adding restrict to the manually chunked function, xor512b() , will not fix it. Using restrict can never make an incorrect program correct.)

Compiled with GCC 9.2.0 and -O3 , the resulting unrolled code processes 16-byte chunks at a time ( pxor ):

xor512d: movdqu xmm0 , [ rdi + 0x00 ] movdqu xmm1 , [ rsi + 0x00 ] movdqu xmm2 , [ rsi + 0x10 ] movdqu xmm3 , [ rsi + 0x20 ] pxor xmm0 , xmm1 movdqu xmm4 , [ rdi + 0x30 ] movups [ rdi + 0x00 ], xmm0 movdqu xmm0 , [ rdi + 0x10 ] pxor xmm0 , xmm2 movups [ rdi + 0x10 ], xmm0 movdqu xmm0 , [ rdi + 0x20 ] pxor xmm0 , xmm3 movups [ rdi + 0x20 ], xmm0 movdqu xmm0 , [ rsi + 0x30 ] pxor xmm0 , xmm4 movups [ rdi + 0x30 ], xmm0 ret

Compiled with Clang 9.0.0 with AVX-512 enabled in the target ( -mavx512bw ), it does the entire operation in a single, big chunk!

xor512d: vmovdqu64 zmm0 , [ rdi ] vpxorq zmm0 , zmm0 , [ rsi ] vmovdqu64 [ rdi ], zmm0 vzeroupper ret

“Letting the knife do the work” means writing a correct program and lifting unnecessary constraints so that the compiler can use whatever chunk size is appropriate for the target.