The 8-Byte Two-Step

While studying the illumos queue implementation I came across a line of code that left me quite perplexed.

```c
#define MQ_ALIGNSIZE	8	/* 64-bit alignment */
...
temp = (mqhp->mq_maxsz + MQ_ALIGNSIZE - 1) & ~(MQ_ALIGNSIZE - 1);
```

I knew it had something to do with alignment on a 64-bit boundary, but just how the hell was it doing it? Why is it subtracting one from the alignment size? Why is it negating the expression on the right? Why is it adding to the max message size? What the hell is going on?

At first I attempted to derive the pattern in my head but kept losing my place, forcing me to start over multiple times. After placing my pride on the bookshelf next to me, I wrote some C to help decode the mystery. The program breaks the expression into its three parts, printing the base-2 representation of each step. Here are some test runs with different sizes:

```
$ ./align_debug 1024
size: 00000000000000000000010000000000 (1024)
a:    00000000000000000000010000000111 (1031)
b:    11111111111111111111111111111000 (4294967288)
temp: 00000000000000000000010000000000 (1024)

$ ./align_debug 1023
size: 00000000000000000000001111111111 (1023)
a:    00000000000000000000010000000110 (1030)
b:    11111111111111111111111111111000 (4294967288)
temp: 00000000000000000000010000000000 (1024)

$ ./align_debug 9
size: 00000000000000000000000000001001 (9)
a:    00000000000000000000000000010000 (16)
b:    11111111111111111111111111111000 (4294967288)
temp: 00000000000000000000000000010000 (16)

$ ./align_debug 11
size: 00000000000000000000000000001011 (11)
a:    00000000000000000000000000010010 (18)
b:    11111111111111111111111111111000 (4294967288)
temp: 00000000000000000000000000010000 (16)
```

Breaking apart the expression and testing several sizes revealed the pattern to me. Pause for a moment, do you see it?

Don’t worry if you couldn’t spot the pattern; I’m going to explain it right now. I’ve decided to call it the “8-Byte Two-Step” because it does its job in two steps: overflow and truncate.

Overflow: add 7 to the size to make sure it either hits the next alignment boundary or lies past it.

Truncate: drop the low-order bits. After the first step you know you are at the next boundary or beyond it, so you simply shed the three low bits to land on an 8-byte boundary.

Still don’t follow? That’s okay, everyone learns differently. Perhaps a visual explanation will help.

In figure 1 I reduced the bit space down to a single byte to simplify matters. The header, in small print, marks the bit positions. The first row shows the powers of two. The next two rows represent the two cases we have to deal with: ❶ the size already lies on an 8-byte boundary, or ❷ it doesn’t lie on a boundary. The last two rows show the binary representation of seven as well as its negation.

Using this diagram, there is only one rule to remember: for a size to be 8-byte aligned, only the green bit positions may be turned on, and any combination of green bits will do. Don’t believe me? Try it for yourself. Take ❶ and randomly fill in a one or zero for each question mark. Add up the powers of 2 for the bits you turned on and divide by eight; there should be no remainder.

As a dual to this rule, the red bits can never be turned on. Any combination of red bits can never result in a number that is divisible by eight. Funny, the rule and its dual look a lot like the bitwise-negation of seven and seven, respectively. Adding seven to ❷ will push a one into the green bits causing the least significant green zero bit to turn on and any green one bits below it to turn off. Prove this to yourself by writing out the bit patterns for 26 and 33. Basically, adding seven to the size will add an eight to the green bits (unless the size is already on a boundary). Once that is done the final step is to bitwise-and with negated seven to make sure that only the green bits are turned on.

It turns out this pattern is not called “The 8-Byte Two-Step” or “The Green Bit Rule”, but what great names they are. No, as my friend Steve Vinoski pointed out, bit tricks like this have been in use for a long time. In fact, you can find this trick, and many others like it, in the venerable Hacker’s Delight [War03].

If you flip to section 3-1 of Hacker’s Delight you’ll see the exact pattern described above. In fact, as the section title indicates, it is more general than alignment on an 8-byte boundary: “Rounding Up/Down to a Multiple of a Known Power of 2”. The pattern works for alignment on any power of 2. This is also why the green bit rule works: every power of 2 above the alignment size is a multiple of that size, and adding two or more multiples of a size yields another multiple of that size. Math is a wondrous thing.

Interestingly, the illumos code is actually a mix of two different patterns shown in Hacker’s Delight. In the book the author shows one pattern for finding the boundary based on the power of 2 itself and another based on the log base 2 of that value. For example, if the boundary were 8 then its log base 2 would be 3. The illumos code could have been written more simply:

```c
temp = (mqhp->mq_maxsz + MQ_ALIGNSIZE - 1) & -MQ_ALIGNSIZE;
```

But more important than how it is written is why it was written this way at all.

A key performance trait of a CPU is its frequency: the number of cycles per second. For example, my development machine contains an Intel i7-3970X, which runs six cores at 3.50GHz. That’s six cores, each ticking three-and-a-half billion times a second. Ideally, a chip would complete an instruction every cycle, but not all instructions are created equal: some cost more cycles than others. The fewer cycles an algorithm uses, the faster it can run.

This pattern, then, is an efficient way to align a number to a multiple of a power of two. But efficient compared to what? How else might you implement this algorithm? I came up with two solutions.

```c
#include <math.h>

#define ALIGNSZ 8

unsigned int
align_1(unsigned int x)
{
	return (x + ALIGNSZ - 1) & ~(ALIGNSZ - 1);
}

unsigned int
align_2(unsigned int x)
{
	unsigned int boundary = ALIGNSZ;

	while (x > boundary) {
		boundary += ALIGNSZ;
	}
	return boundary;
}

unsigned int
align_3(unsigned int x)
{
	return ALIGNSZ * ceil((double)x / ALIGNSZ);
}
```

The align_1 function is the original 8-Byte Two-Step implementation. The align_2 function adds eight to a running boundary until that boundary is greater than or equal to x. The align_3 function uses division to avoid the loop in align_2, but requires ceil and a multiplication to put the answer back in terms of bytes. One might think the third is the best since it uses only three operations while the first uses five, but looks can be deceiving. Remember, C is a high-level language. Let’s take a peek at the assembler.

ASM generated with: gcc -std=c99 -m32 -S -masm=intel -Wall -O0 align.c

```
align_1:
	push	ebp				; 1
	mov	ebp, esp			; 1
	mov	eax, DWORD PTR [ebp+8]		; 2
	add	eax, 7				; 1
	and	eax, -8				; 1
	pop	ebp				; 2
	ret					; ?
;; TOTAL LATENCY: 8 + ?

align_2:
	push	ebp				; 1
	mov	ebp, esp			; 1
	sub	esp, 16				; 1
	mov	DWORD PTR [ebp-4], 8		; ?
	jmp	.L4				; 0
.L5:
	add	DWORD PTR [ebp-4], 8		; 6
.L4:
	mov	eax, DWORD PTR [ebp+8]		; 2
	cmp	eax, DWORD PTR [ebp-4]		; ?
	ja	.L5				; ?
	mov	eax, DWORD PTR [ebp-4]		; 2
	leave					; ?
	ret					; ?
;; TOTAL LATENCY: (size/8)*(13 + 5?)

align_3:
	push	ebp				; 1
	mov	ebp, esp			; 1
	sub	esp, 40				; 1
	mov	eax, DWORD PTR [ebp+8]		; 2
	mov	edx, 0				; ?
	mov	DWORD PTR [ebp-24], eax		; 3
	mov	DWORD PTR [ebp-20], edx		; 3
	fild	QWORD PTR [ebp-24]		; 6
	fstp	QWORD PTR [ebp-16]		; 5
	fld	QWORD PTR [ebp-16]		; 3
	fld	TBYTE PTR .LC0			; 4
	fdivp	st(1), st			; 10
	fstp	QWORD PTR [ebp-16]		; 5
	fld	QWORD PTR [ebp-16]		; 3
	fstp	QWORD PTR [esp]			; 5
	call	ceil				; ?
	fld	TBYTE PTR .LC0			; 4
	fmulp	st(1), st			; 5
	fnstcw	WORD PTR [ebp-26]		; 5
	movzx	eax, WORD PTR [ebp-26]		; ?
	mov	ah, 12				; ?
	mov	WORD PTR [ebp-28], ax		; 3
	fldcw	WORD PTR [ebp-28]		; 8
	fistp	QWORD PTR [ebp-24]		; 7
	fldcw	WORD PTR [ebp-26]		; 8
	mov	eax, DWORD PTR [ebp-24]		; 2
	mov	edx, DWORD PTR [ebp-20]		; 2
	leave					; ?
	ret					; ?
;; TOTAL LATENCY: 96 + 6?
```

I’ve curated the assembly, removing irrelevant lines and adding a latency cost to each instruction. I used Agner Fog’s instruction tables to determine the minimum latency of each instruction [Fog14]. The numbers represent the minimum number of cycles an instruction uses in an ideal execution; a question mark means there was no entry for the instruction. All numbers are for the Intel Sandy Bridge, as that is what my development machine runs.

It turns out, for any reasonably sized x (the size we want to round up to the nearest boundary), the align_1 function performs the best and align_2 performs the worst. In fact, for a size as small as 1024, the align_2 function will have a latency three orders of magnitude greater than align_1. Using fewer cycles is good not only because it means less work; I also posit it reduces the chance of being interrupted, which can lead to further latency costs like cache misses.

These theoretical numbers are great, but I want some hard numbers. Time for some benchmarks.

I created a separate C file to benchmark each of the three alignment functions: align_1_bench.c, align_2_bench.c, align_3_bench.c, and align_bench.h. There is also a control benchmark, align_control_bench.c, used as a baseline to account for any constant overhead. Each benchmark runs one million times, passing 1026u as the size to align. Intel-specific assembler is used to count the number of cycles for each invocation and print that number to stdout. Finally, the redirected output is turned into a histogram via a mix of sort, uniq, and awk.

Benchmarks Compiled With gcc -std=c99 -m32 -lm -Wall -O0

```
Control           Align 1           Align 2           Align 3
Cycles    Count   Cycles    Count   Cycles    Count   Cycles    Count
112     333,677   112          43   912           2   196     389,942
116     624,414   116     816,038   916     122,676   200     609,995
120      41,860   120     183,855   920     212,176   204           2
124           9   124           3   924      13,760   212           2
128           7   128           5   928      75,642   216           1
132          10   132           9   932     205,533     ⋮           ⋮
136           7   136          13   936     190,635   264           1
140           5   140          13   940         481   268           1
144           3   144           2   944           2   284          20
148           2   148           5     ⋮           ⋮   288          11
152           1   152           5   1,036        22   304           2
156           1   156           5   1,040     2,295   308           1
164           1   160           1   1,044   111,111     ⋮           ⋮
192           1   164           1   1,048    65,609   376           1
240           1   192           1     ⋮           ⋮   408           1
268           1   9,672         1   14,584        1   11,320        1
```

Interestingly, the results are not only close to my predictions but actually lower (perhaps due to ILP or some other CPU voodoo). The baseline uses 112–116 cycles per invocation. The first method uses 116–120 cycles, 4 more than the baseline. The second method stays within 916–1,048 cycles, roughly 800–900 more than baseline. And finally, the third method takes 196–200 cycles, 84 more than baseline.

The nugget in this data is the fact that align_2 is multimodal (this is why histograms are so important). This lends some credibility to my theory that the more cycles a function uses the more it is susceptible to variance from outside forces. But for me to be scientific about it much more data would have to be gathered. For now, I’m happy to leave it a conjecture.

I know what some of you are thinking: why care about 100 cycles when the CPU is ticking at 3.5 billion a second? You shouldn’t, in most cases. But when writing code that is invoked millions of times a second, an order-of-magnitude overhead may become important. As load increases this overhead adds up like compound interest, reducing the total amount of work your machine can service in an acceptable response time. For example, queueing theory tells us that for an M/D/1 model, mean response time rises very quickly after 60% utilization [Gre13, §2.6.5]. The longer your mean response time, the faster you become saturated and the longer requests take; it’s a vicious cycle. That said, this optimization is not needed in the illumos mqueue code. If you are creating queues at ludicrous speed, this is not going to be the overhead that bites you first. My guess is that this was done more as an idiom of systems programming than as an optimization.