Hacks with integer arithmetic are always fun, especially when they’re not too hard to understand (because some are really strange and make you wonder what the author was thinking when he wrote them). One such simple hack is to replace divisions or multiplications by a series of shifts and additions.

However, these hacks make a lot of assumptions that aren’t necessarily verified on the target platform: that complex instructions such as mul and div are very slow compared to adds, right shifts, or left shifts; that the execution time of a shift depends only loosely on the number of bits shifted; and that the compiler is not smart enough to generate the appropriate code for you.

Of course, these assumptions may or may not hold on your target platform, and it is important to assess the goodness of any optimization with a stopwatch in hand (anyway, more on this later). Let us come back to our shifts.

Let’s say we want to multiply 16-bit integers by $\frac{2}{7}$. This isn’t a nice fraction we can replace by a simple shift right away; it’s conspicuously not a simple power of 2. But it is a sum of powers of two. Indeed, if we look at it differently,

$\frac{2}{7} = 0.\overline{010}_2 = 2^{-2} + 2^{-5} + 2^{-8} + 2^{-11} + 2^{-14} + \cdots$

To each 1 in the binary expansion of $\frac{2}{7}$ corresponds a shifted value of $x$:

$\frac{2}{7}x \approx (x \gg 2) + (x \gg 5) + (x \gg 8) + (x \gg 11) + (x \gg 14) + \cdots$

This suggests the direct (but naïve) approximation (x>>2)+(x>>5)+... That actually works, but may not be very precise: C shift operators simply truncate the fractional bits (it’s an integer operation, after all), so bits that should contribute to the sum get shifted out. The first impression is that it’s just a rounding error (meaning, the last bit or so), but it can yield a large error.
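Spelled out in full, a minimal sketch of this naïve version could be (the function name, shifts_naive, is mine):

inline int shifts_naive(uint16_t x)
{
    // one term for each 1 bit in the binary expansion of 2/7
    return (x >> 2) + (x >> 5) + (x >> 8) + (x >> 11) + (x >> 14);
}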

The second thing you may try is to scale x before shifting. One way of doing a ‘free’ scaling is to use a trick similar to what I used for an alternative line algorithm; that is, we’ll use the language to get a shift-free scaling:

typedef union
{
    // endian-dependent!
    struct { uint16_t lo, hi; };
    uint32_t s;
} fixed_t;

Using this structure, setting lo to 0 and hi to x promotes x to a 32-bit value shifted left by 16 bits. The whole method could look like:

inline int shifts_more(uint16_t x)
{
    typedef union
    {
        // endian-dependent!
        struct { uint16_t lo, hi; };
        uint32_t s;
    } fixed_t;

    fixed_t scaled = {0, x}; // C99 extensions?
    fixed_t sum = {0};

    sum.s += scaled.s >> 2;
    sum.s += scaled.s >> 5;
    sum.s += scaled.s >> 8;
    sum.s += scaled.s >> 11;
    sum.s += scaled.s >> 14; // the next one, >> 17, gives zero!

    return sum.hi;
}
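As the comment warns, this layout assumes a little-endian machine, where lo, the low-order half, comes first in memory; on a big-endian target, the two fields of the struct would have to be swapped.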

Using this version, all bits contribute to the result and the (relative) error is reduced to its minimum. We could have used an actual shift rather than a union:

inline int shifts_less(uint16_t x)
{
    const int scale = 4;
    uint32_t sx = (x << scale);
    uint32_t s = 0;

    s += sx >>= 2;
    s += sx >>= 3;
    s += sx >>= 3;
    s += sx >>= 3;
    s += sx >>= 3;

    return (s >> scale);
}

Here you can control time versus precision by changing the value of scale: the in-place shifts accumulate, so the five terms are still shifted by 2, 5, 8, 11, and 14 in total, but they are computed on the pre-scaled value, so fewer bits fall off the end. Or you can look at it the other way ’round:

inline int shifts_left(uint16_t x)
{
    int tx = x; // promote to 32 bits
    int s = tx;      // 14
    s += (tx <<= 3); // 11
    s += (tx <<= 3); // 8
    s += (tx <<= 3); // 5
    s += (tx <<= 3); // 2
    return (s >> 14);
}
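The comments give the weight of each term once the closing shift by 14 is applied; the running total works out to x*(1+8+64+512+4096) = 4681x, and 4681 is exactly 0x1249.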

Or you can use an integer multiply:

inline int shifts_mult(uint16_t x)
{
    // 01 0010 0100 1001 = 0x1249
    return (int(x) * 0x1249) >> 14;
}
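(The constant is just the fraction scaled and truncated: $\lfloor \frac{2}{7} \cdot 2^{14} \rfloor = 4681 = \mathtt{0x1249}$.)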

So which one’s better? Precision-wise, they’re all the same, except for the naïve version, which has a higher error, and for shifts_less, whose precision depends on the scaling constant. Over all 16-bit integers, the average error (the difference between the float version and the integer version) is given by:

Method    Average error
Naïve     2.07144
Others    0.214264
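For what it’s worth, a minimal sketch of such an exhaustive measurement could look like this (the reference value shown here, 2x/7 computed in double precision, is my assumption about the test setup):

#include <cstdint>
#include <cstdio>
#include <cmath>

int main()
{
    double total = 0;
    for (uint32_t x = 0; x <= 65535; x++)
    {
        double exact = 2.0 * x / 7.0;         // reference value
        int approx = (int(x) * 0x1249) >> 14; // Mult method; swap in any other
        total += std::fabs(exact - approx);
    }
    printf("average error: %f\n", total / 65536);
    return 0;
}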

So the other methods have errors basically 10 times smaller. Which is good. What about speed? Well, that’s where the catch is. Let us compare, again, g++ against icc, the two compilers I use. On a T9600 Core 2 CPU (the one from my MacBook Pro), the times (in microseconds per 65536 calls) are:

Method    icc    gcc
Naïve     192    179
More      229    192
Less      225    204
Left      189    142
Mult      166    165
Float     236    236
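(The actual benchmark harness isn’t shown here; a rough sketch of the kind of loop involved, with gettimeofday standing in for whatever timer was really used, might be:)

#include <sys/time.h>
#include <cstdint>
#include <cstdio>

// wall-clock time in microseconds
static double now_us()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main()
{
    volatile int sink = 0; // keeps the loop from being optimized away
    double start = now_us();
    for (uint32_t x = 0; x <= 65535; x++)
        sink += (int(x) * 0x1249) >> 14; // Mult method; swap in the others
    double stop = now_us();
    printf("%f us for 65536 calls\n", stop - start);
    return 0;
}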

Let’s see what we have here. G++ seems to be a bit more efficient than icc on this series of tests. The Float method merely uses a float to perform the multiply and truncates to int; it is included as the control, for fair comparison. The Mult method also performs about the same with both compilers; unsurprisingly, they generate exactly the same code. G++ has the clear advantage for the Left method. Why? What happened?

Examining the code, we make a stunning discovery. Icc generates the following:

<_Z11shifts_leftt>:
  0f b7 c7                movzx  eax,di
  8d 14 c5 00 00 00 00    lea    edx,[rax*8]
  03 d0                   add    edx,eax
  89 c6                   mov    esi,eax
  c1 e6 06                shl    esi,0x6
  03 f2                   add    esi,edx
  89 c1                   mov    ecx,eax
  c1 e1 09                shl    ecx,0x9
  c1 e0 0c                shl    eax,0xc
  03 f1                   add    esi,ecx
  03 c6                   add    eax,esi
  c1 e8 0e                shr    eax,0xe
  c3                      ret

which is somewhat expected, though we don’t know why it prefers a few long shifts rather than a series of shorter shift-lefts. Gcc figured it out a lot better:

<_Z11shifts_leftt>:
  0f b7 c7                movzx  eax,di
  69 c0 49 12 00 00       imul   eax,eax,0x1249
  c1 f8 0e                sar    eax,0xe
  c3                      ret
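That is, gcc recognized the chain of shifts and adds as a multiplication by 0x1249 and emitted the imul directly.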

(The thing I can’t explain right now is why the shifts_left and shifts_mult versions, which compile to the exact same code, have different timings; and it’s not OS noise, because the timings differ quite a bit and the results are repeatable.)

Despite GCC generating the same code on my AMD4000 box (yes, I know, I should get a newer computer), the run-time characteristics are quite different. Of course, everything is slower, but the float version is much slower, taking about twice the time of More.

*

* *

So what is the moral of all this? First, don’t trust your compiler. The “optimizing” compiler didn’t optimize all that well, but the “non-optimizing” compiler did. Second, implementations that you think are roughly equivalent may be quite different speed-wise, such as, say, the Less and Left methods; it turns out Left is much faster with both compilers. Third, test on the target platform(s): you may be surprised.
