In almost all assembly books you’ll find some nice tricks to do fast multiplications. E.g. instead of “imul eax, ebx, 3” you can do “lea eax, [ebx+ebx*2]” (ignoring flag effects). It’s pretty clear how this works. But how can we speed up, say, a division by 3? This is quite important since division is still a really slow operation. If you have never thought or heard about this problem before, get pen and paper and try it for a bit. It’s an interesting problem.

The non-obvious trick here is: We can multiply by the inverse (i.e. the reciprocal). Maybe this sounds a little bit strange, since the inverse of 3 is what, 1/3? Yes, it is, over the rational numbers. But here we calculate over the finite ring Z/(2^32)Z (for the non-mathematical readers: we wrap around after 2^32). And in Z/(2^32)Z all odd elements have a multiplicative inverse. We can find the inverse with the extended Euclidean algorithm (gcdex):

For numbers a, b, gcdex(a, b) returns [g, x, y] where

g is the greatest common divisor of a and b

x, y are integers,

such that x*a + y*b = g

E.g. gcdex(3, 2^32) = [1, -1431655765, 1]

That means:

-1431655765 * 3 + 1 * 2^32 = 1

Since we are doing math modulo 2^32 (we have 32-bit registers), this is exactly what we need:

-1431655765 * 3 + 2^32 = 0xaaaa_aaab * 3 = 1 (mod 2^32)

So 1/3 = 0xaaaa_aaab (mod 2^32)
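As a minimal sketch, the computation above can be reproduced in Python. The language is chosen here purely for illustration, the function name gcdex just mirrors the notation used above, and the iterative formulation is only one common way to write the extended Euclidean algorithm:

```python
def gcdex(a, b):
    """Extended Euclidean algorithm: return (g, x, y) with x*a + y*b == g."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y


g, x, y = gcdex(3, 2**32)
inverse = x % 2**32            # reduce the negative coefficient into 0 .. 2^32 - 1

print(g, x, y)                 # 1 -1431655765 1
print(hex(inverse))            # 0xaaaaaaab
print((3 * inverse) % 2**32)   # 1
```

The reduction x % 2**32 is what turns the negative coefficient into the 32-bit constant that ends up in a register.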

Now, let’s look at what we have so far:

For numbers divisible by 3 this is already our solution. We just have to multiply them by 0xaaaa_aaab, since this is the inverse of 3. But in the general case we want this to work with all values: the remainder should simply be dropped, whereas with this solution any non-zero remainder destroys the result.
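To make this concrete, here is a small Python sketch that models a 32-bit multiplication by masking the result. The names MASK32, INV3 and mul_by_inverse are made up for this illustration:

```python
MASK32 = 0xFFFFFFFF   # keep only the low 32 bits, like a 32-bit register
INV3 = 0xAAAAAAAB     # inverse of 3 modulo 2^32


def mul_by_inverse(n):
    """Multiply by 1/3 in Z/(2^32)Z, i.e. keep only the low 32 bits."""
    return (n * INV3) & MASK32


print(mul_by_inverse(9))        # 3, because 9 is divisible by 3
print(hex(mul_by_inverse(7)))   # 0xaaaaaaad, nowhere near 7 div 3 = 2
```

So the plain modular inverse only gives the quotient when the division is exact.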

We need another trick: We look at the full 64-bit result and count the “overflows”. This is somewhat similar to fixed-point arithmetic. Look at this:

3 * 0xaaaa_aaab = 0x2_0000_0001, and that means

(n*3) * 0xaaaa_aaab = n * 2^33 + n

This is cool, because

(n*3 + c) * 0xaaaa_aaab = (n * 2^33 + n) + c * 0xaaaa_aaab

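With c in {0, 1, 2}, the extra term n + c * 0xaaaa_aaab always stays below 2^33 as long as n*3 + c fits into 32 bits, so a right shift by 33 throws it away and leaves exactly n. Writing n for the dividend from now on, we have the nice identity: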

(n * 0xaaaa_aaab) >> 33 = n div 3, for all 0 <= n < 2^32
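The decomposition behind this identity can be checked numerically. The following Python sketch (split_product is a helper name invented for this example) splits the 64-bit product at bit 33 and compares both parts with the formula above for a handful of dividends, including the largest 32-bit values:

```python
M = 0xAAAAAAAB


def split_product(d):
    """Split d * M into (high, low) so that d * M == high * 2**33 + low."""
    product = d * M
    return product >> 33, product & (2**33 - 1)


for d in (0, 1, 2, 3, 7, 1000, 2**31, 2**32 - 2, 2**32 - 1):
    q, c = divmod(d, 3)          # d = 3*q + c with c in {0, 1, 2}
    high, low = split_product(d)
    assert high == q             # the part above bit 33 is the quotient
    assert low == q + c * M      # the part below bit 33 is the "garbage" term
```

The second assertion only holds because q + c*M never reaches 2^33; otherwise it would carry into the quotient.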

Last but not least, our final solution for dividing by 3 is:

[IN: eax dividend]

mov ecx, 0xaaaa_aaab // 1/3

mul ecx // multiply with inverse

shr edx, 1 // shift by 33

[OUT: edx = eax div 3]
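For readers who want to try this without an assembler, here is a rough Python model of the three instructions. It is only a sketch: the function name div3 is made up, and the registers are modelled as plain integers under the assumption that mul leaves the 64-bit product in edx:eax.

```python
def div3(eax):
    """Model of the instruction sequence for an unsigned 32-bit dividend."""
    assert 0 <= eax < 2**32
    ecx = 0xAAAAAAAB             # mov ecx, 0xaaaa_aaab
    product = eax * ecx          # mul ecx: 64-bit product in edx:eax
    edx = product >> 32          # high half of the product
    eax = product & 0xFFFFFFFF   # low half, not needed afterwards
    edx >>= 1                    # shr edx, 1: 32 + 1 = 33 bits in total
    return edx


# Spot checks: small values, the extremes, and a coarse sweep of the range.
samples = list(range(1000)) + [2**31 - 1, 2**31, 2**32 - 2, 2**32 - 1]
samples += range(0, 2**32, 65537)
assert all(div3(n) == n // 3 for n in samples)
```

An exhaustive loop over all 2^32 inputs is possible too, but it is better done in a compiled language.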

Of course, this is only half of the story; this is not the general solution for all divisors. And it doesn’t handle signed division. For further reading see this paper or this gcc bug report. The technique also has its own chapter in the AMD optimization manual.

BTW: This is part of the reason why you have to learn all that math stuff in computer science. It’s useful! :-)