Modulus without Division, a tutorial Part of the Arithmetic Tutorial Collection

by Douglas W. Jones

T HE U NIVERSITY OF I OWA Department of Computer Science Copyright © 2001, Douglas. W. Jones. This work may be transmitted or stored in electronic form on any computer attached to the Internet or World Wide Web so long as this notice is included in the copy. Individuals may make single copies for their own use. All other rights are reserved.

Index

The ususal way to compute a mod m is to take the remainder after integer division. This is straightforward when the operands are within the range of the available divide hardware, but the divide operation is frequently among the very slowest arithmetic operations available, some small microcontrollers have no divide hardware, and it is occasionally necessary to divide very large numbers outside the range that can be done using the available hardware.

When the modulus m is constant, even where there is a hardware divide instruction, it can be faster to take the modulus directly than to use the divide instruction. These tricks become even more valuable on machines without a hardware divide instruction or where the numbers involved are out of range.

There are common tricks to check divisibility that many students learn in elementary school:

To test for divisibility by 10 see if the least significant decimal digit is zero. To test for divisibility by 2 see if the least significant decimal digit is even. To test for divisibility by 5, see if the least significant decimal digit is 0 or 5. To test for divisibility by 3, if the sum of the decimal digits is divisible by 3, the original number is also divisible by 3. To test for divisibility by 9, if the sum of the decimal digits is divisible by 9, the original number is also divisible by 9.

All of the above rules are actually special cases of a single general rule. Given that

a is represented in number base b a mod m = ( (b mod m)(a/b) + (a mod b) ) mod m

In the case of divisibility by 2, 5 and 10 for base 10, the term (b mod m) is zero because 2, 5 and 10 all divide evenly into 10. As a result, the divisibility test simplifies to asking whether (a mod b), that is, the least significant digit of the number, is evenly divisible.

In the case of divisibility by 3 or 9 in base 10, the term (b mod m) is one. As a result, the multiplier for the first term is one. Applying the formula recursively leads to the simple sum of the digits.

Computing modulus for poweres of two is trivial on a binary computer, the term (b mod m) is zero, so we just take the modulus by examining the least significant bits of the binary representation:

a mod 2i = a & (2i–1)

Thus, for a mod 2, we use a & 1, for a mod 4, we use a & 3, and for a mod 8, we use a & 7.

Recall that the & operator means logical and. When applied to integers, this computes each bit of the result as the and of the corresponding bits of the operands. For all nonzero positive integers i, the binary representation of 2i–1 consists of i consecutive one bits, so anding with 2i–1 preserves the least significant i bits of the operand while forcing all more significant bits to zero.

The problem is more interesting when the modulus is not a power of two.

Consider the problem of computing a mod 3 on a binary computer. Note that 4 mod 3 is 1, so:

a mod 3 = ( (a/4) + (a mod 4) ) mod 3

That is, a mod 3 can be computed from the sum of the digits of the number in base 4. Base 4 is convenient because each base 4 digit of the number consists of 2 bits of the binary represenation; thus a mod 4 can computed using a & 3 and a / 4 can be computed using a >> 2.

The number 3 is a Mersenne number, that is, one less than a power of two. The property noted above is true of all Mersenne numbers. Thus, we can compute a mod 7 or a mod 15 on a binary computer using

a mod 7 = ( (a/8) + (a mod 8) ) mod 7

a mod 15 = ( (a/16) + (a mod 16) ) mod 15

Recall that a >> b shifts the binary representation of a left a total of b places. As with logical and, this is a very inexpensive operation on a binary computer, and the effect is the same as dividing a by 2b.

Back to the problem of computing a mod 3: Summing the base-4 digits of a number may produce results significantly larger than 3, but we can deal with this by summing the digits of the sum as many times as required to bring the result down close to 4. Expressing this as an algorithm using C, we get this code:

unsigned int mod3( unsigned int a ) { while (a > 5) { int s = 0; /* accumulator for the sum of the digits */ while (a != 0) { s = s + (a & 3); a = a >> 2; } a = s; } /* note, at this point: a < 6 */ if (a > 2) a = a - 3; return a; }

The terminating condition in the above code represents a small optimization. Instead of repeatedly summing the digits until the total is a single base 4 digit in the range 0 to 3, this code stops as soon as the sum is under 6. This is because the final mod operation is done by comparing with 3 and subtracting if out of range; this compare and subtract operation can deal with any value up to twice the modulus.

On a 3 Ghz dual-core Intel Core i3 machine, this first version averages 54 nanoseconds slower than a subroutine with the minimalist body " return a%3 " — relying on the built-in mod operation. Of course, this code completely ignores the parallelism available on a computer with registers much wider than the 2 bits in a base-4 digit. There is a well known trick for fast implementation of the bitcount operator that we can easily apply to this problem. In the slow version, bitcount is computed by simply summing the base-2 digits of the number. Here is the classic fast version, written for a 32-bit computer:

uint32_t bitcount( uint32_t a ) { a = ((a >> 1) & 0x55555555) + (a & 0x55555555) /* each 2-bit chunk sums 2 bits */ a = ((a >> 2) & 0x33333333) + (a & 0x33333333) /* each 4-bit chunk sums 4 bits */ a = ((a >> 4) & 0x0F0F0F0F) + (a & 0x0F0F0F0F) /* each 8-bit chunk sums 8 bits */ a = ((a >> 8) & 0x00FF00FF) + (a & 0x00FF00FF) /* each 16-bit chunk sums 16 bits */ return ((a >> 16) ) + (a & 0x0000FFFF) }

This code is shockingly un-intutitive but very fast, particularly when inside an inner loop where the constants can be pre-loaded in registers and remain unchanged for the duration. Note, to add to the obscurity, that we can rewrite this code with slightly different constants and it will still remain correct:

uint32_t bitcount( uint32_t a ) { a = ((a >> 1) & 0x55555555) + (a & 0x55555555) a = ((a >> 2) & 0x33333333) + (a & 0x33333333) a = ((a >> 4) & 0x07070707) + (a & 0x07070707) a = ((a >> 8) & 0x000F000F) + (a & 0x000F000F) return ((a >> 16) ) + (a & 0x0000001F) }

The constants used in the above version reflect the maximum values in each field of the result at that stage of the computation. A sum of the one bits in a 4-bit field will be no greater than 4, which can be represented in 3 bits and so is selected by the mask value 7. A sum of the one bits in an 8-bit field will be no greater than 8, which can be represented in 4 bits and so is selected by the mask value F 16 . A sum of the one bits in a 16-bit field will be no greater than 16, which can be represented in 5 bits and so is selected by the mask value 1F 16 .

This trick can be applied directly to the problem of computing the sum of the base 4 digits in a number. In our original algorithm with two nested loops, we can replace the inner loop to arrive at the following much faster code:

uint32_t mod3( uint32_t a ) { while (a > 5) { a = ((a >> 2) & 0x33333333) + (a & 0x33333333); a = ((a >> 4) & 0x07070707) + (a & 0x07070707); a = ((a >> 8) & 0x000F000F) + (a & 0x000F000F); a = ((a >> 16) ) + (a & 0x0000001F); } /* note, at this point: a < 6 */ if (a > 2) a = a - 3; return a; }

The above version uses the optimized constants, although the range of values in each field is now a bit larger: A sum of two base-4 digits will be no greater than 6, which can still be represented in 3 bits and so is selected by the mask value 7. A sum of four base-4 digits will be no greater than 12, which can still be represented in 4 bits and so is selected by the mask value F 16 . A sum of eight base-4 digits will be no greater than 24, which can still be represented in 5 bits and so is selected by the mask value 1F 16 .

This code is far faster than the first algorithm we gave. On a 3 Ghz dual-core Intel Core i3 machine, this improved code averages 21 nanoseconds slower than a subroutine using the built-in mod operation, less than half the overhead of the original code presented above.

The exercise we went through to identify the maximum values in each field is not pointless, but it is hard work. A careful worst-case analysis shows that the outer loop will iterate at most 3 times. The input FFFFFFEE 16 , for example, leads to 3 iterations. Worst-case analysis also reveals the range of values that are input to each step, allowing th subsequent iteration to be simplified. Unrolling the loop completely leads us to the following code:

uint32_t mod3( uint32_t a ) { a = ((a >> 2) & 0x33333333) + (a & 0x33333333); a = ((a >> 4) & 0x07070707) + (a & 0x07070707); a = ((a >> 8) & 0x000F000F) + (a & 0x000F000F); a = ((a >> 16) ) + (a & 0x0000001F); /* note, at this point: a <= 48, 3 base-4 digits */ a = ((a >> 2) & 0x33333333) + (a & 0x33333333); a = ((a >> 4) ) + (a & 0x07070707); /* note, at this point: a <= 8, 2 base-4 digits */ a = ((a >> 2) ) + (a & 0x33333333); /* note, at this point: a <= 4, 2 base-4 digits */ if (a > 2) a = a - 3; return a; }

The payoff for this optimization is significant. On a 3 Ghz dual-core Intel Core i3 machine, this improved code averages 10.4 nanoseconds slower than a subroutine using the built-in mod operation. This is under 1/5 the overhead of our original code and half the overhead of our first attempt at optimizing.

The eccentric constants used in the above code are annoying, consuming registers and taking time to load. Some machines have a truncate instruction that is equivalent to the following C code:

r = r & ((1 << b) - 1);

That is, the truncate instruction preserves the b least significant bits of the register r while setting the more significant bits of r to zero. Recall that the << operator means left-shift, so the expression 1 << b is equivalent to 2b and therefore (1 << b)–1 has just b ones. A good C compiler should use this instruction whenever to implement the and operation whenever the binary representation of the operand is a string of b one bits, such as the constants 3, 7, 15 or 31. Even when such an instruction is not available, a sequence of two shift operations can be used to do the same thing. On a 32-bit machine, for example, the following sequence of two shifts will frequently truncate the value in a register in less time than it takes to first load a long constant.

r = r << (32 - b) r = r >> (32 - b)

To take advantage of the possibility of fast truncation, we must rewrite our code so that the constants used to select partial words have binary representations that are all ones. We can do this using the fact that not only is a mod 3 easy to compute by summing base 4 digits, but also by summing the digits in base 4i for any positive integer i. So, while we can sum the digits in base 22 (that is, 41), we can also sum them in bases 24 (that is, 42), 26 (that is, 43), 28 (that is, 44), and so on. This leads to the following solution:

uint32_t mod3( uint32_t a ) { a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits a <= 0x1FFFE */ a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits a <= 0x2FD */ a = (a >> 4) + (a & 0xF); /* sum base 2**4 digits a <= 0x3C; worst case 0x3B */ a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x1D; worst case 0x1B */ a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x9; worst case 0x7 */ a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x4 */ if (a > 2) a = a - 3; return a; }

Again, working out the worst cases in order to determine whether we were within reach of the final subtract operation was time consuming. The payoff is significant, but we are approaching the limit of what we can accomplish with this kind of trickery. On a 3 Ghz Intel dual-core Intel Core i3 machine, this improved code averages 6.8 nanoseconds slower than a subroutine using the built-in mod operation. This is about 1/9 the overhead of our original code.

It is fair to ask, have we been two conservative? The above code includes three repeats of this statement:

a = (a >> 2) + (a & 0x3);

In Dec. 2014, Tim Buktu [tbuktu@hotmail.com] pointed out that the code given here can be used to carry out SIMD computations quite efficiently. Consider, for example, taking an array of 4 8-bit integers packed into a 32-bit word. The following function will apply the mod3 operator to all 4 integers without ever unpacking them:

uint32_t SIMDmod3( uint32_t a ) { a = ((a >> 4) & 0x0F0F0F0F) + (a & 0x0F0F0F0F); /* sum base 2**4 digits; a <= 0x1E1E1E1E */ a = ((a >> 2) & 0x0F0F0F0F) + (a & 0x03030303); /* sum base 2**2 digits; a <= 0x09090909 */ a = ((a >> 2) & 0x03030303) + (a & 0x03030303); /* sum base 2**2 digits; a <= 0x04040404 */ a = ((a >> 2) & 0x03030303) + (a & 0x03030303); /* sum base 2**2 digits; a <= 0x03030303 */ /* again, we can't use if(a == 3) a = 0 on each component, so we cheat */ a = ((((a + 1) >> 2) & 0x03030303) + a) & 0x03030303; return a; }

In the SIMD context, there is no obvious way to do an if statement to subtract of 3 from every byte greater than 3, so the final step adds one to each byte and masks to select the carry bit. Where adding 1 produced a carry of 1, the byte was 3 and needs to be zeroed. We zero it by adding the carry bit back into each byte and then masking to delete the new carry.

This code (or similar code on larger words) can compete with code that extracts each byte for division by three. The context in which this idea was originally pointed out to me involved use of the streaming SIMD instructions on the Intel x86, but the idea obviously has broader applications.

The solution developed above for a mod 3 is actually one of the most computationally difficult of the Mersenne numbers. Deleting successive lines (with small modifications) produces solutions for successively larger Mersenne numbers:

uint32_t mod15( uint32_t a ) { a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */ a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */ a = (a >> 4) + (a & 0xF); /* sum base 2**4 digits */ if (a < 15) return a; if (a < (2 * 15)) return a - 15; return a - (2 * 15); } uint32_t mod255( uint32_t a ) { a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */ a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */ if (a < 255) return a; if (a < (2 * 255)) return a - 255; return a - (2 * 255); } uint32_t mod65535( uint32_t a ) { a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */ if (a < 65535) return a; if (a < (2 * 65535)) return a - 65535; return a - (2 * 65535); }

The code for computing a mod 65535 above is slightly faster, on average than the hardware mod operator on a 3.06 Ghz Intel Core i3 processor.

Computing a mod 5 on a binary computer is harder. The smallest binary radix larger than 5 is 8, so we could base a solution on the following:

a mod 5 = ( (8 mod 5)(a/8) + (a mod 8) ) mod 5 = ( 3(a/8) + (a mod 8) ) mod 5

Reducing this to C code, we get the following:

unsigned int mod5( unsigned int a ) { while (a > 9) { int s = 0; /* accumulator for the sum of the digits */ while (a != 0) { s = s + (a & 7); a = (a >> 3) * 3; } a = s; } /* note, at this point: a < 10 */ if (a > 4) a = a - 5; return a; }

On a 3 Ghz Intel dual-core Intel Core i3 machine, this code takes 115 nanoseconds more than a subroutine with the minimalist body " return a%5 " — relying on the built-in mod operation. This is worse than half the speed of our first iterative code for mod 3. The explanation for this poor performance lies in the multiply by 3 inside the inner loop. This not only takes a small amount of time, but it forces significantly more iterations in the loop: Where the mod-3 code divided by 4 with each iteration, this code divides by 8/3.

The repeated multiplications also eliminate the possibility of tricks for summing the digits in parallel, but there is a way around this! While 5 does not divide evenly into 7, it does divide evenly into one less than the next power of two, 15. This lets us solve the problem as follows:

a mod 5 = (a mod 15) mod 5

We can trivially turn the code for a mod 15 given above into code to compute a mod 5 by adding just a little logic at the end:

uint32_t mod5( uint32_t a ) { a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */ a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */ a = (a >> 4) + (a & 0xF); /* sum base 2**4 digits */ if (a > 14) a = a - 15; /* now, we have mod 15 */ if (a > 4) a = a - 5; if (a > 4) a = a - 5; return a; }

The precise details of the end matter slightly. Depending on how the processor is pipelined and how the compiler is optimized, it may be faster to use one of the following alternative endings:

if (a > 9) a = a - 10; else if (a > 4) a = a - 5; return a;

or

if (a > 9) return a - 10; if (a > 4) return a - 4; return a;

Trial and error is valuable here, but the difference between these code fragments is small.

There are many paths to computing a mod 6. We can, of course, take a brute-force approach, but as was the case with a mod 5, the results are not very promising. We could base this base solution on the following:

a mod 6 = ( (8 mod 6)(a/8) + (a mod 8) ) mod 6 = ( 2(a/8) + (a mod 8) ) mod 6

Before reducing this to C code, note that 2(a/8) is not the same as a/4 because we are using integer division. On a binary computer, this can be correctly expressed in two ways, either (a >> 3)<< 1 or (a >> 2)& –2. Recall that, in 2's complement binary, –2 is ...111110, so anding with this constant forces the least significant bit of the result to zero while preserving the remaining bits.

unsigned int mod6( unsigned int a ) { while (a > 11) { int s = 0; /* accumulator for the sum of the digits */ while (a != 0) { s = s + (a & 7); a = (a >> 2) & -2; } a = s; } /* note, at this point: a < 12 */ if (a > 5) a = a - 6; return a; }

==== WORK IN PROGRESS ====

A completely different approach to computing a mod 6 involves noting that 6 can be factored as 2 × 3, so we can compute a mod 6 by first computing a mod 2 and a mod 3. To see how this is done, let us work through a few examples:

a 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 a mod 6 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 a mod 3 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 a mod 2 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

There is a pattern here that repeats every 6 integers; a mod 6 is a simple function of a mod 2 and a mod 3. This function can be represented by the following table:

a mod 6 a mod 3 0 1 2 a mod 2 0 0 4 2 1 3 1 5

==== WORK IN PROGRESS ====

Mersenne numbers are numbers of the form 2i–1. We have already worked two Mersenne numbers, a mod 3 and a mod 15 (in the context of a mod 5). Mersenne numbers are sufficiently important that it is worth working another example. Here, we will work on a mod 7, going directly to the fastest solution. In the case of a mod 3 and a mod 15, the shift counts were all multiples of two, whereas a mod 7 will force us to shift counts that are multiples of 3:

uint32_t mod7( uint32_t a ) { a = (a >> 24) + (a & 0xFFFFFF); /* sum base 2**24 digits a <= 0x10001FE worst case 0xFFFFFE*/ a = (a >> 12) + (a & 0xFFF); /* sum base 2**12 digits a <= 0x1FFD */ a = (a >> 6) + (a & 0x3F); /* sum base 2**6 digits a <= 0xBC; worst case ? */ a = (a >> 3) + (a & 0x7); /* sum base 2**2 digits a <= 0x1B; worst case ? */ a = (a >> 2) + (a & 0x7); /* sum base 2**2 digits a <= 0x6; worst case ? */ if (a > 5) a = a - 6; return a; }

==== WORK IN PROGRESS ====