Adventures in Engineering

The wanderings of a modern ronin.



/* Copyright (c) 2012, Ben Cantrick. All rights reserved.

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are met:

   - Redistributions of source code must retain the above copyright notice,
     this list of conditions, and the following disclaimer.

   - Redistributions in binary form must reproduce the above copyright notice,
     this list of conditions, and the following disclaimer in the documentation
     and/or other materials provided with the distribution.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
   AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
   ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
   CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
   SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
   CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
   ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
   POSSIBILITY OF SUCH DAMAGE. */

//
// intmod(y, x): return y % x, or -1 on error.
// Note: the doubling loop can wrap around for y > 2^31, so keep y <= 2^31.
//

#include <stdint.h>  // Requires C99.

int32_t intmod(uint32_t y, uint32_t x)
{
    if(x == 0) { return -1; }                   // Modulo by zero is undefined
    if(y < x)  { return -1; }                   // Dumbass caller
    if(y == x) { return 0; }                    // Smartass caller
    if(x == 2) { return (y & (uint32_t)0x1); }  // Easy case

    uint32_t guess, temp;

    guess = x;
    while(guess < y) {
        guess = guess << 1;       // guess = guess * 2;
    }
    if(guess == y) { return 0; }  // Lucky!

    // guess must be > y
    guess = guess >> 1;           // newguess = guess / 2
    temp = guess >> 1;            // temp = newguess / 2

    while(temp >= x) {            // down to and including the (x * 1) term
        if(guess + temp <= y) {
            guess = guess + temp;
        }
        temp = temp >> 1;         // temp = temp / 2
    }

    return (y - guess);
}



A fully working sample program can be downloaded at http://www.gully.org/~mackys/lj/integermodulo.c.txt



I created this algorithm as a result of reading a post on Reddit that showed some shockingly poor performance for the intrinsic software "%" operator in some C compilers for microcontrollers which don't have hardware divide.



How does the algorithm work? Well, the mathematical definition of modulo is:



y % x = y - (x * int(y/x))



The trick this algorithm employs is to do a kind of weirdo binary search to find (x * int(y/x)), without ever directly computing int(y/x), or using division.



The first loop keeps multiplying x by 2 until it's larger than y. We call this guess. Once guess becomes larger than y, we stop and divide guess by 2, essentially backing up one step. At this point we know that (x * int(y/x)) is between guess and guess * 2.



To actually compute (x * int(y/x)), the second loop goes backward and keeps adding smaller power-of-two multiples of x (held in temp) to guess. But it checks each time that (guess + (x * 2^n)) <= y, and skips the addition when it would overshoot. It keeps going down to n = 0, at which point we have added all appropriate (2^n * x) terms to guess. The resulting number in guess is exactly (x * int(y/x)). And now it's a simple matter to get the modulus by doing (y - guess). Quickly Ends Dat.



All aforementioned divisions and multiplications are accomplished with bit shifts, which are fast operations implemented by hardware on every CPU I've ever seen. (Unlike integer div or modulo.) They also have the nice property that most CPUs can bit shift through a CPU flag or register, allowing the use of multiple-CPU-word size integers without a large performance penalty.



The code does have some notable limitations. First of all, it expects unsigned ints only. Mathematically it is allowable to take the modulo of negative numbers, but this code won't do that. Since the function's arguments are declared unsigned, a negative argument will be silently converted to a large unsigned value; most compilers can warn about this (e.g. GCC's -Wsign-conversion). I'm sure the function could be extended to handle negative numbers, but I don't think I've done a modulo on a negative number in 20 years of writing code, so I'm not going to bother. Secondly, the function doesn't work with floats, doubles or any other exponent/mantissa numeric representation, because bit-shifting those doesn't necessarily result in a multiply or divide by 2.





Performance analysis: The first while loop will run at most log2(y/3) times (when x=3), and the same is true of the second loop. 2 * log2(y/3) translates to O(log2(y)). The average case will be much faster than this worst case. Analysis with gprof and gcov on the sample program shows 51 lines of code executed to compute 31952 % 99. I don't have valgrind installed so I can't give better numbers. If someone wants to do a real "number of instructions executed" analysis, send me a copy so I can put it up here.



Is it possible to do this faster? I strongly suspect so. As the Wikipedia link given below notes, it is possible to reduce (y % x) to (y & (x-1)) when x is a power of 2. Seems to me that trick should be exploitable to bring the time to compute modulo down even farther. But in any event, the above unsophisticated and unoptimized algorithm should still be a very large improvement over modulo implemented via a software divide routine... as some people's compilers are probably doing now.





See also:



http://en.wikipedia.org/wiki/Modulo_operation#Performance_issues



http://graphics.stanford.edu/~seander/bithacks.html#ModulusDivisionEasy



http://stackoverflow.com/questions/2566010/fastest-way-to-calculate-a-128-bit-integer-modulo-a-64-bit-integer









One may justifiably ask, "Why do we need a fast modulo on a microcontroller?"



One reason is shown in the "shockingly poor performance" link given above. If you have a real-time clock and need to frequently display the current time, you may have to perform a lot of % 60 and % 12 on your raw timeval from your RTC.



Another reason is if you're trying to do a lot of hashing, since that generally requires a modulo. Not sure why you'd be doing a bunch of hashing in an embedded system. But if you do need to, and it needs to be reasonably performant, then...





If I recall correctly, MIPS has a modulo instruction. Is that not a common thing for a CPU to have?





Depends on the CPU. Very common for 8-bit microcontrollers to not have modulo or hardware divide. Most don't have hardware multiply either.




