Tuesday, February 8th, 2011 by Nigel Jones

This is the thirteenth in a series of tips on writing efficient C for embedded systems. As the title suggests, if you are interested in writing efficient C, you need to be cautious about using the modulus operator. Why is this? A little thought shows that C = A % B is equivalent to C = A - B * (A / B). In other words, the modulus operator is functionally equivalent to three operations: a division, a multiplication and a subtraction. As a result, it’s hardly surprising that code that uses the modulus operator can take a long time to execute. In some cases you absolutely have to use the modulus operator. However, in many cases it’s possible to restructure the code so that the modulus operator is not needed. To demonstrate what I mean, some background is in order as to how this blog posting came about.
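The identity above can be checked directly. The following is a minimal sketch, assuming unsigned 32 bit operands and a non-zero divisor; the helper name mod_via_div is hypothetical and exists only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: computes a % b the long way, as a - b * (a / b).
 * Assumes b != 0. For unsigned operands the result equals a % b. */
uint32_t mod_via_div(uint32_t a, uint32_t b)
{
    uint32_t quotient = a / b;     /* the division hidden inside % */

    return a - (b * quotient);     /* followed by a multiply and a subtract */
}
```

Calling mod_via_div(3661UL, 60UL) gives the same answer as 3661UL % 60UL, but it makes the three underlying operations visible.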

Converting seconds to days, hours, minutes and seconds

In embedded systems design there is an increasing need for some form of real time clock. The designer typically implements the time as a 32 bit variable containing the number of seconds since a particular date. Once this is done, it’s not usually long before one has to convert the ‘time’ into days, hours, minutes and seconds. I found myself in just such a situation recently, so I did a quick internet search to find the ‘best’ way of converting ‘time’ to days, hours, minutes and seconds. The code I found wasn’t great and, as usual, was highly PC centric. I thus sat down to write my own code.

Attempt #1 – Using the modulus operator

My first attempt used the ‘obvious’ algorithm and employed the modulus operator. The relevant code fragment appears below.

void compute_time(uint32_t time)
{
    uint32_t days, hours, minutes, seconds;

    seconds = time % 60UL;
    time /= 60UL;
    minutes = time % 60UL;
    time /= 60UL;
    hours = time % 24UL;
    time /= 24UL;
    days = time;
}

This approach has a nice looking symmetry to it. However, it contains three divisions and three modulus operations. I was thus rather concerned about its performance, and so I measured its speed on three different architectures: AVR (8 bit), MSP430 (16 bit), and ARM Cortex (32 bit). In all three cases I used an IAR compiler with full speed optimization. The cycle counts quoted are for 10 invocations of the test code and include the test harness overhead:

AVR: 29,825 cycles

MSP430: 27,019 cycles

ARM Cortex: 390 cycles

No, that isn’t a misprint. The ARM was nearly two orders of magnitude more cycle efficient than the MSP430 and AVR. Thus my claim that the modulus operator can be very inefficient is true for some architectures, but not all. If you are using the modulus operator on an ARM processor, then it’s probably not worth worrying about. However, if you are working on smaller processors then clearly something needs to be done, and so I investigated some alternatives.
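One practical point when measuring code like this: compute_time() as shown stores its results only in local variables, so an optimizing compiler is free to discard the arithmetic altogether. The harness used for the figures above is not shown. What follows is only a sketch of one way such a test could be structured, assuming the results are written to volatile globals and the cycles are counted externally (for example in a simulator). All of the names and input values are hypothetical.

```c
#include <stdint.h>

/* Hypothetical sinks: volatile, so the compiler must perform the stores
 * and therefore the arithmetic that feeds them. */
volatile uint32_t g_days, g_hours, g_minutes, g_seconds;

/* Attempt #1 arithmetic, with the results made observable. */
void compute_time_observable(uint32_t time)
{
    g_seconds = time % 60UL;
    time /= 60UL;
    g_minutes = time % 60UL;
    time /= 60UL;
    g_hours = time % 24UL;
    time /= 24UL;
    g_days = time;
}

/* Ten invocations, matching the count quoted above. Start and stop the
 * cycle counter around a call to this function. */
void run_harness(void)
{
    uint32_t t = 1297123200UL;     /* arbitrary starting value (assumption) */
    uint8_t i;

    for (i = 0U; i < 10U; ++i)
    {
        compute_time_observable(t);
        t += 98765UL;              /* vary the input on each pass */
    }
}
```

The same harness can then be reused unchanged for each variant, so that the overhead it adds is identical in every measurement.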

Attempt #2 – Replace the modulus operator

As mentioned in the introduction, C = A % B is equivalent to C = A – B * (A / B). If we compare this to the code in attempt 1, then it should be apparent that the intermediate value (A/B) computed as part of the modulus operation is in fact needed in the next line of code. Thus this suggests a simple optimization to the algorithm.

void compute_time(uint32_t time)
{
    uint32_t days, hours, minutes, seconds;

    days = time / (24UL * 3600UL);
    time -= days * 24UL * 3600UL;
    /* time now contains the number of seconds in the last day */
    hours = time / 3600UL;
    time -= (hours * 3600UL);
    /* time now contains the number of seconds in the last hour */
    minutes = time / 60U;
    seconds = time - minutes * 60U;
}

In this case I have replaced three mods with three subtractions and three multiplications. Thus although I have replaced a single operator (%) with two operations (- *), I still expected an increase in speed because the modulus operator is actually three operations in one (- * /). Effectively I have eliminated three divisions, and so I expected a significant improvement in speed. The results, however, were a little surprising:

AVR: 18,720 cycles

MSP430: 14,805 cycles

ARM Cortex: 384 cycles

Thus while this technique made the code roughly 1.6 to 1.8 times faster on the AVR and MSP430 processors, it had essentially no impact on the ARM code. Presumably this is because the ARM has hardware support for division, which makes the modulus operation cheap. Notwithstanding the ARM results, it’s clear that, at least in this example, it’s possible to significantly speed up an algorithm by eliminating the modulus operator.
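Since the two attempts are meant to produce identical results, it is worth confirming this on a host before trusting any timing comparison. The sketch below is one way to do so. The time_parts_t structure and the function names are hypothetical, introduced only so that the results can be returned and compared rather than discarded.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical container for the four results. */
typedef struct
{
    uint32_t days, hours, minutes, seconds;
} time_parts_t;

/* Attempt #1 style: a division and a modulus at each step. */
time_parts_t split_with_mod(uint32_t time)
{
    time_parts_t p;

    p.seconds = time % 60UL;
    time /= 60UL;
    p.minutes = time % 60UL;
    time /= 60UL;
    p.hours = time % 24UL;
    time /= 24UL;
    p.days = time;
    return p;
}

/* Attempt #2 style: one division per step, then multiply and subtract. */
time_parts_t split_with_sub(uint32_t time)
{
    time_parts_t p;

    p.days = time / (24UL * 3600UL);
    time -= p.days * 24UL * 3600UL;
    p.hours = time / 3600UL;
    time -= p.hours * 3600UL;
    p.minutes = time / 60UL;
    p.seconds = time - (p.minutes * 60UL);
    return p;
}

/* True when both versions agree for the given input. */
bool splits_agree(uint32_t time)
{
    time_parts_t a = split_with_mod(time);
    time_parts_t b = split_with_sub(time);

    return (a.days == b.days) && (a.hours == b.hours) &&
           (a.minutes == b.minutes) && (a.seconds == b.seconds);
}
```

Boundary values such as 0, 59, 86,399, 86,400 and the largest 32 bit value are the obvious inputs to try first.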

I could of course just stop at this point. However, examination of attempt 2 shows that further optimizations are possible by observing that if time is a 32 bit variable, then days can be at most a 16 bit variable. Furthermore, hours, minutes and seconds are inherently limited to an 8 bit range. I thus recoded attempt 2 to use smaller data types.
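The arithmetic behind these limits is easy to verify. The following is a sketch, assuming a C11 compiler for _Static_assert (with an older compiler the same figures can simply be checked by hand). The macro and function names are hypothetical.

```c
#include <stdint.h>

#define SECONDS_PER_DAY  (24UL * 3600UL)                 /* 86,400 */
#define MAX_DAYS         (UINT32_MAX / SECONDS_PER_DAY)  /* 49,710 */

/* The largest day count obtainable from a 32 bit seconds value
 * fits comfortably in 16 bits (maximum 65,535). */
_Static_assert(MAX_DAYS <= UINT16_MAX, "days does not fit in 16 bits");

/* Hours (0 to 23), minutes (0 to 59) and seconds (0 to 59) fit in 8 bits. */
_Static_assert(59U <= UINT8_MAX, "minutes or seconds do not fit in 8 bits");

/* Once whole hours have been removed, at most 3,599 seconds remain,
 * so that intermediate value fits in 16 bits as well. */
_Static_assert((3600UL - 1UL) <= UINT16_MAX, "remainder does not fit in 16 bits");

/* Exposes the computed limit so it can be inspected at run time. */
uint16_t max_days(void)
{
    return (uint16_t)MAX_DAYS;
}
```

If any of these assumptions were wrong, the build would fail rather than the code silently truncating a value.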

Attempt #3 – Data type size reduction

My naive implementation of the code looked like this:

void compute_time(uint32_t time)
{
    uint16_t days;
    uint8_t hours, minutes, seconds;
    uint16_t stime;

    days = (uint16_t)(time / (24UL * 3600UL));
    time -= (uint32_t)days * 24UL * 3600UL;
    /* time now contains the number of seconds in the last day */
    hours = (uint8_t)(time / 3600UL);
    stime = time - ((uint32_t)hours * 3600UL);
    /* stime now contains the number of seconds in the last hour */
    minutes = stime / 60U;
    seconds = stime - minutes * 60U;
}

All I have done is change the data types and add casts where appropriate. The results were interesting:

AVR: 14,400 cycles

MSP430: 11,457 cycles

ARM Cortex: 434 cycles

Thus while this resulted in a significant improvement for the AVR & MSP430, it made the ARM about 13% slower. Clearly the ARM doesn’t like working with non 32 bit variables. This suggested an improvement that would make the code a lot more portable: use the C99 fast types. Doing this gives the following code:

Attempt #4 – Using the C99 fast data types

void display_time(uint32_t time)
{
    uint_fast16_t days;
    uint_fast8_t hours, minutes, seconds;
    uint_fast16_t stime;

    days = (uint_fast16_t)(time / (24UL * 3600UL));
    time -= (uint32_t)days * 24UL * 3600UL;
    /* time now contains the number of seconds in the last day */
    hours = (uint_fast8_t)(time / 3600UL);
    stime = time - ((uint32_t)hours * 3600UL);
    /* stime now contains the number of seconds in the last hour */
    minutes = stime / 60U;
    seconds = stime - minutes * 60U;
}

All I have done is change the data types to the C99 fast types. The results were encouraging:

AVR: 14,400 cycles

MSP430: 11,595 cycles

ARM Cortex: 384 cycles

Although the MSP430 time increased very slightly, the AVR and ARM stayed at their fastest speeds. Thus attempt #4 is both fast and portable.
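The fast types only promise a minimum width. The actual size is whatever the implementation considers fastest, which is why the same source can end up with small variables on one target and register sized variables on another. A minimal sketch for inspecting what a given compiler chose follows. The function names are hypothetical, and the sizes returned are implementation specific, so no particular values are assumed here.

```c
#include <stdint.h>
#include <limits.h>

/* Storage size, in bits, that this implementation chose for uint_fast8_t.
 * C99 guarantees only that the type is at least 8 bits wide. */
unsigned int fast8_size_bits(void)
{
    return (unsigned int)(sizeof(uint_fast8_t) * CHAR_BIT);
}

/* Storage size, in bits, that this implementation chose for uint_fast16_t.
 * C99 guarantees only that the type is at least 16 bits wide. */
unsigned int fast16_size_bits(void)
{
    return (unsigned int)(sizeof(uint_fast16_t) * CHAR_BIT);
}
```

Comparing these values across your targets is a quick way to see whether the fast types are actually changing anything for a particular compiler.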

Conclusion

Not only did replacing the modulus operator with alternative operations result in faster code, it also opened up the possibility of further optimizations. As a result, on the AVR & MSP430 I was able to more than halve the execution time.

Converting Integers for Display

A similar problem (with a similar solution) occurs when one wants to show integers on a display. For example, if you are using a custom LCD panel with, say, a 3 digit numeric field, then the problem arises of how to determine the value of each digit. The obvious way, using the modulus operator, is as follows:

void display_value(uint16_t value)
{
    uint8_t msd, nsd, lsd;

    if (value > 999)
    {
        value = 999;
    }
    lsd = value % 10;
    value /= 10;
    nsd = value % 10;
    value /= 10;
    msd = value;
    /* Now display the digits */
}

However, using the technique espoused above, we can rewrite this much more efficiently as:

void display_value(uint16_t value)
{
    uint8_t msd, nsd, lsd;

    if (value > 999U)
    {
        value = 999U;
    }
    msd = value / 100U;
    value -= msd * 100U;
    nsd = value / 10U;
    value -= nsd * 10U;
    lsd = value;
    /* Now display the digits */
}

If you benchmark this you should find it considerably faster than the modulus based approach.
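Before benchmarking, a quick way to gain confidence in the rewrite is to compare both versions over the whole input range on a host. This is only a sketch. The digits_t structure and the function names are hypothetical, added so that the digits can be returned rather than displayed.

```c
#include <stdint.h>

/* Hypothetical container for the three digits. */
typedef struct
{
    uint8_t msd, nsd, lsd;
} digits_t;

/* Modulus based version: least significant digit first. */
digits_t digits_with_mod(uint16_t value)
{
    digits_t d;

    if (value > 999U)
    {
        value = 999U;
    }
    d.lsd = (uint8_t)(value % 10U);
    value /= 10U;
    d.nsd = (uint8_t)(value % 10U);
    value /= 10U;
    d.msd = (uint8_t)value;
    return d;
}

/* Divide, multiply and subtract version: most significant digit first. */
digits_t digits_with_sub(uint16_t value)
{
    digits_t d;

    if (value > 999U)
    {
        value = 999U;
    }
    d.msd = (uint8_t)(value / 100U);
    value -= (uint16_t)(d.msd * 100U);
    d.nsd = (uint8_t)(value / 10U);
    value -= (uint16_t)(d.nsd * 10U);
    d.lsd = (uint8_t)value;
    return d;
}

/* Returns the number of inputs in 0..1000 for which the versions disagree.
 * 1000 is included to exercise the clamp to 999. */
uint16_t count_digit_mismatches(void)
{
    uint16_t value;
    uint16_t mismatches = 0U;

    for (value = 0U; value <= 1000U; ++value)
    {
        digits_t a = digits_with_mod(value);
        digits_t b = digits_with_sub(value);

        if ((a.msd != b.msd) || (a.nsd != b.nsd) || (a.lsd != b.lsd))
        {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A return value of zero from count_digit_mismatches() indicates that the two versions are interchangeable over the tested range.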
