Introduction

This document has evolved over time and contains a number of the best ways to hand-optimise your C code. Compilers are good, but they can't do everything, and here I hope to help you squeeze the best performance out of your code. This is not intended for beginners, but for more experienced programmers.



Disclaimer

Depending on your particular hardware and compiler, some of these techniques may actually slow down your code. Do some timings with and without them, as modern compilers may well be able to do things better at a low level. Improving the overall algorithm used will often produce better results than localised code tweaking. This document was originally written as a set of personal notes for myself - do not consider it to be an authoritative paper on the subject of optimisation. I may have made mistakes! If you have anything to add to this, or just have some constructive criticism (flames ignored), please contact me at the address below.

Coding for speed

No error-checking is shown here as this article is only concerned with the fundamentals. By the time the application gets down to the low-level routines, you should have filtered out any bad data already.

Using array indices

switch ( queue ) {
    case 0 : letter = 'W'; break;
    case 1 : letter = 'S'; break;
    case 2 : letter = 'U'; break;
}

if ( queue == 0 )
    letter = 'W';
else if ( queue == 1 )
    letter = 'S';
else
    letter = 'U';

static char *classes="WSU"; letter = classes[queue];

Aliases

void func1( int *data )
{
    int i;

    for(i=0; i<10; i++)
    {
        somefunc2( *data, i);
    }
}

void func1( int *data )
{
    int i;
    int localdata;

    localdata = *data;
    for(i=0; i<10; i++)
    {
        somefunc2( localdata, i);
    }
}

Integers

register unsigned int var_name;

Loop jamming

for(i=0; i<100; i++)
{
    stuff();
}

for(i=0; i<100; i++)
{
    morestuff();
}

for(i=0; i<100; i++)
{
    stuff();
    morestuff();
}

Loop Unrolling and Dynamic Loop Unrolling

for(i=0; i<3; i++) { something(i); }

something(0); something(1); something(2);

for(i=0;i<limit;i++){ ... }

Listing 1

#include <stdio.h>

#define BLOCKSIZE (8)

int main(void)
{
    int i = 0;
    int limit = 33;  /* could be anything */
    int blocklimit;

    /* The limit may not be divisible by BLOCKSIZE,
     * go as near as we can first, then tidy up.
     */
    blocklimit = (limit / BLOCKSIZE) * BLOCKSIZE;

    /* unroll the loop in blocks of 8 */
    while( i < blocklimit )
    {
        printf("process(%d)\n", i);
        printf("process(%d)\n", i+1);
        printf("process(%d)\n", i+2);
        printf("process(%d)\n", i+3);
        printf("process(%d)\n", i+4);
        printf("process(%d)\n", i+5);
        printf("process(%d)\n", i+6);
        printf("process(%d)\n", i+7);

        /* update the counter */
        i += 8;
    }

    /*
     * There may be some left to do.
     * This could be done as a simple for() loop,
     * but a switch is faster (and more interesting)
     */
    if( i < limit )
    {
        /* Jump into the case at the place that will allow
         * us to finish off the appropriate number of items.
         * Every case deliberately falls through to the next.
         */
        switch( limit - i )
        {
            case 7 : printf("process(%d)\n", i); i++;
            case 6 : printf("process(%d)\n", i); i++;
            case 5 : printf("process(%d)\n", i); i++;
            case 4 : printf("process(%d)\n", i); i++;
            case 3 : printf("process(%d)\n", i); i++;
            case 2 : printf("process(%d)\n", i); i++;
            case 1 : printf("process(%d)\n", i);
        }
    }

    return 0;
}

for(i=0; i<10; i++) { do stuff... }

for( i=10; i--; )

Faster for() loops

for( i=0; i<10; i++){ ... }

If you don't care about the order of the loop counter, you can do this instead:

for( i=10; i--; ) { ... }

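The syntax is a little strange, but is perfectly legal. The third statement in the loop is optional (an infinite loop would be written as "for( ; ; )" ). The same number of iterations could also be obtained by coding either of the following, although the counter then runs from 10 down to 1 inside the body rather than from 9 down to 0: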

for(i=10; i; i--){}

for(i=10; i!=0; i--){}
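To see which counter values each form visits, here is a minimal sketch. The function names are made up for illustration, and n is assumed to be non-negative. Each function adds up the counter values seen inside the loop body, so a differing total shows that the bodies saw different values even though they ran the same number of times.

```c
/* Sums the counter values seen inside the body of for( i=n; i--; ).
 * The decrement happens as part of the test, so the body sees
 * n-1 down to 0. Assumes n >= 0. */
int sum_test_decrement(int n)
{
    int i;
    int sum = 0;

    for (i = n; i--; ) {
        sum += i;
    }
    return sum;
}

/* Sums the counter values seen inside the body of for( i=n; i; i-- ).
 * The decrement happens after the body, so the body sees
 * n down to 1. Assumes n >= 0. */
int sum_post_decrement(int n)
{
    int i;
    int sum = 0;

    for (i = n; i; i--) {
        sum += i;
    }
    return sum;
}
```

If the counter is used as an array index, the first form covers indices 0 to n-1 directly, while the second would need i-1 as the index.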

switch() instead of if...else...

if( val == 1)
    dostuff1();
else if (val == 2)
    dostuff2();
else if (val == 3)
    dostuff3();

switch( val )
{
    case 1: dostuff1(); break;
    case 2: dostuff2(); break;
    case 3: dostuff3(); break;
}

Pointers

void print_data( const bigstruct *data_pointer) { ...printf contents of structure... }
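As a rough illustration of why a pointer parameter helps with large structures, the sketch below contrasts passing by value with passing by const pointer. The layout of bigstruct is not shown in these notes, so the fields here are purely hypothetical, as are the two function names.

```c
/* Hypothetical layout: the notes do not show what bigstruct contains. */
typedef struct {
    int    id;
    double samples[1024];
} bigstruct;

/* By value: the whole structure (about 8 KB with this layout)
 * is copied for every call. */
double total_by_value(bigstruct data)
{
    double total = 0.0;
    int i;

    for (i = 0; i < 1024; i++) {
        total += data.samples[i];
    }
    return total;
}

/* By pointer: only an address is passed. The const qualifier tells
 * both the reader and the compiler that the function does not
 * modify the structure. */
double total_by_pointer(const bigstruct *data_pointer)
{
    double total = 0.0;
    int i;

    for (i = 0; i < 1024; i++) {
        total += data_pointer->samples[i];
    }
    return total;
}
```

Both functions return the same result; only the cost of the call differs.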

Early loop breaking

found = FALSE;
for(i=0;i<10000;i++)
{
    if( list[i] == -99 )
    {
        found = TRUE;
    }
}

if( found ) printf("Yes, there is a -99. Hooray!\n");

This works well, but will process the entire array, no matter where the search item occurs in it.

A better way is to abort the search as soon as you've found the desired entry.

found = FALSE;
for(i=0; i<10000; i++)
{
    if( list[i] == -99 )
    {
        found = TRUE;
        break;
    }
}

if( found ) printf("Yes, there is a -99. Hooray!\n");

If the item is at, say, position 23, the loop will stop there and then, and skip the remaining 9977 iterations.

In general, savings can be made by trading off memory for speed. If you can cache any often used data rather than recalculating or reloading it, it will help. Examples of this would be sine/cosine tables, or tables of pseudo-random numbers (calculate 1000 once at the start, and just reuse them if you don't need truly random numbers).
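The pseudo-random table idea could be sketched roughly as follows. The table size of 1000 comes from the text above; the function and variable names are invented for this example, and the table simply wraps around when it runs out, so the sequence repeats every 1000 values.

```c
#include <stdlib.h>

#define RANDOM_TABLE_SIZE 1000   /* "calculate 1000 once at the start" */

static int random_table[RANDOM_TABLE_SIZE];
static int random_pos = 0;

/* Fill the table once, at start-up. */
void init_random_table(unsigned int seed)
{
    int i;

    srand(seed);
    for (i = 0; i < RANDOM_TABLE_SIZE; i++) {
        random_table[i] = rand();
    }
    random_pos = 0;
}

/* Return the next cached value, wrapping round at the end of the
 * table. No call to rand() is made here. */
int cached_random(void)
{
    int value = random_table[random_pos];

    random_pos++;
    if (random_pos == RANDOM_TABLE_SIZE) {
        random_pos = 0;
    }
    return value;
}
```

A sine/cosine table would follow the same pattern: fill an array once with sin() results at a chosen resolution, then index into it instead of calling sin() again.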

Avoid using ++ and -- etc. within loop expressions, e.g. while(n--){}, as this can sometimes be harder to optimise.

Minimize the use of global variables.

Declare anything within a file (external to functions) as static, unless it is intended to be global.

Use word-size variables if you can, as the machine can work with these better ( instead of char, short, double, bitfields etc. ).

Don't use recursion. Recursion can be very elegant and neat, but creates many more function calls which can become a large overhead.

Avoid the sqrt() square root function in loops - calculating square roots is very CPU intensive.
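One common way to keep sqrt() out of a loop is to compare squared values instead. The sketch below is an assumed example, not taken from these notes: it counts how many points lie within a given distance of the origin, and the function name and parameters are hypothetical. Since both sides of the comparison are non-negative, comparing the squares gives the same answer as comparing the square roots.

```c
/* Counts how many of the n points (xs[i], ys[i]) lie within `radius`
 * of the origin. Instead of testing sqrt(x*x + y*y) <= radius for
 * each point, the radius is squared once outside the loop and the
 * squared distance is compared against it. Assumes radius >= 0. */
int count_within_radius(const double *xs, const double *ys,
                        int n, double radius)
{
    const double radius_squared = radius * radius;
    int count = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (xs[i] * xs[i] + ys[i] * ys[i] <= radius_squared) {
            count++;
        }
    }
    return count;
}
```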

Single dimension arrays are faster than multi-dimensioned arrays.
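As a sketch of what this can look like in practice, a two-dimensional grid can be stored in a single-dimension array and indexed by hand. The grid size and the names below are assumptions made for the example. Whether this beats a true two-dimensional array depends on the compiler, so it is worth timing both.

```c
#define ROWS 4
#define COLS 8

/* One flat array holding ROWS * COLS cells, row after row. */
static int grid[ROWS * COLS];

/* Cell (row, col) lives at index row * COLS + col. */
void set_cell(int row, int col, int value)
{
    grid[row * COLS + col] = value;
}

/* When the order of the cells does not matter, the whole grid can be
 * walked with a single counter and no index arithmetic at all. */
long sum_cells(void)
{
    long total = 0;
    int i;

    for (i = 0; i < ROWS * COLS; i++) {
        total += grid[i];
    }
    return total;
}
```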

Compilers can often optimise a whole file - avoid splitting off closely related functions into separate files, as the compiler will do better if it can see both of them together (it might be able to inline the code, for example).

Single precision maths may be faster than double precision - there is often a compiler switch for this.

Floating point multiplication is often faster than division - use val * 0.5 instead of val / 2.0.

Addition is quicker than multiplication - use val + val + val instead of val * 3

puts() is quicker than printf(), although less flexible.

Use #defined macros instead of commonly used tiny functions - sometimes the bulk of CPU usage can be tracked down to a small external function being called thousands of times in a tight loop. Replacing it with a macro to perform the same job will remove the overhead of all those function calls, and allow the compiler to be more aggressive in its optimisation.
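A small sketch of the macro replacement, using a squaring helper as a stand-in for the "tiny function". The names square_func, SQUARE and sum_of_squares are made up for this example.

```c
/* Function version: every call pays the call/return overhead unless
 * the compiler manages to inline it. */
int square_func(int x)
{
    return x * x;
}

/* Macro version: expanded in place by the preprocessor, so no call
 * is made. The argument and the whole expression are parenthesised
 * so that SQUARE(a + b) expands correctly. The argument is evaluated
 * twice, so never pass an expression with side effects such as i++. */
#define SQUARE(x) ((x) * (x))

/* Hypothetical tight loop that uses the macro instead of the function. */
long sum_of_squares(const int *values, int n)
{
    long sum = 0;
    int i;

    for (i = 0; i < n; i++) {
        sum += SQUARE(values[i]);
    }
    return sum;
}
```

The double evaluation of the argument is the main trap with this technique, which is why the macro is only given a plain array element here.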

Binary/unformatted file access is faster than formatted access, as the machine does not have to convert between human-readable ASCII and machine-readable binary. If you don't actually need to read the data in a file yourself, consider making it a binary file.
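Unformatted access with the standard library might look like the sketch below, which writes and reads an array of int with fwrite() and fread() rather than fprintf() and fscanf(). The function names are hypothetical. Note that a file written this way depends on the size and byte order of int on the machine that wrote it, so it is only suitable for data read back on a compatible system.

```c
#include <stdio.h>

/* Writes n ints to `path` in raw binary form.
 * Returns the number of items actually written (0 if the file
 * could not be opened). */
size_t save_binary(const char *path, const int *values, size_t n)
{
    FILE *fp = fopen(path, "wb");
    size_t written;

    if (fp == NULL) {
        return 0;
    }
    written = fwrite(values, sizeof values[0], n, fp);
    fclose(fp);
    return written;
}

/* Reads up to n ints from `path` straight into memory, with no
 * text-to-number conversion. Returns the number of items read. */
size_t load_binary(const char *path, int *values, size_t n)
{
    FILE *fp = fopen(path, "rb");
    size_t got;

    if (fp == NULL) {
        return 0;
    }
    got = fread(values, sizeof values[0], n, fp);
    fclose(fp);
    return got;
}
```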

If your library supports the mallopt() function (for controlling malloc), use it. The MAXFAST setting can make significant improvements to code that does a lot of malloc work. If a particular structure is created/destroyed many times a second, try setting the mallopt options to work best with that size.

Last but definitely not least - turn compiler optimisation on! Seems obvious, but is often forgotten in that last minute rush to get the product out on time. The compiler will be able to optimise at a much lower level than can be done in the source code, and perform optimisations specific to the target processor.

If you find any of this helps to dramatically increase the performance of your software, please let me know.
