Originally posted on May 10th, 2009 in the now defunct Tuxology technoblog



GCC and G++, the GNU Compiler Collection's C and C++ language front ends respectively, are a wondrous duo that usually does a decent job translating our C and C++ source into executable code. Well, most of the time at least…

Wonderful as these tools are, however, they are not omnipotent and cannot guess what’s on our minds. Will the function we just wrote be called very often or not? Is that if statement more likely to take the if branch or the else branch? The compiler simply can’t tell, without our help that is.

Supplying the compiler with some hints when writing code can be very beneficial, enabling it to properly optimise the code it produces for best performance, however that is defined.

In this post we shall describe several hints that we can provide to the GCC and G++ compilers to help them understand our program's run-time behaviour better and ultimately produce better code.

There are two types of constructs for providing hints to GCC and G++ about the run-time attributes of the code you write for the purpose of optimisation: function attributes and built-ins.

Function Attributes

Function attributes are modifiers you add to a function declaration to supply the compiler with additional information about that function's run-time behaviour. The canonical form of an attribute in GCC and G++ is:

__attribute__((name))

Where “name” is one of the reserved attribute names. Many such attributes exist and you are encouraged to read the GCC documentation about them.

For example, this is how a function declared with the “pure” attribute looks (more on the pure attribute itself later):

int my_func(int i) __attribute__((pure));

One very common shorthand form that is often used in large projects (for example, the Linux kernel) is to use a pre-processor define to make the attribute use less verbose, like so:

#define __pure __attribute__((pure))
int my_func(int i) __pure;

Or, as I prefer:

int __pure my_func(int i);

I shall use this shorthand form henceforth.

Now that we have established what function attributes are and how to use them, let us go over a list of attributes which can be used to provide hints to the compiler and allow it to better optimise our code:



The pure function attribute



Many functions that we write are pure functions. A pure function is a function whose only effect is its return value, and the return value, in turn, depends solely on the function's parameters and/or global variables, such that if we call the function with the same parameters and the same values in global variables we are guaranteed to get the same return value.

The functions strlen(3) and memcmp(3) are good examples of pure functions; their return value depends solely on their input parameters. On the other hand, feof(3) is not a pure function because, in a multi-threaded application, some other thread might have changed the state of the file that the file descriptor references between two calls to the function.

Compilers can make use of the knowledge that a certain function is pure by performing optimisations such as common sub-expression elimination, in the same way that this is done with arithmetic operators such as “+” or “*”.

Using the pure attribute, we can tell GCC and G++ that a certain function is a pure function and therefore a good candidate for optimisation. For example:

size_t __pure strlen(const char *);

Note that the pure attribute is not implemented in GCC versions earlier than 2.96.
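To see what the attribute buys us, here is a small sketch. The count_char helper is hypothetical, and reuses the __pure shorthand defined earlier; since the compiler knows the call has no side effects and depends only on its arguments, it is free to hoist the repeated call out of the loop instead of re-evaluating it on every iteration:

```c
#include <stddef.h>

#define __pure __attribute__((pure))

/* Hypothetical helper: counts occurrences of c in the string s. */
static size_t __pure count_char(const char *s, char c)
{
    size_t n = 0;
    for (; *s; s++)
        if (*s == c)
            n++;
    return n;
}

/* Because count_char is marked pure and its arguments do not change,
 * the compiler may evaluate it once and reuse the result. */
size_t total(const char *s, char c, size_t times)
{
    size_t sum = 0;
    for (size_t i = 0; i < times; i++)
        sum += count_char(s, c);   /* candidate for hoisting */
    return sum;
}
```

Without the attribute the compiler must assume count_char might, say, read a file or modify a global, and has to call it every time around the loop.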

The hot and cold function attributes



An additional pair of function attributes which you can use to help GCC and G++ emit better code are the hot and cold function attributes:

__attribute__((hot))
__attribute__((cold))

As their names suggest, these function attributes are used to hint to the compiler that the corresponding functions are called often in your code (hot) or seldom called (cold).

The compiler can then order the code in branches, such as if statements, to favour branches that call hot functions and disfavour branches that call cold functions, under the assumption that the branch that will be taken is more likely to call a hot function and less likely to call a cold one.

In addition, the compiler can choose to group functions marked as hot together in a special section of the generated binary, on the premise that since data and instruction caches work based on locality, or the relative distance of related code and data, putting all the often-called functions together will result in better caching of their code for the entire application.

Good candidates for the hot attribute are core functions which are called very often in your code base. Good candidates for the cold attribute are internal error handling functions which are called only in case of errors.

Here is a usage example:

int __hot call_me_often(void);
int __cold handle_error(void);

The hot and cold attributes are available in GCC from version 4.3 onwards.
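Putting the pieces together, here is a minimal self-contained sketch. The __hot and __cold shorthands mirror the __pure define shown earlier, and both function bodies are placeholders standing in for real work:

```c
/* Shorthands in the same style as the __pure define (GCC >= 4.3). */
#define __hot  __attribute__((hot))
#define __cold __attribute__((cold))

/* Hot path: called for every request; the compiler may group such
 * functions together in the binary for better instruction caching. */
int __hot process_request(int request)
{
    return request * 2;   /* placeholder for the real work */
}

/* Cold path: only reached on failure; may be moved out of the way
 * of the frequently executed code. */
int __cold handle_error(int err)
{
    return -err;   /* placeholder for real error handling */
}
```

The attributes change only code placement and branch bias, never the functions' observable behaviour.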

Built-ins



Another form of hint that a programmer may provide to the compiler about the code's run-time behaviour for the purpose of optimisation is calling a built-in compiler function.

Data pre-fetching and cache pre-population



As you are most probably aware, fetching information from RAM is a costly business in terms of a computer program's time. This is why an elaborate system of on-CPU and off-CPU caches is used to store recently accessed data and instructions, in the hope that these will be needed again soon and the cost of fetching them from RAM can thus be saved.

Most chip architectures and compilers do a pretty decent job of working together to automatically populate the various caches with the right data and instructions in order to achieve better performance. Sometimes, however, a programmer can provide a hint to the compiler about what data might be needed ahead of time, thus allowing the compiler to emit instructions that will fetch the data from RAM into the cache while the CPU is busy doing other processing, making sure the data is available in the cache by the time the CPU gets around to needing it.

The way to do this with GCC and G++ is to use the built-in pre-fetch function:

void __builtin_prefetch(const void *addr, int rw, int locality);

The built-in pre-fetch function will cause the compiler to emit instructions that will pre-populate the cache with the data at address addr. Note that it is allowed to provide an invalid address to pre-fetch; doing so will not fault.

The optional rw parameter indicates whether the requested access is for reading or writing (1 denotes read/write access while 0 denotes read only), whereas the optional locality parameter describes the temporal locality of the data. That is, the degree to which the same data is likely to be used again soon and therefore whether or not to keep it in the cache after it is used (0 means “do not keep at all” and 3 denotes “keep as long as possible”).

As a usage example, consider the common operation of traversing a linked list. Typically, we iterate over the list, processing each item in turn and then moving on to the next list item:

while (item) {
    item->field++;
    // more processing...
    item = item->next;
}

Using the built-in pre-fetch function, we can write the same code like this:

while (item) {
    // No NULL check needed, since an invalid address is OK
    __builtin_prefetch(item->next, 1, 1);
    item->field++;
    // more processing...
    item = item->next;
}

Adding the pre-fetch call allows the compiler to issue instructions that will get the CPU to start the transaction that fetches the next item's data from main RAM into the cache while the current item, already in the cache, is being processed. With any luck, by the time the CPU has finished processing the current item, the next item will already have been pre-fetched into the cache, just in time for it to be processed.

Proper use of pre-fetching and cache pre-population can sometimes have a dramatic effect on the performance of some kinds of computer programs.
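For completeness, the traversal above can be wrapped into a compilable sketch. The struct layout and function name are assumptions made for illustration only:

```c
#include <stddef.h>

/* Hypothetical list node matching the snippet above. */
struct item {
    int field;
    struct item *next;
};

static void process_list(struct item *item)
{
    while (item) {
        /* Start fetching the next node (read/write access, modest
         * temporal locality) while we still work on this one; the
         * NULL at the list tail is a harmless prefetch target. */
        __builtin_prefetch(item->next, 1, 1);
        item->field++;
        // more processing...
        item = item->next;
    }
}
```

Note that __builtin_prefetch is purely a hint: on targets without a prefetch instruction the compiler simply emits nothing for it, so the code remains portable across GCC targets.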

Branch annotation



Branches in a program's code path, such as those that happen due to an if statement, are challenging for modern CPUs. As most modern-day CPUs make use of multiple pipelines of instructions that execute in parallel, a branch in a program is problematic, since the CPU does not know for sure in advance which branch will be taken until it has executed all the instructions that lead up to the branch, thus stalling the pipeline.

The solution taken by CPU makers for this problem is the introduction of branch-prediction logic into the CPU, which tries to estimate which branch will be taken and advance the pipeline based on that estimation. If the assumption later proves to be wrong, the CPU aborts the pipeline and starts over; a costly business.

As you have probably guessed by now, we can use the knowledge we have as programmers about the program flow to hint the compiler about the chances that a certain branch, or if statement, will be taken or not. The compiler, in turn, will use this information to bias the CPU to prefer to proceed with the most probable branch, the one least likely to cause a hazard and an abort of the pipeline.

Other benefits are that the compiler can re-order the binary code such that the code most likely to execute in a branch will be closest to the branch instruction itself, thus preserving code locality and therefore assisting caching.

To provide the GCC and G++ compilers with the probable outcome of a branch, we use the built-in expect function:

__builtin_expect(!!(x), e)

The function is used by applying it to a branch predicate, such as in the following example:

if (__builtin_expect(!!(condition), 1)) { ... }

The x parameter to the built-in expect function is the predicate we test in the if statement (the double negation simply normalises any non-zero value to 1) and the e parameter is the most likely outcome of evaluating the predicate: 1 for true or 0 for false.

Often, the built-in expect function is wrapped for convenience using two macros. Such is the case in the Linux kernel source, for example –

#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

Using these macros, the usage becomes the following, which is much more readable:

if (likely(condition)) { ... }

It should be noted that GCC also provides a compile-time option, -fprofile-arcs, which instruments the code to collect the actual statistics for each branch at run time; using such measured profiles should be preferred over guessing.

Conclusion



We have detailed above several noteworthy ways of providing hints to the GCC and G++ compilers about your code's behaviour at run time, hints that allow the compiler to better optimise your code and therefore achieve superior performance. Making a habit of using these hints as part of your normal code-writing routine can easily provide substantial returns over time.

Care should be taken, however, not to get these hints wrong, as there is nothing worse than providing your compiler with the wrong hint and thus “pessimising” your code instead of optimising it.