The other day I was investigating the performance profile of a stress test in our Coherent Labs Renoir graphics library. Renoir collects high-level rendering commands like “DrawText”, “DrawPath” etc, and transforms them in low-level API commands for the GPU.

What caught my eyes was that the function BlendingMode2State was taking ~ 1.5% of the time. The function is called for each drawing command (hundreds of times per-frame), but still this looked disproportional. The function is declared with the following signature:

inline BlendingState BlendingMode2State(BlendModes mode)

BlendModes is an enum that contains pre-defined blending modes and BlendingState is a simple structure that contains the required GPU operations to implement that blend mode (SrcBlend, DestBlend etc. for more information on graphics alpha blending, take a look here)

Looking at the implementation made things clear:

The implementation immediately rang a bell. The static variable holding the mapping of the blend modes is declared at function level. According to the lifetime rules of C++, it’ll be initialized the first time the function is called. This bring the side effect the in each cal there is branch that checks if the variable has been initialized. Not good.

Checking the disassembly revealed another inefficiency. The initialization assembly was 2 screens long at the beginning of the function – there are a lot of BlendModes. The assembly did a conditional jump if the variable is initialized and skipped hundreds of bytes worth of initialization code to go to the gist of the function, which is just returning the correct entry in the array.

The third problem was that the initialization code hindered the inlining of the function. It was so large that it made sense that the compiler ignored the “inline” request.

To recap, there were actually 3 linked issues with the function:

static variable at local scope requires a branch on each call

massive initialization code causes potential instruction cache trashing

inlining is impossible due to the initialization code withing the function

The fix was trivial:

I moved the variable outside the function. This allowed for proper function inlining and gave an overall 1% performance increase of the library in my test. Not bad for 10 minutes work.

In your work avoid static variables at function level – they are bad practice anyway, mostly used for singleton objects whose lifetime has to be lazy. If you have such an object that is accessed very often, you might have a similar performance hit.