Four Habit-Forming Tips to Faster C++

Are you a victim of premature pessimisation? Here’s a short definition from Herb Sutter:

Premature pessimization is when you write code that is slower than it needs to be, usually by asking for unnecessary extra work, when equivalently complex code would be faster and should just naturally flow out of your fingers.

Despite how amazing today’s compilers have become at generating code, humans still know more about the intended use of a function or class than can be specified by mere syntax. Compilers operate under a host of very strict rules that enforce correctness at the expense of faster code. What’s more, modern processor architectures sometimes compete with C++ language habits that have become ingrained in programmers from decades of previous best practice.

I believe that if you want to improve the speed of your code, you need to adopt habits that take advantage of modern compilers and modern processor architectures—habits that will help your compiler generate the best-possible code. Habits that, if you follow them, will generate faster code before you even start the optimisation process.

Here’s four habit-forming tips that are all about avoiding pessimisation and, in my experience, go a long way to creating faster C++ classes.

1) Make use of the (named-) return-value optimisation

According to Lawrence Crowl, (named-) return-value optimisation ((N)RVO) is one of the most important optimisations in modern C++. Okay—what is it?

Let’s start with plain return-value optimization (RVO). Normally, when a C++ method returns an unnamed object, the compiler creates a temporary object, which is then copy-constructed into the target object.

MyData myFunction() { return MyData(); // Create and return unnamed obj } MyData abc = myFunction();

With RVO, the C++ standard allows the compiler to skip the creation of the temporary, treating both object instances—the one inside the function and the one assigned to the variable outside the function—as the same. This usually goes under the name of copy elision. But what is elided here is the temporary and the copy.

So, not only do you save the copy constructor call, you also save the destructor call, as well as some stack memory. Clearly, elimination of extra calls and temporaries saves time and space, but crucially, RVO is an enabler for pass-by-value designs. Imagine MyData was a large million-by-million matrix. There mere chance that some target compiler could fail to implement this optimisation would make every good programmer shy away from return-by-value and resort to out parameters instead (more on those further down).

As an aside: don’t C++ Move Semantics solve this? The answer is: no. If you move instead of copy, you still have the temporary and its destructor call in the executable code. And if your matrix is not heap-allocated, but statically sized, such as a std::array<std::array<double, 1000>, 1000>> , moving is the same as copying. With RVO, you mustn’t be afraid of returning by value. You must unlearn what you have learned and embrace return-by-value.

Named Return Value Optimization is similar but it allows the compiler to eliminate not just rvalues (temporaries), but lvalues (local variables), too, under certain conditions.

What all compilers these days (and for some time now) reliably implement is NRVO in the case where there is a single variable that is passed to every return, and declared at function scope as the first variable:

MyData myFunction() { MyData result; // Declare return val in ONE place if (doing_something) { return result; // Return same val everywhere } // Doing something else return result; // Return same val everywhere } MyData abc = myFunction();

Sadly, many compilers, including GCC, fail to apply NRVO when you deviate even slightly from the basic pattern:

MyData myFunction() { if (doing_something) return MyData(); // RVO expected MyData result; // ... return result; // NRVO expected } MyData abc = myFunction();

At least GCC fails to use NRVO for the second return statement in that function. The fix in this case is easy (go back to the first version), but it’s not always that easy. It is an altogether sad state of affairs for a language that is said to have the most advanced optimisers available to it for compilers to fail to implement this very basic optimisation.

So, for the time being, get your fingers accustomed to typing the classical NRVO pattern: it enables the compiler to generate code that does what you want in the most efficient way enabled by the C++ standard.

If diving into assembly code to check whether a particular patterns makes your compiler drop NRVO isn’t your thing, Thomas Brown provides a very comprehensive list of compilers tested for their NRVO support and I’ve extended Brown’s work with some additional results.

If you start using the NVRO pattern but aren’t getting the results you expect, your compiler may not automatically perform NRVO transformations. You may need to check your compiler optimization settings and explicitly enable them.

2) Return parameters by value whenever possible

This is pretty simple: don’t use “out-parameters”. The result for the caller is certainly kinder: we just return our value instead of having the caller allocate a variable and pass in a reference. Even if your function returns multiple results, nearly all of the time you’re much better off creating a small result struct that the function passes back (via (N)RVO!):

That is, instead of this:

void convertToFraction(double val, int &numerator, int &denominator) { numerator = /*calculation */ ; denominator = /*calculation */ ; } int numerator, denominator; convertToFraction(val, numerator, denominator); // or was it "denominator, nominator"? use(numerator); use(denominator);

You should prefer this:

struct fractional_parts { int numerator; int denominator; }; fractional_parts convertToFraction(double val) { int numerator = /*calculation */ ; int denominator = /*calculation */ ; return {numerator, denominator}; // C++11 braced initialisation -> RVO } auto parts = convertToFraction(val); use(parts.numerator); use(parts.denominator);

This may seem surprising, even counter-intuitive, for programmers that cut their teeth on older x86 architectures. You’re just passing around a pointer instead of a big chunk of data, right? Quite simply, “out” parameter pointers force a modern compiler to avoid certain optimisations when calling non-inlined functions. Because the compiler can’t always determine if the function call may change an underlying value (due to aliasing), it can’t beneficially keep the value in a CPU register or reorder instructions around it. Besides, compilers have gotten pretty smart—they don’t actually do expensive value passing unless they need to (see the next tip). With 64-bit and even 32-bit CPUs, small structs can be packed into registers or automatically allocated on the stack as needed by the compiler. Returning results by value allows the compiler to understand that there isn’t any modification or aliasing happening to your parameters, and you and your callers get to write simpler code.

3) Cache member-variables and reference-parameters

This rule is straightforward: take a copy of the member-variables or reference-parameters you are going to use within your function at the top of the function, instead of using them directly throughout the method. There are two good reasons for this.

The first is the same as the tip above—because pointer references (even member-variables in methods, as they’re accessed through the implicit this pointer) put a stick in the wheels of the compiler’s optimisation. The compiler can’t guarantee that things don’t change outside its view, so it takes a very conservative (and in most cases wasteful) approach and throws away any state information it may have gleaned about those variables each time they’re used anew. And that’s valuable information that can help the compiler eliminate instructions and references to memory.

Another important reason is correctness. As an example provided by Lawrence Crowl in his CppCon 2014 talk “The Implementation of Value Types”, instead of this complex number multiplication:

template <class T> complex<T> &complex<T>::operator*=(const complex<T> &a) { real = real * a.real – imag * a.imag; imag = real * a.imag + imag * a.real; return *this; }

You should prefer this version:

template <class T> complex<T> &complex<T>::operator*=(const complex<T> &a) { T a_real = a.real, a_imag = a.imag; T t_real = real, t_imag = imag; // t == this real = t_real * a_real – t_imag * a_imag; imag = t_real * a_imag + t_imag * a_real; return *this; }

This second, non-aliased version will still work properly if you use value *= value to square a number; the first one won’t give you the right value because it doesn’t protect against aliased variables.

To summarise succinctly: read from (and write to!) each non-local variable exactly once in every function.

4) Organize your member variables intelligently

Is it better to organize member variables for readability or for the compiler? Ideally, you pick a scheme that works for both.

And now is a perfect time for a short refresher about CPU caches. Of course data coming from memory is very slow compared to data coming from a cache. An important fact to remember is that data is loaded into the cache in (typically) 64-byte blocks called cache lines. The cache line—that is, your requested data and the 64 bytes surrounding it—is loaded on your first request for memory absent in the cache. Because every cache miss silently penalises your program, you want a well-considered strategy for ensuring you reduce cache misses whenever possible. Even if the first memory access is outside the cache, trying to structure your accesses so that a second, third, or forth access is within the cache will have a significant impact on speed. With that in mind, consider these tips for your member-variable declarations:

Move the most-frequently-used member-variables first

Move the least-frequently-used member-variables last

If variables are often used together, group them near each other

Try to reference variables in your functions in the order they’re declared

Keep an eye out on alignment requirements of member-variables, lest you waste space on padding

Nearly all C++ compilers organize member variables in memory in the order in which they are declared. And grouping your member variables using the above guidelines can help reduce cache misses that drastically impact performance. Although compilers can be smart about creating code that works with caching strategies in a way that’s hard for humans to track, the C++ rules on class layout make it hard for compilers to really shine. Your goal here is to help the compiler by stacking the deck on cache-line loads that will preferentially load the variables in the order you’ll need them.

This can be a tough one if you’re not sure how frequently things are used. While it’s not always easy for complicated classes to know what member variables may be touched more often, generally following this rule of thumb as well as you can will help. Certainly for the simpler classes (string, dates/times, points, complex, quaternions, etc) you’ll probably be accessing most member variables most of the time, but you can still declare and access your member variables in a consistent way that will help guarantee that you’re minimizing your cache misses.

Conclusion

The bottomline is that it still takes some amount of hand-holding to get a compiler to generate the best code. Good coding-habits are by no means the end-all, but are certainly a great place to start.