

Author: “No Bugs” Hare Follow: Job Title: Sarcastic Architect Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,

Calling a Spade a Spade, Keeping Tongue in Cheek

[[This is Chapter 16(b) from “beta” Volume V of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides free e-copy of the “release” book to those who help with improving; for further details see “ Book Beta Testing “. All the content published during Beta Testing, is subject to change before the book is published.

To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]

The opposite of a fact is falsehood, but the opposite of one profound truth may very well be another profound truth. Niels Bohr —

There are quite a few common wisdoms when it comes to C++ and games. As it always the case when facing a bunch of common wisdoms, some of them have their merits, some are obsolete-beyond-belief, and some are just taken from a very different context and are not really applicable. Let’s take a look at the most popular ones.

“No C++, just plain C” – not for app-level programming

There is a school which teaches that C++ is no good, for performance and reliability reasons.

Many proponents of this point of view quote [Linus] who has said that “C++ is a horrible language”. However, it should be noted, that while Linus MAY have a point against using C++ for operating systems,1 his rant should not be easily generalized to such a different area as games.

“over(ab)using C++ features is a different story, we’ll discuss these features below on case-by-case basisIn short – code in OS’s is changed some orders of magnitude less frequently than for games (or any other business app), so the problem of code maintenance becomes of much more importance in a game world. On the other hand, things such as “we need to handle even the most extreme corner cases, such as being unable to allocate 20 bytes of memory” are of much less importance for a usual game app (whether client or server). As a result, I am firmly going against the camp which says that “C++ is no good for game development” (over(ab)using C++ features is a different story, we’ll discuss these features below on case-by-case basis).

On “C++ as C-with-Classes”

A bit less extreme point of view is that C++ is good, but only as long as we’re using it only as “C with classes”. Well, often I myself am guilty of using C++ along these lines (with a few notable exceptions, pun intended) ;-).

Interpretation of the term “C-with-classes” varies between different developers,2 so when discussing it, I strongly prefer to separate those-C++-features-going-beyond-plain-C, and discuss them one-by-one, to see whether these features are worth using in the context of game development.

In any case, there is more or less a consensus that “classes” in “C with Classes” is a Good Thing™ (in particular, as it encourages encapsulation and decoupling). Now let’s discuss more controversial features of C++.

C++ Exceptions – (Almost-)Zero-Cost until Thrown

There is a “common wisdom” that C++ exceptions are Bad, because they’re damn expensive. Another school of thought says that modern table-driven C++ exceptions are zero cost until they’re thrown. Well, while I am not 100% convinced that exception handling is indeed exactly “zero-cost” until the exception is thrown, but it is certainly very very close. The most common exception-related thing which may affect performance of your app, is an additional cost of calling non-inlined function (and in ancient times there was an additional cost to it due to exceptions, in the range of single-digit CPU clocks). However:

“ It seems that this per-function-call penalty is gone now (though I’m not 100% convinced whether it is 100% gone, and that it is gone on 100% of the widely-used compilers) [TR18015].

[TR18015]. Moreover, there are claims that C++ exceptions are less expensive (until exception is thrown) compared to C-style error-code checking after the function returns. While I didn’t run my own tests to validate this claim, I can tell that the performance difference (if any) between error checking and C++ exception in non-error case is certainly negligible 3 4

to be honest, if you have a function which is that performance-critical, you should force-inline it anyway (more on inlines and force-inlines below), and then even for older compilers the problem will go away

On the other hand, the cost of exception when it is thrown, is pretty large; by the very definition of exception handling, it has to perform RTTI or RTTI-like things (and usually it is direct RTTI, see, for example, [Brailovsky]). In one experiment, the cost of exception thrown-and-caught has been measured to be of the order of 1000-2000 CPU clocks [Ongaro].

Practical conclusions with regards to exceptions:

DO use exceptions; error codes are usually much more error-prone in the context of games. 5

“ Cost of exception handling until the exception is thrown, is negligible

Cost of exception when it is thrown, is significant. Therefore: DON’T use exceptions to return from the inner loop However, using exceptions to perform not-so-normal return from a top-level event handler, is generally ok (as top-level event handlers usually take at least hundreds of thousands of clocks anyway).



Smart Pointers and STL: Allocations are Still Evil

“In other words, if you replace an std::unique_ptr<> with a simple manually-handled new – well, you will get nothing except for an additional headache.Smart pointers are often considered a Big No-No for games. However, I contend that it is not smart pointers per se, but rather related allocations which will hurt your performance (see more on allocations and related lack of spatial locality in “Allocations are Evil. Even More So Because of Locality” section above). In other words, if you replace an std::unique_ptr<> with a simple manually-handled new (or, Stroustrup forbid, with (YourClass*)malloc(sizeof(YourClass))) – well, you will get nothing except for an additional headache. Note though that std::shared_ptr<> is different performance-wise (and usually you CAN beat it as long as you’re sharing only within the same thread).

When it comes to using STL in the context of games, I should admit that it is quite controversial. Overall, I love STL, but it does have its shortcomings when it comes to high-performance stuff, games included. My main concern about STL is that it uses way too many allocations for my taste (and more importantly, for poor CPU cache too), and as it was discussed in “Allocations are Evil. Even More So Because of Locality” section above, allocations are one of the most nasty things performance-wise out there 🙁 .

As a result, I tend to advise along the following lines (YMMV, batteries not included):

DO use STL containers sparingly.

DON’T use STL containers just for the sake of using them. DO use arrays (including std::array<>) when size is known or very limited (that is, properly wrapped with bound checks) where applicable. Even better, DO use eastl::fixed_*<> containers (see more on EASTL below). The same applies to char arrays in lieu of std::string (though for std::string it is usually mitigated these days by small string optimisation (SSO), see [[TODO]] section above). Once again, proper wrapping (such as in eastl::fixed_string) is necessary.

“ As a rule of thumb, DO favour std::vector<> over std::list<> (as it has MUCH better locality). Use cases for std::list<> do exist, but they’re rather few and far between. BTW, one thing which makes our life better under C++11, is that std::list<>::size() is now required to be constant-time a.k.a. O(1) (pre-C++11 it was allowed – and often was – O(N) which was counterintuitive and caused significant problems 🙁 ). If this is critical for you I’d still suggest to double-check that it stands for your compiler/library.

As a rule of thumb (and unless you need ordering), DO favour std::unordered_map<> over std::map<> (the same applies to all tree-based containers, namely to multimap<>, set<>, and multiset<>). std::unordered_*<> containers tend to cause MUCH less cache misses when performing searches, and as a result tend to be MUCH more efficient. Another way of reaching the same conclusion is to say that unless you need ordering – you should tell the world about it by using std::unordered_*<>, making the code more readable. Caveat: hash table collisions can be a Quite a Bad Thing for std::unordered_*<> containers, including potentially opening a door for DoS attacks (if your keys come from Clients) 🙁 . Be careful to make sure your hash function is good for your keys, have run-time reporting on number of collisions for critical collections, and see Vol.3 for discussion on DoS. If you do need ordering, DO consider sorted arrays/vectors instead of std::map<> (or even better, eastl::vector_map<> etc., more on EASTL below). While not always possible (due to MUCH higher addition/deletion costs), quite often it can make wonders.

DO use optimized STL versions such as [EASTL] DO use eastl::fixed_*<> containers (they have “overflow feature” too). DO use eastl::intrusive_*<> containers where applicable DO use eastl::vector_map<> etc.

“ Most importantly, DO understand underlying implementation of STL containers std::vector<> is a single allocated block with some reserve (reallocated when you go over reserved portion). Which can be translated to “very good data locality” 🙂 🙂 std::list<> is a double-linked list. Which means “locality is poor” 🙁 std::map<>, std::multimap<>, std::set<>, and std::multiset<> are all binary trees. Means “lots of allocations, very poor data locality” 🙁 🙁 std::unordered_*<> containers are hash maps. Can be translated into “risk of collisions, and quite a few allocations but pretty good data locality for your usual find()” 🙂

DO consider [short_alloc] where applicable – that is, after you understand its implications (it is an on-stack allocator). Whenever it works (and this is usually limited to temporaries only) – it saves quite a bit of CPU clocks while allowing you to use your usual std:: containers.

In addition to STL containers, DO use STL algorithms too. Most common candidates are those things which are not that trivial to write yourself, including binary search stuff (binary_search()/lower_bound()/upper_bound()), random_shuffle() (yes, it is very easy to make a mistake in shuffle), all kinds of sort, and all kinds of operations on sorted ranges. However, DON’T think that STL will do everything for you (doing so will likely to cause you think that everything out there is a nail). In some cases, you DO need to write your own algos, it is just that whenever the-algo-you-need is already written in STL – well, it is usually better to use it :-). Whether to use trivial algos such as find_if() etc. – is up to you.



Using boost:: – Case by Case and Only If You Do Need It

You may ask: “ok, you’ve said about STL, but what about boost::?” My position on boost:: usually goes as follows:

almost-all the MUST-HAVE stuff from boost:: is already in std::, phew (and THANKS A LOT to boost:: guys to make it happen) one thing which you MAY still want is boost::Container; note though that it has significant overlaps with eastl (for example, boost::container::flat_*<> containers are very similar to eastl::vector_{set|map}<>, and boost::container::static_vector<> and boost:container::small_vector<> are very similar to eastl::fixed_vector<>). Overall, from the point of view of containers, as of beginning of 2016 I still tend to prefer EASTL.

boost:: as a whole is a Damn Large set of libraries

as a result, while there CAN be things in boost:: which you MIGHT need, I do NOT recommend to say “we’re using all the boost::”. Instead, I suggest an approach of: waiting until a need for a specific boost:: library arises discussing it (and implications, including performance ones) by the team and deciding that “yes, this specific boost:: library is a good fit for our purposes”



Polymorphism – Two-Fold Impact

“in practice, there are usually two different costs associated with polymorphismYet another performance-related “common wisdom” says that polymorphism is a Bad Thing performance-wise and should be avoided. Well, I need to say that there in practice, there are usually two different costs associated with polymorphism. The first (explicit) cost is the cost of taking a VTABLE pointer from the body of your polymorphic class, and going into VTABLE (which MAY cause a cache miss, though it is unlikely to cause any real issues as VTABLE is going to be cached if it is accessed anywhere close to often). In addition, there is usually an additional cost similar in nature to that of branch misprediction (see below).

However, there is a second (implicit) cost of polymorphism as-it-is-traditionally-used. Usually, to exploit polymorphism, we’re creating our polymorphed objects via new. And, as I noted above quite a few times, new is a Bad Thing performance-wise (in particular, because in practice it causes that dreaded cache miss much more frequently than VTABLE access). On the positive side, this second part of polymorphism-related costs CAN be avoided (see discussion on it in “Mitigation: Reducing number of Allocations” section above).

Overall, unless you’re using polymorphism in some Really Crazy manner, I didn’t see it to cause anywhere significant performance degradation (that is, beyond the costs of allocation+poor locality).

RTTI – Can be Easily Avoided Most of the Time

RTTI (a.k.a. Run-Time Type Information) is quite expensive when it’s used. Honestly, I haven’t seen any real-world uses for RTTI beyond dynamic_cast<> (while such uses MIGHT exist, I certainly missed out on them).

And when it comes to dynamic_cast<>, my position goes as follows:

“ most of the time, it is trivial to replace dynamic_cast<> (together with related code) with your own virtual function

If you CANNOT replace dynamic_cast<> with your own virtual function (and the only case I know for it, is when you’re not allowed to modify base class) – then it MAY be a legitimate use for dynamic_cast<>.

And that’s about it on RTTI.

On Branch (mis)-Predictions and Pipeline Stalls

One further topic which is routinely raised in the context of performance, is branch predictions and pipeline stalls. I won’t go into a lengthy discussion about it (see, for example, [Wikipedia.BranchPredictor] for Branch Predictions 101), but what really matters from our perspective, is an observation that:

If there is an “if” statement in the program,6 and execution goes along the “wrong” path7 – it is going to cost us

While the cost of misprediction is not that high (10-20 CPU clocks), in some of the Very Inner Loops it might start playing its role.

Dealing with mispredictions can be done in one of the two following ways.

The first way is to remove conditional processing completely (NB: do NOT do this until you are sure that your bottleneck is within this piece of code). For example,

a += (b ? 0 : c);//all operands including b are uint32_t // b is guaranteed to be either 0 or 1

can be rewritten into:

a += (b-1)&c;//no conditional processing here assert( ((b-1)&c) == (b ? 0 : c) ); //NB: if your b is not guaranteed to be 0 or 1, // you can just use (!!b-1)&c - most of the time it will compile // without any branching too

Here (b-1) can take values of 0 or (uint32_t)(-1) (the latter equal to 0xFFFFFFFF). Then, bitwise-anding it with c, we’ll get either 0 or c, which is exactly what we need. Note that operand types are Really Important here (and DON’T try to use such tricks until (a) you’re sure that your bottleneck is here, (b) you know EXACTLY what you’re doing).

In theory, compiler should be able to optimize these things for us. In practice, however, more often than not it doesn’t 🙁 (things MAY improve in the future though). So, for Really Performance-Critical pieces of code we MAY try doing it ourselves (NB: after making such a change, we need to measure performance, and if the trick didn’t help – rollback the change to keep the code more readable).

[[TODO: add ref to “Bit Hacks for Games” from “Game Engine Gems 2”]

As with pretty much any other optimisation trick – make sure NOT to use such things unless you have already identified your bottleneck, otherwise you’re risking to end up with lots of convoluted and barely-readable code, and from my experience, a convoluted unreadable code tends to inhibit further (often MUCH more significant) optimizations.

“The second way of dealing with branch misprediction, is to tell the compiler which of the branches is the “right” (more likely) way of the code execution goingThe second way of dealing with branch misprediction, is to tell the compiler which of the branches is the “right” (more likely) way of the code execution going. If your code gets into each branch with 50% probability, you can’t really define “right” and “wrong” (and you cannot do much about branch misprediction besides replacing it with arithmetics), but if probability is not that even, you DO have a chance to deal with it.

To tell GCC/Clang which branch is “right”, you can use __builtin_expect() intrinsic pseudo-function (better still, use likely()/unlikely() macros as described here: [Kerrisk]; also if you’re about to use these things, make sure to read the whole post of his, there is a lot of useful stuff there). For MSVC, there are no such intrinsics, but there is an observation (which stands in most, but not in all cases [TODO]) that the “if” branch of the “if” statement is usually treated by MSVC compiler as “likely” (and “else” branch as “unlikely”).

However, the most important thing about using likely() is the following:

DON’T use likely() etc. until you’re sure that It Is Your Bottleneck! And keep in mind that making a wrong guess can easily kill performance 🙁 .

On inline asm and SSE

There are cases where you may want to write inline asm. However, they should be REALLY few and far between. Contrary to one popular belief, inline asm still CAN improve your program. And contrary to another popular belief, it is Very Rarely worth the trouble. If you decide to go against my advice and go inline asm for app-level code, keep in mind that:

“ Correct operation of inline asm MAY easily depend on the context where you use it In particular, if you have your inline asm as a part of inline function, and make a mistake with a clobbered register list for GCC – all kinds of things can happen (the code may work for one particular invocation of this inline, and when you call it from a different place – a dragon can eat your code, or even worse – an “undefined behavior” can happen).

Inline asm is Very Different for GCC and MSVC (and on MSVC it doesn’t even exist for x64 and ARM)

At some point in time, x86 SSE was an excuse to use inline asm; these days, fortunately enough, SSE intrinsics were added, so you can program SSE code without the hassle of inline asm (and in mostly compatible-between-GCC-and-MSVC-manner, though obviously still restricted to x86/x64). In short – for SSE, DO use intrinsics rather than inline asm (well, as long as it is possible).

On inlines and force-inlines

One thing which is often underestimated in performance department, is the cost of function calls. [Efficient C++] gives the cost of function call at 25-250 clocks (depending on number of arguments), and my personal experience for your-usual-function-with-one-or-three-arguments is about 20-30 clocks, so we’re pretty much on the same page here.

If it happens in the innermost loop of your algorithm, such a cost can be very considerable. And worse than that,

From my experience, as of 2016 compilers are still notoriously bad with deciding whether to ignore you inline specification or not 🙁 . 8

“Compiler will use all its Next-to-Divine Wisdom to show you that it is smarter than you are, and to ignore most of those inline specifications you carefully wrote.Yes, you read it correctly: if you designated your function as an inline, it doesn’t mean that it will be inlined. Compiler will use all its Next-to-Divine Wisdom to show you that it is smarter than you are, and to ignore most of those inline specifications you carefully wrote. In such cases, forcing inline usually helps; it can be done with __forceinline in MSVC, and __attribute__((always_inline)) in GCC (and BTW, make sure to make a cross-platfrom #define FORCEINLINE that works everywhere).

However, as with other performance tricks, I need to re-empasize:

DON’T overdo it. Use force-inline ONLY after you have identified the bottleneck. Indiscriminate use of force-inline can easily hurt performance.

One further inline-related trick applies if you’re too lazy to put all your inlinable functions into headers (which you normally SHOULD); in this case, -flto option of GCC/Clang has been observed to allow for usual (non-forced) inlining across your .obj files (it also allows for a whole bunch of other optimisations, so it is worth trying anyway).

On templates

One thing which was a long-running scare of early-days C++, is “template bloat”. It refers to the compiler creating numerous instances of pretty-much-the-same code when it instantiates templates. In practice, I didn’t see it as a real problem (that is, unless you’re doing Really Crazy Things), and these days C++ templates are even used on MCUs which have very limited amount of program memory. As we’re not on MCUs (phew), we should be generally be fine with templates (that is, unless we’re into deeply recursive templates – these DO have a potential to get ugly 🙁 ).

If code bloat is still a concern, a technique similar to the one in [Druy] can be used, though I’ve never seen it to be Desperately Needed in practice.

On compiler flags

“One of the first things novice developers are trying to do when optimizing for performance, is trying to play with compiler flags.One of the first things novice developers are trying to do when optimizing for performance, is trying to play with compiler flags. Well, usually it helps; OTOH, it helps only a little bit compared to what a reasonable code optimization can achieve (and even less compared to what removing of the outright-crazy-stuff can achieve). The main problem with compiler flags is that usually they’re applied indiscriminately to the whole project, and most of the optimizations are not that universally useful as we’d like. Better results can often be achieved by applying flags on per-subproject (or even per-file) basis.

For GCC, the most popular performance-affecting options are:

-O2 or -O3 (note that O3 often produces worse performing code than O2, so you need to try both(!))

-march (well, at least for Server-Side you know your architecture exactly, and for Client-Side finding something less than -march=corei7, will be quite difficult these days). Also for Server-Side you MAY want to consider compiling on exactly the target server with -march=native.

-fomit-frame-pointer (not much gain, but why not?)

-funroll-loops (results vary, but certainly worth trying for critical code)

-flto (make sure to read GCC/Clang recommendations on using it)

-fno-math-errno

-ffast-math (includes -fno-math-errno and a whole bunch of other stuff) and -ftree-vectorize. NB 1: these options MAY cause not-exactly-IEEE-compliant behaviour for floating-point math. NB 2: I’ve heard about a significant synergy between these two options, so you SHOULD try them together (in addition to trying them one by one).

and don’t forget about -fprofile-generate; it is going to take (quite) a bit of effort, but is often worth it (that is, if you’re concerned about performance)

Note that I’m not saying that these options will necessarily help; what I’m saying is that you MAY want to try them, run your compiled app, and see whether there is any difference.

On Manual Loop Unrolling

Using compiler option such as -funroll-loops, can make your app run faster – or slower, if you’re not that lucky. However, it doesn’t mean that loop unrolling is useless – it just means that indiscriminate loop unrolling can have their problems.

As a result, I’ve seen manual unrolling of the innermost most-critical loop to help quite a bit (like “up to 2x improvement”). It is a Big Headache though, so once again –

NEVER EVER DO IT until you’re sure that it is your bottleneck.

[[TODO: x64-specific: on float-to-int switching (also on SSE)]]

Putting Things into Perspective: Rough Costs in Clocks

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3% Donald Knuth —

The following table tries to summarize some of the CPU clock costs which typically occur in the program:9

Operation Cost in CPU clocks “Simple” register-register op (ADD,OR,etc.) <1 Memory write ~1 “right” branch of “if” ~1 Integer Multiplication 3-6 L1 read 4 L2 read 10-12 “wrong” branch of “if” (branch misprediction) 10-20 L3 read 30-50 Function call 25-250 Additional cost of the function call being polymorphic (NOT accounting for cache misses due to poor locality if allocation is used) 20-3010 Main RAM read 100-150 NUMA: different-socket L3 read 100-200 NUMA: different-socket main RAM read 200-300 Allocation+deallocation pair 200+11 User-to-kernel switch and back (i.e. minimal cost of kernel function) 300-500 Exception thrown+caught 1000-2000 Context switch (direct costs) 2000 Context switch (total costs, including cache invalidation) 10K-1M

And a practical suggestion based on the numbers above:

Three things Really Matter for performance. The first one is Algorithm, the second one is your code being Non-Blocking, and the third one is Data Locality. The rest of the tricks, while potentially useful, should be used only on case-by-case basis, when a bottleneck is identified. Otherwise it will be a Premature Optimisation.

“For-free” Optimizations a.k.a. Avoiding Premature Pessimization

One exception to the “DON’T optimize too early” rule above is “optimizations” which do NOT cause any trouble when you’re developing (in particular, do NOT reduce readability and compatibility). In some sources (see, for example, [Sutter]), not doing such “for-free” optimizations is even referred to as “premature pessimization”.

“One example of such a 'for-free' optimization is making sure that all your loops have standalone increments written as ++it instead of it++.One example of such a “for-free” optimization is making sure that all your loops have standalone increments written as “++it” instead of “it++”. The point here is that any kind of postfix operation on a non-integer type (such as iterator), strictly speaking, requires a temporary copy. While compiler MAY be able to eliminate this temporary as a part of optimization, well – why not save it the trouble (and not to avoid any chance that it will be suboptimal on some-other-compiler) –

Especially as it costs you pretty much nothing?

Other tricks of the similar “cost-nothing” nature include:

passing structs and classes by const reference (rather than by value). If you are not doing it yet – chances are that you’re doing something wrong. Note that since C++11, due to copy elision (which is not mandated but is a de-facto standard by modern compilers), it is usually better performance-wise to pass classes/structs/collections by value – but ONLY if you’re going to create a copy of the relevant parameter within your function anyway (see discussion in [StackOverflow.PassByValue]). Note that returning values via non-const reference parameters (essentially output parameters), while still popular, gets out of favour and with move constructors is rarely really necessary

using initialization over assignment a special case: using “: str(s)” instead of “str=s;” in constructors to initialize string and other allocated data members

specifying “move constructors” and “move assignment operators” where applicable specifying them as noexcept wherever applicable



[[To Be Continued…

This concludes beta Chapter 16(b) from the upcoming book “Development and Deployment of Multiplayer Online Games (from social games to MMOFPS, with social games in between)”. Stay tuned for beta Chapter 16(c), dedicated to some of C++ “best practices” (well, at least as I understand them ;-)).]]

Acknowledgement

Cartoons by Sergey Gordeev from Gordeev Animation Graphics, Prague.