A couple of years ago I wrote an article about the Curiously Recurring Template Pattern in C++, focusing on the motivation behind it and how to implement it.

That article mentioned runtime performance as the main reason for employing CRTP instead of the more traditional runtime polymorphism (dispatch via virtual functions). While some rationale for the cost of virtual calls was given, I didn't go too deep into it. Today I want to fix that by carefully analyzing the performance of virtual calls as opposed to the static calls made possible by CRTP.

Mandatory precaution about benchmarks

Benchmarking in 2013 is really hard. Today's CPUs are super-pipelined, branch-predicting, out-of-order executing beasts. The memory hierarchy is very deep and the caches have complex behavior. All of this makes detailed performance analysis devilishly complex, and the results are sometimes baffling. We're clearly long past counting MIPS. Add to that overly-clever optimizing compilers that occasionally produce not quite the code you expected, and it's apparent why so many online resources and articles provide bad benchmarks.

So any benchmarks need to be taken with a large grain of salt, including the one posted here. Personally, I try to validate the benchmarks I'm running by attacking them with the scientific method:

1. First, create a hypothesis about the relative speed of two approaches.
2. Take a detailed look at the code generated by the compiler to verify the hypothesis w.r.t. code generation - is this the machine code you expected to see?
3. Run the benchmark and compare the runtime to the initial hypothesis, as well as to (2) - while not perfect, performance is easier to correlate to machine code than to original source code.
4. If anything doesn't feel right, or just to make (3) more careful, use low-level counters to make sure that the number of instructions executed and other such details makes sense given (2).

Hypothesis - what makes virtual calls slower

The previous article listed the following components in the runtime cost of virtual calls:

1. Extra indirection (pointer dereference) for each call to a virtual method.
2. Virtual methods usually can’t be inlined, which may be a significant cost hit for some small methods.
3. Additional pointer per object. On 64-bit systems, which are prevalent these days, this is 8 bytes per object. For small objects that carry little data this may be a serious overhead.

While the third component can definitely play a role in some scenarios (e.g. a lot of small objects, where the additional memory means fewer of them fit into the L1 data cache), I'll focus on the first two in this article, because they are easier to expose in a simple synthetic benchmark.
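To make the third point concrete, here is a minimal sketch (the type names are mine, not from the benchmark) that demonstrates the per-object vptr overhead on a typical 64-bit ABI:

#include <cstdint>
#include <cstdio>

// Illustrative types only: one counter without virtual methods, one with.
struct PlainCounter {
  uint64_t counter;
};

struct VirtualCounter {
  virtual void tick(uint64_t n) { counter += n; }
  uint64_t counter;
};

int main() {
  // On a typical 64-bit ABI this prints 8 and 16: the vptr doubles the object size.
  std::printf("%zu %zu\n", sizeof(PlainCounter), sizeof(VirtualCounter));
  return 0;
}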

The source code - what are we comparing?

There's a plethora of uses for polymorphism in C++. Here I'll focus on a basic one that will let me expose the performance characteristics of virtual calls. I'll define a simple interface with a couple of methods and one implementation of it:

#include <cstdint>

class DynamicInterface {
public:
  virtual void tick(uint64_t n) = 0;
  virtual uint64_t getvalue() = 0;
};

class DynamicImplementation : public DynamicInterface {
  uint64_t counter;

public:
  DynamicImplementation()
    : counter(0) {
  }

  virtual void tick(uint64_t n) {
    counter += n;
  }

  virtual uint64_t getvalue() {
    return counter;
  }
};

The following code runs the actual benchmark:

const unsigned N = 40000;

void run_dynamic(DynamicInterface* obj) {
  for (unsigned i = 0; i < N; ++i) {
    for (unsigned j = 0; j < i; ++j) {
      obj->tick(j);
    }
  }
}

What this does is simply invoke the virtual method tick on the base pointer obj on the order of N^2 times.

The alternative statically-polymorphic implementation is this:

template <typename Implementation>
class CRTPInterface {
public:
  void tick(uint64_t n) {
    impl().tick(n);
  }

  uint64_t getvalue() {
    return impl().getvalue();
  }

private:
  Implementation& impl() {
    return *static_cast<Implementation*>(this);
  }
};

class CRTPImplementation : public CRTPInterface<CRTPImplementation> {
  uint64_t counter;

public:
  CRTPImplementation()
    : counter(0) {
  }

  void tick(uint64_t n) {
    counter += n;
  }

  uint64_t getvalue() {
    return counter;
  }
};

template <typename Implementation>
void run_crtp(CRTPInterface<Implementation>* obj) {
  for (unsigned i = 0; i < N; ++i) {
    for (unsigned j = 0; j < i; ++j) {
      obj->tick(j);
    }
  }
}
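The full benchmark harness isn't reproduced in this article, so the following driver is only a plausible sketch of how both paths might be exercised; the 'd' / 'c' command-line selection is an assumption based on the perf invocations shown later, and it assumes the definitions above are in the same file.

#include <cstdio>

// Hypothetical driver (not the article's actual code): 'd' runs the dynamic
// version, anything else runs the CRTP version.
int main(int argc, char** argv) {
  if (argc < 2) {
    std::fprintf(stderr, "usage: %s d|c\n", argv[0]);
    return 1;
  }
  if (argv[1][0] == 'd') {
    DynamicImplementation obj;
    run_dynamic(&obj);
    std::printf("%llu\n", (unsigned long long)obj.getvalue());
  } else {
    CRTPImplementation obj;
    run_crtp(&obj);
    std::printf("%llu\n", (unsigned long long)obj.getvalue());
  }
  return 0;
}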

Generated code - how virtual calls look under the hood

Now let's spend some time studying the machine code generated by gcc -O2 (version 4.8) from the code above. The code for DynamicImplementation::tick is very compact:

0000000000400cf0 <_ZN21DynamicImplementation4tickEm>:
  400cf0:       add    %rsi,0x8(%rdi)
  400cf4:       retq

To understand what this means, some familiarity with the Itanium C++ ABI is required. The ABI in this case mandates both the name mangling that produces the weird symbol name, and the layout of the object in memory, which mandates how its fields are accessed. Here's a short description for the code above:

Since DynamicInterface has virtual methods, the class hierarchy it begets comes with a virtual method table, a pointer to which resides in each object. This is the way the compiler arranges for the runtime code to call the correct method when an actual object is used. The address of the virtual method table (vptr) is in the beginning of the object, and the actual class members come afterwards. So counter lives at offset 8 in DynamicImplementation objects.

add    %rsi,0x8(%rdi)

%rdi is the first argument to tick, which is the hidden this pointer - the address of the object. Hence 0x8(%rdi) is the address of this->counter. The instruction, then, adds n (passed in %rsi according to the calling convention) to this->counter.

By the way, whenever you're curious about object layouts and want to verify your understanding of the ABI, I find Clang's ability to dump the class record layouts very helpful. In this case:

*** Dumping AST Record Layout
   0 | class DynamicImplementation
   0 |   class DynamicInterface (primary base)
   0 |     (DynamicInterface vtable pointer)
   8 |   uint64_t counter
     | [sizeof=16, dsize=16, align=8
     |  nvsize=16, nvalign=8]

*** Dumping AST Record Layout
   0 | class CRTPImplementation
   0 |   class CRTPInterface<class CRTPImplementation> (base) (empty)
   0 |   uint64_t counter
     | [sizeof=8, dsize=8, align=8
     |  nvsize=8, nvalign=8]

On to the invocation of tick now. This is the disassembly for run_dynamic, annotated with comments:

0000000000400c10 <_Z11run_dynamicP16DynamicInterface>:
  400c10:       push   %r13
  400c12:       mov    $0x1,%r13d
  400c18:       push   %r12
                // r12d holds i, initialized to 0
  400c1a:       xor    %r12d,%r12d
  400c1d:       push   %rbp
                // Place obj in %rbp
  400c1e:       mov    %rdi,%rbp
  400c21:       push   %rbx
  400c22:       sub    $0x8,%rsp
  400c26:       nopw   %cs:0x0(%rax,%rax,1)
  400c30:       test   %r12d,%r12d
                // when i is 0, the body of the loop won't run, so increment
                // both i and j and try again.
  400c33:       je     400c5e
                // rbx holds j, initialized to 0
  400c35:       xor    %ebx,%ebx
  400c37:       nopw   0x0(%rax,%rax,1)
                // Place the address of obj's vtable in rax
  400c40:       mov    0x0(%rbp),%rax
                // j is the second argument of tick
  400c44:       mov    %rbx,%rsi
                // j++
  400c47:       add    $0x1,%rbx
                // obj is the first argument of tick ('this' pointer)
  400c4b:       mov    %rbp,%rdi
                // tick is the first entry in the vtable.
                // This calls obj->tick(obj, j)
  400c4e:       callq  *(%rax)
                // Compare j < i and perform inner loop
  400c50:       cmp    %ebx,%r12d
  400c53:       ja     400c40
                // Compare i == 40000 and perform outer loop
  400c55:       cmp    $0x9c40,%r13d
  400c5c:       je     400c68
  400c5e:       add    $0x1,%r13d
  400c62:       add    $0x1,%r12d
  400c66:       jmp    400c30
  400c68:       add    $0x8,%rsp
  400c6c:       pop    %rbx
  400c6d:       pop    %rbp
  400c6e:       pop    %r12
  400c70:       pop    %r13
  400c72:       retq
  400c73:       data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

The interesting parts here are:

1. How obj->tick is actually invoked. Since tick is the first method in DynamicInterface, it sits in the first slot in the vtable. So to actually call it, we have a double indirection from obj - one to get to the vtable, the other to get to the method in the vtable. A source-level model of this double indirection is sketched below.
2. The constituents of the inner loop - the part that the program spends the vast majority of its time executing. We'll get back to it for a more careful analysis later.
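To make the double indirection concrete, here is a rough source-level model of what the virtual call lowers to, following the Itanium ABI layout described above. This is purely illustrative - the names are hypothetical and this is not how the compiler actually represents things in C++:

#include <cstdint>

// Hypothetical model of the object layout the ABI dictates.
struct VTable {
  void (*tick)(void* self, uint64_t n);   // slot 0
  uint64_t (*getvalue)(void* self);       // slot 1
};

struct ObjectLayout {
  const VTable* vptr;   // offset 0: pointer to the class's vtable
  uint64_t counter;     // offset 8: first data member
};

void call_tick(ObjectLayout* obj, uint64_t j) {
  // First indirection: load the vtable pointer from the object;
  // second indirection: load slot 0 from the vtable and call through it.
  obj->vptr->tick(obj, j);
}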

How CRTP calls look under the hood

Now it's time to disassemble the equivalent code that uses CRTP for static polymorphism. Again, we'll want to start with CRTPImplementation::tick, but we won't find it in the disassembly because it was fully inlined into run_crtp. The compiler was able to inline it because it could know statically (at compile time) which method is called. Such inlining is an important tenet of the "zero-cost abstractions" philosophy of modern C++.

Let's go straight to run_crtp, then:

0000000000400d00 <_Z8run_crtpI18CRTPImplementationEvP13CRTPInterfaceIT_E>:
                // Place obj->counter into rdx
  400d00:       mov    (%rdi),%rdx
  400d03:       mov    $0x1,%esi
                // rcx holds i, initialized to 0
  400d08:       xor    %ecx,%ecx
  400d0a:       nopw   0x0(%rax,%rax,1)
  400d10:       test   %ecx,%ecx
  400d12:       je     400d36
                // rax holds j, initialized to 0
  400d14:       xor    %eax,%eax
  400d16:       nopw   %cs:0x0(%rax,%rax,1)
                // counter += j
  400d20:       add    %rax,%rdx
                // j++ and perform inner loop
  400d23:       add    $0x1,%rax
  400d27:       cmp    %eax,%ecx
  400d29:       ja     400d20
  400d2b:       cmp    $0x9c40,%esi
                // when we're done, put the final value back into obj->counter
  400d31:       mov    %rdx,(%rdi)
  400d34:       je     400d3e
  400d36:       add    $0x1,%esi
  400d39:       add    $0x1,%ecx
  400d3c:       jmp    400d10
  400d3e:       repz retq

It's not hard to see why we'd expect this code to run much faster, for two main reasons:

1. Since the tick dispatch was inlined, the compiler was free to see that all it does is a simple member increment. The member is then kept in rdx and the loop can simply bump a register, instead of having a call on each iteration.
2. As there's no call involved, the inner loop is shorter.
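Put differently, after inlining the compiler effectively gets to compile something like the following sketch - my own source-level rendering of the machine code above, not code that appears in the benchmark:

#include <cstdint>

uint64_t run_crtp_equivalent(uint64_t counter) {
  const unsigned N = 40000;        // same N as in the benchmark
  // 'counter' lives in a register (%rdx above) for the duration of the loops.
  for (unsigned i = 0; i < N; ++i) {
    for (unsigned j = 0; j < i; ++j) {
      counter += j;                // the fully inlined body of tick()
    }
  }
  return counter;                  // eventually stored back to obj->counter
}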

Performance numbers

As expected, the CRTP approach is much faster. The benchmark above takes 1.25 seconds on my i7-4771 CPU for run_dynamic and 0.21 seconds for run_crtp. This is a huge difference, and it's much larger than I expected. I was looking for a 2x boost, not 6x. So here comes the 4th bullet of the benchmarking methodology I outlined above. Let's look more carefully at the numbers.

I'll start with producing a trace of the inner loop for both cases, to see the sequence of instructions executed. Since the loop is short, this can be easily done with basic disassembly reading, and also verified with gdb by stepping through the execution for a few iterations. Here is the inner loop for run_dynamic:

  400c40:       mov    0x0(%rbp),%rax
  400c44:       mov    %rbx,%rsi
  400c47:       add    $0x1,%rbx
  400c4b:       mov    %rbp,%rdi
  400c4e:       callq  *(%rax) ... calls tick
    400ce0:       add    %rsi,0x8(%rdi)
    400ce4:       retq
  400c50:       cmp    %ebx,%r12d
  400c53:       ja     400c40

How many times would we expect it to run? The double loop has a simple summing pattern, so we can compute that it runs in the vicinity of N/2 * N, which in our case means 800e6 (800 million times). Since the loop above is 9 instructions long, that means 7.2e9 instructions in total. Let's look at detailed perf stat numbers for this run:

Performance counter stats for 'build/vcall-benchmark d':

      1253.807247 task-clock                #    0.999 CPUs utilized
              107 context-switches          #    0.085 K/sec
                0 cpu-migrations            #    0.000 K/sec
              318 page-faults               #    0.254 K/sec
    4,807,848,980 cycles                    #    3.835 GHz
  <not supported> stalled-cycles-frontend
  <not supported> stalled-cycles-backend
    7,203,771,146 instructions              #    1.50  insns per cycle
    2,400,716,784 branches                  # 1914.742 M/sec
           58,358 branch-misses             #    0.00% of all branches

      1.255560284 seconds time elapsed

Indeed, the number of instructions fits our expectation.

Now, let's turn to run_crtp. Its inner loop is this:

  400d20:       add    %rax,%rdx
  400d23:       add    $0x1,%rax
  400d27:       cmp    %eax,%ecx
  400d29:       ja     400d20

So only 4 instructions. In other words, we'd expect the total number of instructions executed to be in the area of 3.2e9. Let's see:

Performance counter stats for 'build/vcall-benchmark c':

       215.919352 task-clock                #    0.997 CPUs utilized
               18 context-switches          #    0.083 K/sec
                0 cpu-migrations            #    0.000 K/sec
              318 page-faults               #    0.001 M/sec
      809,355,502 cycles                    #    3.748 GHz
  <not supported> stalled-cycles-frontend
  <not supported> stalled-cycles-backend
    3,202,645,106 instructions              #    3.96  insns per cycle
      800,522,521 branches                  # 3707.507 M/sec
           53,684 branch-misses             #    0.01% of all branches

      0.216596060 seconds time elapsed

Bingo! But wait, a 2.25x difference in the number of instructions should not have translated into a 6x difference in runtime, right? Note the number of branches, though. While the CRTP run has one branch per inner loop iteration, the numbers for the dynamic run show 3 branches per inner loop iteration (for a total of 2.4e9). What gives? The CPU considers indirect calls and returns as branches for this purpose, and if you think about it, this makes sense. An indirect branch or return transfers control to a location the CPU cannot determine statically (unlike a direct call, for instance) - it depends on the contents of registers and the stack. So the CPU doesn't know where to fetch instructions ahead of time in order to satisfy its eternally hungry super-pipeline. True, the branch predictor alleviates most of that cost, but such instructions are still more expensive for the CPU than, say, simple adds, because they cannot pump through the pipeline as quickly. Moreover, the call and ret instructions push and pop data to and from the stack, which resides in memory. It's almost certainly in L1 cache, but that's still more expensive to access than registers.
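As a quick sanity check of the arithmetic above, here is a tiny sketch (mine, not part of the benchmark) that reproduces the expected instruction and branch counts from the iteration count and the per-iteration counts read off the disassembly:

#include <cstdio>

int main() {
  const unsigned long long N = 40000;
  unsigned long long iters = N * (N - 1) / 2;            // inner-loop executions: ~8e8
  std::printf("iterations:         %llu\n", iters);
  std::printf("dynamic insns    ~  %llu\n", iters * 9);  // ~7.2e9, matches perf
  std::printf("crtp insns       ~  %llu\n", iters * 4);  // ~3.2e9, matches perf
  std::printf("dynamic branches ~  %llu\n", iters * 3);  // ~2.4e9, matches perf
  std::printf("crtp branches    ~  %llu\n", iters * 1);  // ~8e8, matches perf
  return 0;
}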

Variation: -O3 compilation

Vigilant readers might have noticed that I did not set the highest optimization level of gcc for this benchmark. This was done on purpose, to make the results simpler to explain. When compiled with -O3, the dynamic version runs as before (and the code produced for it is the same), but the CRTP version runs even faster and finishes within 0.17 seconds, which is 7.2x faster than the dynamic version.

The extra boost comes from auto-vectorization. When one looks at the code produced by the compiler for run_crtp, one can see SIMD instructions in there. The inner loop was unrolled 4x and the operations are performed on whole quad-words, combining several inner loop iterations at a time. So this is an example where previous optimizations (inlining) enabled the compiler to apply even more advanced optimizations such as vectorization to make the code faster yet.
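I haven't reproduced the -O3 disassembly here, but as a rough source-level illustration of what 4x unrolling does to the inner loop (a sketch of mine; the real output uses SIMD registers and is only meant to be conveyed in spirit here):

#include <cstdint>

uint64_t sum_inner(unsigned i, uint64_t counter) {
  uint64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
  unsigned j = 0;
  // Four iterations' worth of work per pass; the independent accumulators map
  // naturally onto SIMD lanes.
  for (; j + 4 <= i; j += 4) {
    acc0 += j;
    acc1 += j + 1;
    acc2 += j + 2;
    acc3 += j + 3;
  }
  for (; j < i; ++j) {   // scalar tail for the leftover iterations
    counter += j;
  }
  return counter + acc0 + acc1 + acc2 + acc3;
}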

Variation: disabling inlining

It's also interesting to build the benchmark with -fno-inline and compare the results. Curiously, in this case the CRTP approach runs 1.5x slower than virtual calls. Before you read on, can you guess why?

The reason is quite simple. Note that for proper CRTP, the interface class implements the interface methods and calls through to the implementation. So to actually invoke tick, run_crtp calls:

1. CRTPInterface<CRTPImplementation>::tick, which calls
2. CRTPInterface<CRTPImplementation>::impl, and then calls
3. CRTPImplementation::tick

This is a lot of calls, which all have to be executed when the inliner is turned off. When it's turned on, all of these calls get inlined and the actual instructions of the leaf call are embedded into run_crtp.

There are two lessons here:

1. Be careful with benchmarking, but you already knew that ;-)
2. When implementing inlining in a compiler, it's super important to make the inliner iterative - doing multiple passes on the code and discovering new inlining opportunities in each iteration.

Devirtualization

A brand new optimization that I recently heard about is devirtualization. The idea is to find cases of dynamic dispatch where the actual type at a given call site can always be proven to be known at compile time, and specialize those call sites to dispatch statically. This carries the promise of making virtual calls as fast as static dispatch in some special cases.

While this definitely sounds interesting, at the time of writing this article devirtualization is still experimental (support in gcc started trickling in with version 4.7). In any case, the example examined in this article is probably simple enough to trigger the optimization, but as you can see it didn't happen, even though the -fdevirtualize flag should be on in gcc with optimization levels -O2 and -O3. It will be interesting to follow the development of this optimization and see what cases of virtual calls it can detect and optimize in the future.
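For illustration, here is the kind of call site one would expect devirtualization to handle - a hypothetical example of mine, not from the benchmark. The dynamic type of the object is provably known in the caller, so the compiler could, in principle, replace the indirect call with a direct (or inlined) one:

void known_type_call() {
  DynamicImplementation d;      // concrete type is known right here
  DynamicInterface* p = &d;     // the only possible dynamic type of *p is DynamicImplementation
  p->tick(42);                  // candidate for devirtualization to a direct call
}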