One of my current projects is a sum type implementation for the D programming language. It bills itself as having "zero overhead compared to hand-written C," and while there are sound theoretical reasons to think that's true, I'd never actually sat down and verified it empirically. Was I really getting the optimal performance I thought I was?

I was also curious to see how my efforts stacked up next to C++17's <variant> and D's Algebraic , especially after reading a few blog posts and reddit comments about the challenges faced by the C++ implementation. Could D, and my library, SumType , in particular, do better?

Methodology

To answer these questions, I designed a simple program and implemented it in C, in C++, and in D with both Algebraic and SumType , using the most natural and idomatic code for each language and library. My goal was to find out how well a production-grade, optimizing compiler would handle each sum type implementation in typical use. The compilers I chose were clang 8, clang++ 8, and ldc 1.17.0, all of which use LLVM 8 for their backends. All programs were compiled with optimization level -O2 .

Each test program does the following things:

Defines 10 empty struct types, named T0 through T9 .

types, named through . Defines a sum type with 10 members, one of each of those types.

Defines a function that takes an instance of that sum type as an argument, and returns a unique integer for each possible type the sum type can hold.

For illustration's sake, here's the C++ version, slightly abridged:

#include <variant> // Source: https://en.cppreference.com/w/cpp/utility/variant/visit template < class ... Ts > struct overloaded : Ts ... { using Ts :: operator ()...; }; template < class ... Ts > overloaded ( Ts ...) -> overloaded < Ts ... > ; struct T0 {}; // ... struct T9 {}; using example_variant = std :: variant < T0 , T1 , T2 , T3 , T4 , T5 , T6 , T7 , T8 , T9 > ; int do_visit ( example_variant v ) { return std :: visit ( overloaded { []( T0 val ) { return 3 ; }, []( T1 val ) { return 5 ; }, []( T2 val ) { return 8 ; }, // ... []( T9 val ) { return 233 ; }, }, v ); }

After compiling each program, I used objdump to disassemble them, and compared the generated assembly for each version of do_visit .

All of the code used, along with the commands used for compilation and disassembly, are available on Github in the repository pbackus/variant-comparison.

Results

C

The C version uses a union with an enum tag as its sum type implementation, and a switch statement for its "visit" function. It's the straightforward, obvious implementation you'd use if you were writing the code by hand, without worrying about making it generic, so it makes a good baseline to compare the other versions to.

00000000000000a0 <do_visit>: a0: cmp $0xa,%edi # check if type index is in-bounds a3: jae b0 a5: movslq %edi,%rax a8: mov 0x0(,%rax,4),%eax # get result from global array af: retq # Error path b0: push %rax b1: mov $0x0,%edi b6: mov $0x0,%esi bb: mov $0x0,%ecx c0: mov $0x45,%edx c5: callq ca # __assert_fail

As you might expect, the compiler is able to optimize the entire thing down into a single array lookup. It's so compact, in fact, that the code for calling libc 's assertion-failure function takes up more space than the actual logic.

C++

The C++ version uses std::variant and std::visit , with the overloaded helper template from cppreference.com to allow passing a set of labmda functions as a visitor. This is standard-library code, so we should expect it to be as well-optimized as the best and brightest C++ developers around can make it.

0000000000000000 <do_visit(std::__1::example_variant)>: 0: sub $0x18,%rsp 4: mov %rdi,%rax 7: mov %rdi,0x8(%rsp) c: shr $0x20,%rax 10: mov $0xffffffff,%ecx 15: cmp %rcx,%rax # check if type index is in-bounds 18: je 38 1a: mov %rsp,%rcx 1d: mov %rcx,0x10(%rsp) 22: lea 0x10(%rsp),%rdi 27: lea 0x8(%rsp),%rsi 2c: callq *0x0(,%rax,8) # call function pointer in global array 33: add $0x18,%rsp 37: retq # Error path 38: callq 3d # __throw_bad_variant_access

The main thing to notice here is that, unlike in the C version, the compiler is not able to inline the individual visitor functions. Instead, it generates an indirect call to a function pointer stored in a global array. (The individual lambda functions, of course, all consist of a single mov followed by a return.)

I'm not enough of a C++ expert to understand the implementation of std::visit , so I can only guess why this is the case. Maybe the optimizer just gives up after too many layers of template-metaprogramming gunk? Regardless, it's bad news for fans of zero-cost abstractions, since there's a clear overhead here compared to the C assembly.

[UPDATE: Reddit user matthieum suggests that the function pointers themselves are likely to blame.]

D, with Algebraic

This version uses the sum type implementation from D's standard library, Phobos. Rather than create a dedicated sum type, the Phobos developers opted to make Algebraic a thin wrapper around Variant , D's equivalent of C++17's std::any . That choice turns out to have far-reaching repercussions, as we're about to see.

0000000000000000 <int dalgebraic.do_visit(ExampleAlgebraic)>: 0: jmpq 5 # int std.variant.visitImpl!(true, ExampleAlgebraic, ...) 0000000000000000 <int std.variant.visitImpl!(true, ExampleAlgebraic, ...) 0: push %rbx 1: sub $0x10,%rsp 5: cmp 0x0(%rip),%rdi # check for uninitialized Variant c: je 217 # if so, go to error path 12: mov %rdi,%rbx # This part is repeated 10 times, once for each type 15: movq $0x0,0x8(%rsp) # prepare arguments for VariantN.handler 1e: lea 0x8(%rsp),%rdi 23: xor %esi,%esi 25: xor %edx,%edx 27: callq *%rbx # VariantN.handler: get TypeInfo for current type 29: mov 0x8(%rsp),%rsi 2e: mov 0x0(%rip),%rdi # get TypeInfo for T0 35: callq 3a # Object.opEquals: compare TypeInfos 3a: mov %eax,%ecx 3c: mov $0x3,%eax # load return value for T0 into %eax 41: test %cl,%cl # check if Object.opEquals returned true 43: jne 211 # if so, go to return ...: ... ... # After checking for T9 20f: je 283 # if none of the types matched, assert(false) 211: add $0x10,%rsp 215: pop %rbx 216: retq # Error path 217: mov 0x0(%rip),%rdi # get ClassInfo for VariantException 21e: callq 223 # _d_allocclass: allocate VariantException 223: mov %rax,%rbx 226: mov 0x0(%rip),%rax # initialize VariantException vtable 22d: mov %rax,(%rbx) 230: movq $0x0,0x8(%rbx) 238: mov 0x0(%rip),%rax # initialize VariantException 23f: movups 0x50(%rax),%xmm0 243: movups %xmm0,0x50(%rbx) 247: movups 0x10(%rax),%xmm0 24b: movups 0x20(%rax),%xmm1 24f: movups 0x30(%rax),%xmm2 253: movups 0x40(%rax),%xmm3 257: movups %xmm3,0x40(%rbx) 25b: movups %xmm2,0x30(%rbx) 25f: movups %xmm1,0x20(%rbx) 263: movups %xmm0,0x10(%rbx) 267: lea 0x0(%rip),%rdx # get message for VariantException 26e: mov $0x2f,%esi 273: mov %rbx,%rdi 276: callq 27b # VariantException.__ctor 27b: mov %rbx,%rdi 27e: callq 283 # _d_throw_exception 283: lea 0x0(%rip),%rsi # get error message for assert 28a: mov $0x4b,%edi 28f: mov $0xa1c,%edx 294: callq 299 # call D runtime assert function

The good part is that ldc is able to inline the lambdas into the body of visitImpl . The bad part is, well, everything else.

Because Algebraic is implemented using Variant , it has to rely on runtime type information ( TypeInfo ) to check the type of the current value. And because it uses TypeInfo , rather than an integer index like the C and C++ versions, these checks can't be condensed down into a single array lookup, or even a jump table. Finally, as if that wasn't enough, each TypeInfo comparison requires a virtual method call to Object.opEquals .

Beyond these obvious pessimizations, the other thing that stands out is the sheer volume of the generated code. It's an order of magnitude larger than both the C and C++ versions, with significant bloat in both the normal and error paths. Not only is this bad in isolation, since it puts unnecessary pressure on the instrution cache, it also means that potential optimization opportunities exposed by inlining the C and C++ versions of do_visit won't be available to D code using Algebraic .

D, with SumType

This version uses my own implementation of a sum type in D. SumType does not rely at all on runtime type information, and instead uses D's powerful compile-time reflection and metaprogramming capabilities to try to generate code as close to the C version as possible. Let's see how well it does.

0000000000000000 <int dsumtype.do_visit(ExampleSumType)>: 0: push %rax 1: movsbq 0x10(%rsp),%rax # get type index from stack 7: cmp $0xa,%rax # check if type index is in-bounds b: jae 19 d: lea 0x0(%rip),%rcx # load address of global array 14: mov (%rcx,%rax,4),%eax # get result from global array 17: pop %rcx 18: retq # Error path 19: lea 0x0(%rip),%rdx # get error message 20: mov $0x4ca,%edi 25: mov $0x9,%esi 2a: callq 2f # __switch_error

As it turns out, it's almost exactly the same as the C version: a single array lookup, plus some code for error handling. In fact, as far as I can tell, the only difference has to do with the way the function argument is passed in registers—the C version is able to avoid spilling anything to the stack, whereas this version has a push and a pop at the beginning and end, respectively.

Still, I think it's fair to say that this is a true zero-cost abstraction, and that the claim of "zero overhead compared to hand-written C" is justified.

[UPDATE: Reddit user ais52 explains that the purpose of the push / pop pair is to adjust the alignment of the stack pointer.]

Conclusions

If creating a zero-overhead generic sum type is so easy that even I can do it, why doesn't C++ have one? Are the libc++ developers a bunch of dummies?

No, of course not—but they are working with a significant handicap: C++.

The source code for variant is terrifyingly complex. In order to get the results shown above, the authors have had to use every template metaprogramming trick in the book, and then some. It's clearly the result of immense effort by some very skilled programmers.

By contrast, the source code of SumType is almost embarrassingly simple. The reason match , SumType 's equivalent of std::visit , is able to optimize down to the same code as a C switch statement is that it literally is a switch statement, with a static foreach loop inside to generate the cases.