Back in November 2018, following a thread on the std-proposals mailing list, I sat down and wrote out a really simple “wide integer” library.

“Wide integers” are integers of fixed-but-bigger-than-word-length size. For example, the __uint128_t supported by GCC and Clang would be considered a “wide” integer. Wide integers are not to be confused with “big integers,” i.e., integers of dynamic (heap-allocated) size.

I can also imagine “wide floating-point,” “wide fixed-point,” “wide rational” (that is, a number expressed and stored as the ratio of two wide integers), and “big rational.” I doubt that “big fixed-point” or “big floating-point” make philosophical sense.

My library provides a single class template: Wider<T> . The T can be either uint64_t , or some specialization of Wider . Each Wider doubles the bit-width of the unsigned integer represented; so we can say

using Uint128 = Wider<uint64_t>; using Uint256 = Wider<Uint128>; using Uint512 = Wider<Uint256>; using Uint1024 = Wider<Uint512>;

I looked at Clang’s codegen for __uint128_t to see what might be the most optimal codegen, and then tweaked my code to try to achieve the same codegen. It was actually pretty easy, on Clang/LLVM! On GCC, not so much.

The idea is that we can write plain old C++ code — maybe with a few intrinsics to help out at the lowest level — and then let the compiler do all the hard work of optimizing it. Here’s the code for wide-integer addition:

inline bool producecarry(uint64_t& x, uint64_t y) { x += y; return (x < y); } inline bool addcarry(bool cf, uint64_t& x, uint64_t y) { return _addcarry_u64(cf, x, y, (unsigned long long*)&x); } template<class Int64> struct Wider { Int64 lo; Int64 hi; friend bool producecarry(Wider& x, const Wider& y) { return addcarry(producecarry(x.lo, y.lo), x.hi, y.hi); } friend bool addcarry(bool cf, Wider& x, const Wider& y) { cf = addcarry(cf, x.lo, y.lo); cf = addcarry(cf, x.hi, y.hi); return cf; } friend Wider& operator+=(Wider& x, const Wider& y) { (void)producecarry(x, y); return x; } friend Wider operator+(Wider x, const Wider& y) { x += y; return x; } };

Compile this with Clang trunk and we see that the compiler produces the exact same code for

Uint128 example(const Uint128& x, const Uint128& y) { return x + y; } __uint128_t example(const __uint128_t& x, const __uint128_t& y) { return x + y; }

And that code is optimal:

movq (%rsi), %rax addq (%rdi), %rax movq 8(%rsi), %rdx adcq 8(%rdi), %rdx retq

GCC trunk, on the other hand, performs abysmally.

Anywhere there’s a gap between the performance of Wider<T> and the performance of a built-in type like __uint128_t , that’s an opportunity for some compiler writer to go improve the codegen. In this way, WideIntProofOfConcept is kind of like the “Stepanov Abstraction Penalty Benchmark”; it points to places where the compiler could do better at recognizing peephole-level idioms.

Back in 2018–2019 I filed a bunch of bugs against LLVM inspired by this benchmark: 39968, 40090, 24545, 31754, 40486, 40825. They were all fixed and closed quite expeditiously!

Bug 40908, an ICE on using __int128 as an NTTP, I actually discovered while writing “Is __int128 integral? A survey” (2019-02-28). It remains open.

To see the current “penalties” for using Wider<T> versus the built-in types, see the README on GitHub. There’s also a Python script you can download to regenerate the table using Godbolt’s API, if you want to try other (Godbolt-supported) compilers, or just to see whether my numbers are up-to-date.

Finally, thanks to Niall Douglas for (two years ago) motivating this whole thing by pointing out how absymally awful Boost.Multiprecision is! Here is a reduced version of Niall’s benchmark for a + b , showing boost::multiprecision::uint128_t alongside native unsigned __int128 ; Wider<uint64_t> ; and the new kid on the block, Abseil’s absl::uint128 .