Introduction

In 2015, I wrote an article titled Variant Visitation which described an implementation strategy for std::visit . The approach involved a matrix of function pointers, and many have raised concerns regarding poor code-gen caused by optimization limitations of function pointers on some compilers.

Here are a few that I remember:

Anyway, it’s been on my list of things to do for a while. The motivation for action didn’t come however until a few months ago when Nir Friedman (quicknir) filed an issue on mpark/variant and generously offered to help with the effort!

This post describes the switch -based approach implemented in mpark/variant, and its benchmark results.

Variant

Let us get started with a minimal variant interface. This will give us a frame of reference throughout the post. We only need the following:

v.index() : Returns the index of the currently stored value.

: Returns the of the currently stored value. std::get<I>(v) : Returns the currently stored value if v.index() == I else throws std::bad_variant_access .

Visitation

Here is the signature of visit :

template < typename F , typename ... Vs > constexpr decltype ( auto ) visit ( F && f , Vs && ... vs );

It performs f(std::get<Is>(vs)...) where Is... are the compile-time values of vs.index()... .

Implementation

For single visitation in Variant Visitation, we generate a 1-dimensional array of function pointers where the function pointer at I performs f(std::get<I>(v)) . We can then invoke the function pointer at v.index() .

In this implementation, we want to switch on v.index() instead and invoke f(std::get<I>(v)) directly rather than through a function pointer. This should facilitate compilers that have trouble inlining function pointers.

Here’s a limited but simple example on GCC 8.2:

Single Visitation: Exactly 4 Alternatives

Let’s start with the following snippet:

template < typename F , typename V > constexpr decltype ( auto ) visit ( F && f , V && v ) { switch ( v . index ()) { case 0 : { return std :: invoke ( std :: forward < F > ( f ), std :: get < 0 > ( std :: forward < V > ( v ))); } case 1 : { return std :: invoke ( std :: forward < F > ( f ), std :: get < 1 > ( std :: forward < V > ( v ))); } case 2 : { return std :: invoke ( std :: forward < F > ( f ), std :: get < 2 > ( std :: forward < V > ( v ))); } case 3 : { return std :: invoke ( std :: forward < F > ( f ), std :: get < 3 > ( std :: forward < V > ( v ))); } } }

As discussed, we switch on v.index() and perform f(std::get<I>(v)) . This however only works for variant s with exactly 4 alternatives 😅.

Single Visitation: Up to 4 Alternatives

To do this, we need to detect and disable the invalid branches. We’ll introduce a struct dispatcher templated on bool IsValid , where the false specialization contains the invalid code path declared __builtin_unreachable(); so that compilers know they can safely optimize it out, and the true specialization performs the valid code paths.

template < bool IsValid , typename R > struct dispatcher ; template < typename R > struct dispatcher < false , R > { template < std :: size_t I , typename F , typename V > static R case_ ( F && , V && ) { __builtin_unreachable (); } }; template < typename R > struct dispatcher < true , R > { template < std :: size_t I , typename F , typename V > static R case_ ( F && f , V && v ) { using Expected = R ; using Actual = decltype ( std :: invoke ( std :: forward < F > ( f ), std :: get < I > ( std :: forward < V > ( v )))); static_assert ( std :: is_same_v < Expected , Actual > , "`mpark::visit` requires the visitor to have a single return type" ); return std :: invoke ( std :: forward < F > ( f ), std :: get < I > ( std :: forward < V > ( v ))); } };

In visit , we call this conditional case_ function by checking whether the case index is within std::variant_size_v . If the case is valid, we invoke f with std::get<I>(v) as before. Otherwise it’s unreachable.

template <typename F, typename V> constexpr decltype(auto) visit(F &&f, V &&v) { + using R = decltype( + std::invoke(std::forward<F>(f), std::get<0>(std::forward<V>(v)))); + constexpr std::size_t size = std::variant_size_v<std::remove_cvref_t<V>>; switch (v.index()) { case 0: { - return std::invoke(std::forward<F>(f), std::get<0>(std::forward<V>(v))); + return dispatcher<0 < size, R>::template case_<0>( + std::forward<F>(f), std::forward<V>(v)); } case 1: { - return std::invoke(std::forward<F>(f), std::get<1>(std::forward<V>(v))); + return dispatcher<1 < size, R>::template case_<1>( + std::forward<F>(f), std::forward<V>(v)); } case 2: { - return std::invoke(std::forward<F>(f), std::get<2>(std::forward<V>(v))); + return dispatcher<2 < size, R>::template case_<2>( + std::forward<F>(f), std::forward<V>(v)); } case 3: { - return std::invoke(std::forward<F>(f), std::get<3>(std::forward<V>(v))); + return dispatcher<3 < size, R>::template case_<3>( + std::forward<F>(f), std::forward<V>(v)); } } }

At this point, we can visit variant s with upto 4 alternatives. Next let’s enable visiting variant s with arbitrary number of alternatives.

Single Visitation: Arbitrary # of Alternatives

In the above example, we only handled cases 0 through 3. Since there is no such thing as a variadic switch statement, we need to instead use recursion to handle the next range of values 4 through 7, and continue on until we’ve covered all of the valid cases.

We’ll do this by modifying the switch statement to handle the same range of values but adjusted by a base number. That is, given a base B , each recursive call will handle B+0 , B+1 , B+2 , B+3 . The initial value for B will be 0 , and each recursive call will increment B by 4 . We can continue while B is within a value range, i.e., B is within std::variant_size_v .

That means we have another pair of valid/invalid code paths to consider. Let’s call it switch_ , to complement the case_ we already have:

template <bool IsValid, typename R> struct dispatcher; template <typename R> struct dispatcher<false, R> { template <std::size_t I, typename F, typename V> static constexpr R case_(F &&, V &&) { __builtin_unreachable(); } + template <std::size_t B, typename F, typename V> + static constexpr R switch_(F &&, V &&) { __builtin_unreachable(); } }; template <typename R> struct dispatcher<true, R> { template <std::size_t I, typename F, typename V> static constexpr R case_(F &&f, V &&v) { using Expected = R; using Actual = decltype( std::invoke(std::forward<F>(f), std::get<I>(std::forward<V>(v)))); static_assert( std::is_same_v<Expected, Actual>, "`mpark::visit` requires the visitor to have a single return type"); return std::invoke(std::forward<F>(f), std::get<I>(std::forward<V>(v))); } + template <std::size_t B, typename F, typename V> + static constexpr R switch_(F &&f, V &&v) { - using R = decltype( - std::invoke(std::forward<F>(f), std::get<0>(std::forward<V>(v)))); constexpr std::size_t size = std::variant_size_v<std::remove_cvref_t<V>>; switch (v.index()) { case B+0: { - return dispatcher<0 < size, R>::template case_<0>( + return dispatcher<B+0 < size, R>::template case_<B+0>( std::forward<F>(f), std::forward<V>(v)); } case B+1: { - return dispatcher<1 < size, R>::template case_<1>( + return dispatcher<B+1 < size, R>::template case_<B+1>( std::forward<F>(f), std::forward<V>(v)); } case B+2: { - return dispatcher<2 < size, R>::template case_<2>( + return dispatcher<B+2 < size, R>::template case_<B+2>( std::forward<F>(f), std::forward<V>(v)); } case B+3: { - return dispatcher<3 < size, R>::template case_<3>( + return dispatcher<B+3 < size, R>::template case_<B+3>( std::forward<F>(f), std::forward<V>(v)); } + default: { + return dispatcher<B+4 < size, R>::template switch_<B+4>( + std::forward<F>(f), std::forward<V>(v)); + } } + } };

With this mechanism in place, visit simply bootstraps the recursion.

template <typename F, typename V> constexpr decltype(auto) visit(F &&f, V &&v) { using R = decltype( std::invoke(std::forward<F>(f), std::get<0>(std::forward<V>(v)))); + return dispatcher<true, R>::template switch_<0>(std::forward<F>(v), + std::forward<V>(v)); }

Great! We’ve got single visitation supporting arbitrary number of alternatives. While I showed each recursive level handling 4 cases, the actual implementation handles 32 cases at a time.

Multi-visitation

Even though multi-visitation is currently implemented with a generalized version of this technique, I’ll leave the details for another post since there’s more work to be done. Specifically, I want to investigate some compile-time improvements and add some benchmarks for runtime performance.

Benchmark

The following benchmarking code is what the measurements are based on, and is inspired by the one that Björn Fahller presented in his blog post: When performance guarantees hurts performance - std::visit.

The benchmark creates a vector of approximately 5000 variant s with N alternatives, and sums up the compile-time indices that are assigned to each variant. The idea being that if the visitor is inlined for the switch and out-of-line for jump table, switch should perform better.

#include <benchmark/benchmark.h> #include <cstddef> #include <type_traits> #include <vector> #include <mpark/variant.hpp> template < std :: size_t I > using size_constant = std :: integral_constant < std :: size_t , I > ; template < std :: size_t ... Is > auto make_variants ( lib :: index_sequence < Is ... > ) { std :: vector < mpark :: variant < size_constant < Is > ... >> vs ; for ( std :: size_t i = 0 ; i < 5000 / sizeof ...( Is ); ++ i ) { int dummy [] = {( vs . push_back ( size_constant < Is > {}), 0 )...}; ( void ) dummy ; } return vs ; } struct visitor { template < std :: size_t I > constexpr std :: size_t operator ()( size_constant < I > ) const { return I ; } }; template < std :: size_t N > static void visit1 ( benchmark :: State & state ) { auto vs = make_variants ( lib :: make_index_sequence < N > {}); for ( auto _ : state ) { int sum = 0 ; for ( const auto & v : vs ) { sum += mpark :: visit ( visitor {}, v ); } benchmark :: DoNotOptimize ( sum ); } } BENCHMARK_TEMPLATE ( visit1 , 1 ); BENCHMARK_TEMPLATE ( visit1 , 2 ); BENCHMARK_TEMPLATE ( visit1 , 3 ); BENCHMARK_TEMPLATE ( visit1 , 5 ); // ... BENCHMARK_TEMPLATE ( visit1 , 200 );

This benchmark was only tested with GCC and Clang. GCC 4.8/4.9/5/6/7/8 and Clang 3.8/3.9/4.0/5.0/6.0/7 under each of C++11/14/17, as supported.

The following is the result for Clang 7 under C++17:

Jump Table

--------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------- visit1<1> 2847 ns 2847 ns 244887 visit1<2> 11459 ns 11455 ns 61324 visit1<3> 10127 ns 10125 ns 68804 visit1<5> 10607 ns 10604 ns 66159 visit1<8> 11363 ns 11360 ns 61668 visit1<15> 12432 ns 12431 ns 56413 visit1<32> 10590 ns 10590 ns 64951 visit1<65> 10595 ns 10593 ns 66403 visit1<128> 21990 ns 21985 ns 31979 visit1<200> 27180 ns 27174 ns 25770

Recursive switch

--------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------- visit1<1> 2846 ns 2845 ns 246770 visit1<2> 3405 ns 3404 ns 205871 visit1<3> 2934 ns 2933 ns 239269 visit1<5> 3815 ns 3814 ns 183281 visit1<8> 2935 ns 2935 ns 238622 visit1<15> 2950 ns 2950 ns 238193 visit1<32> 2927 ns 2927 ns 239187 visit1<65> 2954 ns 2954 ns 233865 visit1<128> 9570 ns 9570 ns 73112 visit1<200> 9635 ns 9635 ns 73103

The following is the result for GCC 8 under C++17:

Jump Table

--------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------- visit1<1> 2811 ns 2811 ns 248491 visit1<2> 9494 ns 9493 ns 73727 visit1<3> 11461 ns 11460 ns 60527 visit1<5> 11487 ns 11486 ns 60689 visit1<8> 11571 ns 11571 ns 61453 visit1<15> 11592 ns 11592 ns 60979 visit1<32> 12553 ns 12553 ns 56333 visit1<65> 10630 ns 10629 ns 65991 visit1<128> 17035 ns 17034 ns 40331 visit1<200> 18484 ns 18484 ns 37944

Recursive switch

--------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------- visit1<1> 2828 ns 2828 ns 247810 visit1<2> 4203 ns 4202 ns 167167 visit1<3> 2904 ns 2904 ns 241905 visit1<5> 2858 ns 2858 ns 245063 visit1<8> 3830 ns 3830 ns 182081 visit1<15> 2854 ns 2854 ns 245068 visit1<32> 3826 ns 3826 ns 181831 visit1<65> 6920 ns 6920 ns 102445 visit1<128> 12084 ns 12084 ns 57905 visit1<200> 14330 ns 14330 ns 48923

The full matrix of the benchmark results is captured in Travis CI.

Conclusion

Clang/GCC both benefit from roughly a 3x improvement in performance from the switch -based implementation strategy. Please reach out with your ideas for further improvements to the library! Thanks to Nir Friedman (quicknir) for contributions / discussions over the past few months in making this happen.

Thanks to Agustín “K-ballo” Bergé and Barry Revzin for reviewing this post!