I build and maintain smf - the fastest RPC in the west. The main language supported is C++. We use Seastar as the actual asynchronous framework.

The RPC protocol uses flatbuffers to make cross language support possible, in addition to its main goal of providing zero cost deserialization on the receiving side. Currently, Go and Java are in the process of being added to the repo, with partial support already merged.

On Friday, June 28 2018, I got a text from a friend that said: last time I checked flatbuffers was slow vs cap 'n proto . At first, I was suspicious, since I’ve never seen flatbuffers be in any top $> perf profile of any kind, on real production applications. However, I had no numbers or benchmarks to prove it.

I thought I had a quick-and-dirty hack to amortize the cost of object graphs - it never occurred to me to measure large type trees or large buffers.

BLUF - Bottom Line Up Front.

I had introduced a performance optimization that actually turned out to be bad for large buffers.

To be precise:

On small very buffers it was ~6% faster

On large buffers it was ~41% slower

nooooooooooooooooooooooooooo!

The beginning of yak-shaving

I found a project that had benchmarked cap'n proto vs flatbuffers . In particular, it measures buffer construction, since both cap’n proto and flatbuffers deserialization is effectively a pointer cast - Yay for little endian enforced encoding into an aligned byte array.

The cpp-serializers project measures cap’n proto vs flatbuffers encoding and to my surprise, flatbuffers measured at 2.5X slower.

Their test:

auto start = std :: chrono :: high_resolution_clock :: now (); for ( size_t i = 0 ; i < iterations ; i ++ ) { flatbuffers :: FlatBufferBuilder builder ; strings . clear (); for ( size_t i = 0 ; i < kStringsCount ; i ++ ) { strings . push_back ( builder . CreateString ( kStringValue )); } auto ids_vec = builder . CreateVector ( kIntegers ); auto strings_vec = builder . CreateVector ( strings ); auto r1 = CreateRecord ( builder , ids_vec , strings_vec ); builder . Finish ( r1 ); auto p = reinterpret_cast < char *> ( builder . GetBufferPointer ()); auto sz = builder . GetSize (); std :: vector < char > buf ( p , p + sz ); auto r2 = GetRecord ( buf . data ()); ( void ) r2 -> ids ()[ 0 ]; builder . ReleaseBufferPointer (); } auto finish = std :: chrono :: high_resolution_clock :: now (); auto duration = std :: chrono :: duration_cast < std :: chrono :: milliseconds > ( finish - start ). count ();

As you can see, the benchmark creates a flatbuffers builder (a kind of std::vector ). It stresses the builder by allocating an array of strings and ints… and that’s about it.

Internally, the flatbuffers::FlatbufferBuilder encodes from top-to-bottom (downward growth) and when it reaches the bottom, it reallocs and grows the size of the underlying array to fit the contents.

The actual code for reallocation is:

void reallocate ( size_t len ) { FLATBUFFERS_ASSERT ( allocator_ ); auto old_reserved = reserved_ ; auto old_size = size (); auto old_scratch_size = scratch_size (); reserved_ += ( std :: max )( len , old_reserved ? old_reserved / 2 : initial_size_ ); reserved_ = ( reserved_ + buffer_minalign_ - 1 ) & ~ ( buffer_minalign_ - 1 ); if ( buf_ ) { buf_ = allocator_ -> reallocate_downward ( buf_ , old_reserved , reserved_ , old_size , old_scratch_size ); } else { buf_ = allocator_ -> allocate ( reserved_ ); } cur_ = buf_ + reserved_ - old_size ; scratch_ = buf_ + old_scratch_size ; }

It grows by powers of 2 (reserved_ + buffer_minalign_ - 1) & ~(buffer_minalign_ - 1); , and usually by 1024 bytes which is the default initial_size_;

Setup

Let’s start with a Flatbuffers IDL of a simple key=value struct.

table kvpair { key : string ; value : string ; }

The C++ API has a nice object api on top of the raw flatbuffers::FlatbufferBuilder , so a quick GoogleBenchmark yields something like this:

static inline kvpairT gen_kv ( uint32_t sz ) { kvpairT ret ; ret . key . resize ( sz , 'x' ); ret . value . resize ( sz , 'y' ); return ret ; } static void BM_alloc_simple ( benchmark :: State & state ) { for ( auto _ : state ) { state . PauseTiming (); auto kv = gen_kv ( state . range ( 0 )); state . ResumeTiming (); // build it! flatbuffers :: FlatBufferBuilder bdr ; bdr . Finish ( kvpair :: Pack ( bdr , & kv , nullptr )); auto mem = bdr . Release (); auto ptr = reinterpret_cast < char *> ( mem . data ()); auto sz = mem . size (); auto buf = seastar :: temporary_buffer < char > ( ptr , sz , seastar :: make_object_deleter ( std :: move ( mem ))); benchmark :: DoNotOptimize ( buf ); } } BENCHMARK ( BM_alloc_simple ) -> Args ({ 1 << 1 , 1 << 1 }) -> Args ({ 1 << 2 , 1 << 2 }) -> Args ({ 1 << 4 , 1 << 4 }) -> Args ({ 1 << 8 , 1 << 8 }) -> Args ({ 1 << 12 , 1 << 12 }) -> Args ({ 1 << 16 , 1 << 16 }) -> Args ({ 1 << 18 , 1 << 18 });

A few things to note:

We use auto mem = bdr.Release(); and wrap that into a seastar::temporary_buffer<char> with effectively zero copy (+- some pointer assignments).

and wrap that into a with effectively zero copy (+- some pointer assignments). This is important because our entire messaging API is about bridging the code generation from flatbuffers and seastar.

In addition to this, our RPC mechanism codegen’s seastar <–> flatbuffers glue code.

--------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------- ... BM_alloc_simple/2/2/threads:1 586 ns 588 ns 1184218 BM_alloc_simple/4/4/threads:1 586 ns 589 ns 1181340 BM_alloc_simple/16/16/threads:1 598 ns 600 ns 1153577 BM_alloc_simple/256/256/threads:1 601 ns 602 ns 1147963 BM_alloc_simple/4096/4096/threads:1 824 ns 828 ns 837036 BM_alloc_simple/65536/65536/threads:1 6825 ns 6845 ns 101321 ...

To encode a 2 x 64KB buffers it takes 6.8 micros. Yikes! - But is this bad? (as of flatbuffers checkin 34cb163e389e928db08ed2bd0e16ee0ac53ab1ce ). Note this is only 3X the cost of std::malloc + std::memset , which is the fastest thing I can think of as a base comparison.

Before we dive into possible optimizations, let’s fix our desktop for these micro optimizations.

Fixing the environment

function cpu_disable_performance_cpupower_state (){ sudo cpupower frequency-set --governor powersave } function cpu_enable_performance_cpupower_state (){ sudo cpupower frequency-set --governor performance } function cpu_available_frequencies () { for i in /sys/devices/system/cpu/cpu[0-9] * ; do echo " $i :" echo " $i /cpufreq/scaling_min_freq: $( cat $i /cpufreq/scaling_min_freq ) " ; echo " $i /cpufreq/scaling_max_freq: $( cat $i /cpufreq/scaling_max_freq ) " ; done } function cpu_set_min_frequencies () { local freq = $1 ; if [[ $freq == "" ]] ; then exit 1 ; fi for i in /sys/devices/system/cpu/cpu[0-9] * ; do echo " $i :" echo " $i /cpufreq/scaling_min_freq: $( cat $i /cpufreq/scaling_min_freq ) " ; echo " $freq " | sudo tee " $i /cpufreq/scaling_min_freq" echo " $i /cpufreq/scaling_min_freq: $( cat $i /cpufreq/scaling_min_freq ) " ; done } function cpu_set_max_frequencies () { local freq = $1 ; if [[ $freq == "" ]] ; then exit 1 ; fi for i in /sys/devices/system/cpu/cpu[0-9] * ; do echo " $i :" echo " $i /cpufreq/scaling_max_freq: $( cat $i /cpufreq/scaling_max_freq ) " ; echo " $freq " | sudo tee " $i /cpufreq/scaling_max_freq" echo " $i /cpufreq/scaling_max_freq: $( cat $i /cpufreq/scaling_max_freq ) " ; done }

My CPU is a Intel(R) Xeon(R) CPU E3-1535M v6 @ 3.10GHz with a turbo of 4.20GHz. That’s great for desktop experience where interactivity matters and terrible for performance benchmarking.

When I first tried to show the results to my partner (non CS Major, but puts up w/ me asking her to stare at my screen) I had unpredictable results.

Armed w/ the shell funcs above, my checklist is now as follows:

Settings

Ensure that your BIOS says performance when connected to AC

Check your CPU freqencies via cat /proc/cpuinfo | grep model

Set your CPU governor via cpu_enable_performance_cpupower_state

Set the min frequency to the frequency reported by your CPU/model/vendor via: cpu_set_min_frequencies 3100000 in my case

in my case Set the max frequency to the frequency reported by your CPU/model/vendor via: cpu_set_max_frequencies 3100000 in my case

in my case Verify that you always build in Release mode

Verying the frequencies is now simple via cpu_available_frequencies

$> cpu_available_frequencies /sys/devices/system/cpu/cpu0: /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq: 3100000 /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: 3100000 /sys/devices/system/cpu/cpu1: /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq: 3100000 /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq: 3100000 /sys/devices/system/cpu/cpu2: /sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq: 3100000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq: 3100000 /sys/devices/system/cpu/cpu3: /sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq: 3100000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq: 3100000 /sys/devices/system/cpu/cpu4: /sys/devices/system/cpu/cpu4/cpufreq/scaling_min_freq: 3100000 /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq: 3100000 /sys/devices/system/cpu/cpu5: /sys/devices/system/cpu/cpu5/cpufreq/scaling_min_freq: 3100000 /sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq: 3100000 /sys/devices/system/cpu/cpu6: /sys/devices/system/cpu/cpu6/cpufreq/scaling_min_freq: 3100000 /sys/devices/system/cpu/cpu6/cpufreq/scaling_max_freq: 3100000 /sys/devices/system/cpu/cpu7: /sys/devices/system/cpu/cpu7/cpufreq/scaling_min_freq: 3100000 /sys/devices/system/cpu/cpu7/cpufreq/scaling_max_freq: 3100000

The take away is you don’t necessarily want speed, you want predictability.

Next - Compiler Explorer - my new fav tool

First, let’s execute the commands above and get the compiler flags that we’ll need for compiler explorer.

If you are using CMake, follow these screencasts and don’t forget to set set(CMAKE_EXPORT_COMPILE_COMMANDS 1) on your main CMakeLists.txt file





Now that we have the compiler flags ready, let’s fire up compiler explorer:

When you navigate to localhost:1024 you are welcomed with the usual and friendly CompilerExplorer UI.

The results

Running /home/agallego/workspace/smf/build/release/src/benchmarks/fbs_alloc/smf_fbsalloc_benchmark_test Run on ( 8 X 4200 MHz CPU s ) CPU Caches: L1 Data 32K ( x4 ) L1 Instruction 32K ( x4 ) L2 Unified 256K ( x4 ) L3 Unified 8192K ( x1 ) --------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------- BM_malloc_base/2/2/threads:1 11 ns 11 ns 62858390 BM_malloc_base/4/4/threads:1 11 ns 11 ns 61565906 BM_malloc_base/16/16/threads:1 11 ns 11 ns 64118519 BM_malloc_base/256/256/threads:1 20 ns 20 ns 35574778 BM_malloc_base/4096/4096/threads:1 98 ns 98 ns 7123958 BM_malloc_base/65536/65536/threads:1 2339 ns 2336 ns 281254 BM_malloc_base/262144/262144/threads:1 11648 ns 11635 ns 58503 BM_alloc_simple/2/2/threads:1 586 ns 588 ns 1184218 BM_alloc_simple/4/4/threads:1 586 ns 589 ns 1181340 BM_alloc_simple/16/16/threads:1 598 ns 600 ns 1153577 BM_alloc_simple/256/256/threads:1 601 ns 602 ns 1147963 BM_alloc_simple/4096/4096/threads:1 824 ns 828 ns 837036 BM_alloc_simple/65536/65536/threads:1 6825 ns 6845 ns 101321 BM_alloc_simple/262144/262144/threads:1 32838 ns 32784 ns 21351 BM_alloc_thread_local/2/2/threads:1 557 ns 559 ns 1220122 BM_alloc_thread_local/4/4/threads:1 557 ns 559 ns 1221065 BM_alloc_thread_local/16/16/threads:1 569 ns 571 ns 1198623 BM_alloc_thread_local/256/256/threads:1 578 ns 580 ns 1181208 BM_alloc_thread_local/4096/4096/threads:1 806 ns 819 ns 845698 BM_alloc_thread_local/65536/65536/threads:1 10182 ns 10237 ns 68206 BM_alloc_thread_local/262144/262144/threads:1 43066 ns 42998 ns 16279 BM_alloc_hybrid/2/2/threads:1 563 ns 565 ns 1212235 BM_alloc_hybrid/4/4/threads:1 563 ns 566 ns 1209207 BM_alloc_hybrid/16/16/threads:1 577 ns 579 ns 1185042 BM_alloc_hybrid/256/256/threads:1 587 ns 589 ns 1166182 BM_alloc_hybrid/4096/4096/threads:1 843 ns 847 ns 819194 BM_alloc_hybrid/65536/65536/threads:1 6810 ns 6832 ns 101497 BM_alloc_hybrid/262144/262144/threads:1 32838 ns 32774 ns 21368

As expected, for small buffers, there is a large cost vs the base of malloc+memset, and at the higher ends, the allocation & byte traversal start to dominate. The range is ~58x (worst) - ~2.8X (best).

Our benched code

#include <cstring> #include <memory> #include <thread> #include <benchmark/benchmark.h> #include <core/print.hh> #include <core/sstring.hh> #include <core/temporary_buffer.hh> #include "kv_generated.h" #include "smf/native_type_utils.h" static inline kvpairT gen_kv ( uint32_t sz ) { kvpairT ret ; ret . key . resize ( sz , 'x' ); ret . value . resize ( sz , 'y' ); return ret ; } static void BM_malloc_base ( benchmark :: State & state ) { for ( auto _ : state ) { auto buf = reinterpret_cast < char *> ( std :: malloc ( 2 * state . range ( 0 ))); std :: memset ( buf , 'x' , 2 * state . range ( 0 )); benchmark :: DoNotOptimize ( buf ); std :: free ( buf ); } } BENCHMARK ( BM_malloc_base ) -> Args ({ 1 << 1 , 1 << 1 }) -> Args ({ 1 << 2 , 1 << 2 }) -> Args ({ 1 << 4 , 1 << 4 }) -> Args ({ 1 << 8 , 1 << 8 }) -> Args ({ 1 << 12 , 1 << 12 }) -> Args ({ 1 << 16 , 1 << 16 }) -> Args ({ 1 << 18 , 1 << 18 }) -> Threads ( 1 ); static void BM_alloc_simple ( benchmark :: State & state ) { for ( auto _ : state ) { state . PauseTiming (); auto kv = gen_kv ( state . range ( 0 )); state . ResumeTiming (); // build it! flatbuffers :: FlatBufferBuilder bdr ; bdr . Finish ( kvpair :: Pack ( bdr , & kv , nullptr )); auto mem = bdr . Release (); auto ptr = reinterpret_cast < char *> ( mem . data ()); auto sz = mem . size (); auto buf = seastar :: temporary_buffer < char > ( ptr , sz , seastar :: make_object_deleter ( std :: move ( mem ))); benchmark :: DoNotOptimize ( buf ); } } BENCHMARK ( BM_alloc_simple ) -> Args ({ 1 << 1 , 1 << 1 }) -> Args ({ 1 << 2 , 1 << 2 }) -> Args ({ 1 << 4 , 1 << 4 }) -> Args ({ 1 << 8 , 1 << 8 }) -> Args ({ 1 << 12 , 1 << 12 }) -> Args ({ 1 << 16 , 1 << 16 }) -> Args ({ 1 << 18 , 1 << 18 }) -> Threads ( 1 ); static void BM_alloc_thread_local ( benchmark :: State & state ) { static thread_local flatbuffers :: FlatBufferBuilder bdr ( 1024 ); for ( auto _ : state ) { state . PauseTiming (); auto kv = gen_kv ( state . range ( 0 )); state . ResumeTiming (); // key operations bdr . Clear (); bdr . Finish ( kvpair :: Pack ( bdr , & kv , nullptr )); // ned key operations // std::cout << "Size to copy: " << bdr.GetSize() << std::endl; void * ret = nullptr ; auto r = posix_memalign ( & ret , 8 , bdr . GetSize ()); if ( r == ENOMEM ) { throw std :: bad_alloc (); } else if ( r == EINVAL ) { throw std :: runtime_error ( seastar :: sprint ( "Invalid alignment of %d; allocating %d bytes" , 8 , bdr . GetSize ())); } DLOG_THROW_IF ( r != 0 , "ERRNO: {}, Bad aligned allocation of {} with alignment: {}" , r , bdr . GetSize (), 8 ); std :: memcpy ( ret , reinterpret_cast < const char *> ( bdr . GetBufferPointer ()), bdr . GetSize ()); benchmark :: DoNotOptimize ( ret ); std :: free ( ret ); } } BENCHMARK ( BM_alloc_thread_local ) -> Args ({ 1 << 1 , 1 << 1 }) -> Args ({ 1 << 2 , 1 << 2 }) -> Args ({ 1 << 4 , 1 << 4 }) -> Args ({ 1 << 8 , 1 << 8 }) -> Args ({ 1 << 12 , 1 << 12 }) -> Args ({ 1 << 16 , 1 << 16 }) -> Args ({ 1 << 18 , 1 << 18 }) -> Threads ( 1 ); template < typename RootType > seastar :: temporary_buffer < char > hybrid ( const typename RootType :: NativeTableType & t ) { static thread_local auto builder = std :: make_unique < flatbuffers :: FlatBufferBuilder > (); auto & bdr = * builder . get (); bdr . Clear (); bdr . Finish ( RootType :: Pack ( bdr , & t , nullptr )); if ( SMF_UNLIKELY ( bdr . GetSize () > 2048 )) { auto mem = bdr . Release (); auto ptr = reinterpret_cast < char *> ( mem . data ()); auto sz = mem . size (); // fix the original builder builder = std :: make_unique < flatbuffers :: FlatBufferBuilder > (); return seastar :: temporary_buffer < char > ( ptr , sz , seastar :: make_object_deleter ( std :: move ( mem ))); } // always allocate to the largest member 8-bytes void * ret = nullptr ; auto r = posix_memalign ( & ret , 8 , bdr . GetSize ()); if ( r == ENOMEM ) { throw std :: bad_alloc (); } else if ( r == EINVAL ) { throw std :: runtime_error ( seastar :: sprint ( "Invalid alignment of %d; allocating %d bytes" , 8 , bdr . GetSize ())); } DLOG_THROW_IF ( r != 0 , "ERRNO: {}, Bad aligned allocation of {} with alignment: {}" , r , bdr . GetSize (), 8 ); seastar :: temporary_buffer < char > retval ( reinterpret_cast < char *> ( ret ), bdr . GetSize (), seastar :: make_free_deleter ( ret )); std :: memcpy ( retval . get_write (), reinterpret_cast < const char *> ( bdr . GetBufferPointer ()), retval . size ()); return std :: move ( retval ); } static void BM_alloc_hybrid ( benchmark :: State & state ) { for ( auto _ : state ) { state . PauseTiming (); auto kv = gen_kv ( state . range ( 0 )); state . ResumeTiming (); auto buf = hybrid < kvpair > ( kv ); benchmark :: DoNotOptimize ( buf ); } } BENCHMARK ( BM_alloc_hybrid ) -> Args ({ 1 << 1 , 1 << 1 }) -> Args ({ 1 << 2 , 1 << 2 }) -> Args ({ 1 << 4 , 1 << 4 }) -> Args ({ 1 << 8 , 1 << 8 }) -> Args ({ 1 << 12 , 1 << 12 }) -> Args ({ 1 << 16 , 1 << 16 }) -> Args ({ 1 << 18 , 1 << 18 }) -> Threads ( 1 ); BENCHMARK_MAIN ();

Lessons Learned: Benchmarking with turbo is dishonest.

Not that people writing OSS (or any software really) are out there to get you. It is just easy to forget to tune your machine specifically for benchmarking. There is no such thing as a quick and dirty benchmark especially if the results are not categorically different, i.e.: 1 minute vs 10 mins vs 1hr.

Let’s compare the stability of multiple runs with turbo vs without.

That’s a stddev difference of up to 768X !!!

No-turbo means precisely this:

cpu_available_frequencies /sys/devices/system/cpu/cpu0: /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq: 800000 /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: 4200000 /sys/devices/system/cpu/cpu1: /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq: 800000 /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq: 4200000 /sys/devices/system/cpu/cpu2: /sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq: 800000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq: 4200000 /sys/devices/system/cpu/cpu3: /sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq: 800000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq: 4200000 /sys/devices/system/cpu/cpu4: /sys/devices/system/cpu/cpu4/cpufreq/scaling_min_freq: 800000 /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq: 4200000 /sys/devices/system/cpu/cpu5: /sys/devices/system/cpu/cpu5/cpufreq/scaling_min_freq: 800000 /sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq: 4200000 /sys/devices/system/cpu/cpu6: /sys/devices/system/cpu/cpu6/cpufreq/scaling_min_freq: 800000 /sys/devices/system/cpu/cpu6/cpufreq/scaling_max_freq: 4200000 /sys/devices/system/cpu/cpu7: /sys/devices/system/cpu/cpu7/cpufreq/scaling_min_freq: 800000 /sys/devices/system/cpu/cpu7/cpufreq/scaling_max_freq: 4200000

SMF

smf just got 40% faster for large buffers thanks to these benchmarks, if you give it a shot let me know. Stay tuned for the Java and Go code generators and performance benchmarks coming soon.





Let me know if you found this useful, any missinformation, or additional performance tunning on twitter @emaxerrno.







Special thanks to my partner Sarah Rohrbach as well as Chris Heller, Noah Watkins for reading earlier drafts of this post.