Benchmarking in Rust

16 June 2015

Recently on r/rust, I had an interesting discussion regarding some benchmarks where different map implementations led to wildly different results.

As performance-related programming is one of my main job-related activities, a look into Rust's facilities in that area would be useful both to me and others. So I set out to craft a benchmark that could lead to an even faster implementation.

Rust's libtest has a benchmark harness. So if a program gets compiled with `--test`, one can simply call it with `--bench` to run all contained benchmarks, where any `#[bench] fn ..(b: &mut test::Bencher)` function constitutes a benchmark.

Note: This needs a nightly Rust. If you intend to put benchmarks next to your crate’s code, I suggest putting them in the benches/ subfolder, so nightly Rust will pick it up while stable and/or beta will stay undisturbed.

So without further ado, here’s my first benchmark:

```rust
extern crate test;
extern crate rand;

use test::Bencher;
use rand::Rng;
use std::mem::replace;

#[bench]
fn empty(b: &mut Bencher) {
    b.iter(|| 1)
}

#[bench]
fn setup_random_hashmap(b: &mut Bencher) {
    let mut val: u32 = 0;
    let mut rng = rand::IsaacRng::new_unseeded();
    let mut map = std::collections::HashMap::new();
    b.iter(|| {
        map.insert(rng.gen::<u8>() as usize, val);
        val += 1;
    })
}

#[bench]
fn setup_random_vecmap(b: &mut Bencher) {
    let mut val: u32 = 0;
    let mut rng = rand::IsaacRng::new_unseeded();
    let mut map = std::collections::VecMap::new();
    b.iter(|| {
        map.insert(rng.gen::<u8>() as usize, val);
        val += 1;
    })
}

#[bench]
fn setup_random_vecmap_cap(b: &mut Bencher) {
    let mut val: u32 = 0;
    let mut rng = rand::IsaacRng::new_unseeded();
    let mut map = std::collections::VecMap::with_capacity(256);
    b.iter(|| {
        map.insert(rng.gen::<u8>() as usize, val);
        val += 1;
    })
}
```

The empty benchmark is there as a baseline. An anecdote: in my first compilation of the benchmark, I forgot to add -O to the rustc command line and wound up with a few ns/iter even on the empty benchmark. Thus, I now always keep an empty benchmark in my list, to make sure I'm measuring an optimized build.

One thing you may notice is that there is a lot of code duplication. Normally I try to avoid that, but in this simple case I'm OK with it.
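For completeness, here is one way the duplication could be factored out with a `macro_rules!` macro. This is a hypothetical sketch of mine, not code from the benchmark; the generated functions take an iteration count instead of a `Bencher` so the sketch compiles on stable Rust, and the RNG-driven keys are replaced by a deterministic stand-in.

```rust
use std::collections::HashMap;

// Hypothetical sketch: stamp out one setup function per map expression,
// so the shared insert loop lives in a single place.
macro_rules! setup_bench {
    ($name:ident, $map:expr) => {
        fn $name(iters: u32) -> usize {
            let mut map = $map;
            for val in 0..iters {
                // deterministic stand-in for the RNG-driven u8 keys
                map.insert((val % 256) as usize, val);
            }
            map.len()
        }
    };
}

setup_bench!(setup_hashmap, HashMap::new());
setup_bench!(setup_hashmap_cap, HashMap::with_capacity(256));

fn main() {
    // both variants touch at most 256 distinct keys
    println!("{} {}", setup_hashmap(1000), setup_hashmap_cap(1000)); // 256 256
}
```

Under libtest, the macro body would instead emit a `#[bench] fn $name(b: &mut Bencher)` with the insert inside `b.iter`. For this post, though, I'll stay with the explicit copies above.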

Running this results in the following:

```
rustc -C opt-level=3 bench.rs --test && ./bench --bench

running 7 tests
test empty                   ... bench:         0 ns/iter (+/- 0)
test setup_random_hashmap    ... bench:        47 ns/iter (+/- 4)
test setup_random_vecmap     ... bench:         9 ns/iter (+/- 0)
test setup_random_vecmap_cap ... bench:         9 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured
```

Here we see that VecMap really blows HashMap away. However, VecMap internally uses a `Vec<Option<T>>`, which I believe should be quite inefficient. Can we do better?
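The `Option` overhead is easy to quantify with `std::mem::size_of` (a quick check of my claim, not something from the benchmark itself):

```rust
use std::mem::size_of;

fn main() {
    // Each VecMap slot stores an Option<T>; for u32 that means a
    // discriminant plus alignment padding, doubling the per-slot footprint.
    println!("u32: {} bytes", size_of::<u32>());                 // 4
    println!("Option<u32>: {} bytes", size_of::<Option<u32>>()); // 8
}
```

(For types with a spare bit pattern, such as references, `Option` is free; `u32` has no such niche, hence the extra word per slot.)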

First I tried a DefaultVecMap, which gets one default value that takes the place of `Option::None`, so we can store the values directly without the overhead of `Option`. Note that this is not a complete implementation; since I only measure insert performance here, insert is all I implement:

```rust
#[derive(Clone)]
pub struct DefaultVecMap<V: PartialEq<V>> {
    default: V,
    v: Vec<V>,
}

impl<V: PartialEq<V> + Copy> DefaultVecMap<V> {
    pub fn new(default: V) -> DefaultVecMap<V> {
        DefaultVecMap { default: default, v: vec![] }
    }

    pub fn with_capacity(default: V, capacity: usize) -> DefaultVecMap<V> {
        let dc = default.clone();
        DefaultVecMap { default: default, v: vec![dc; capacity] }
    }

    pub fn insert(&mut self, key: usize, value: V) -> Option<V> {
        debug_assert!(value != self.default);
        let len = self.v.len();
        if len <= key {
            let default = self.default;
            // quadruple the target length until it exceeds `key`
            let mut l = if len == 0 { 1 } else { len };
            while l <= key && l > 0 {
                l = l << 2;
            }
            self.v.extend((0..l - len + 1).map(|_| default));
            self.v[key] = value;
            None
        } else {
            let old_value = replace(&mut self.v[key], value);
            if old_value == self.default {
                None
            } else {
                Some(old_value)
            }
        }
    }
}
```

This gets us:

```
test setup_default_vecmap     ... bench:         9 ns/iter (+/- 0)
test setup_default_vecmap_cap ... bench:         9 ns/iter (+/- 0)
```

This is equal to VecMap performance. Oh, well. Noting that the benchmark that started this endeavor only uses non-zero u32 values, I created the following:

```rust
pub struct ByteNonMaxU32Map {
    v: Vec<u32>,
}

impl ByteNonMaxU32Map {
    pub fn new() -> ByteNonMaxU32Map {
        ByteNonMaxU32Map { v: vec![] }
    }

    pub fn insert(&mut self, key: usize, value: u32) -> Option<u32> {
        let len = self.v.len();
        if len <= key {
            // fill the gap up to `key` with the sentinel value
            self.v.extend((0..key - len + 1).map(|_| std::u32::MAX));
            self.v[key] = value;
            None
        } else {
            let old_value = replace(&mut self.v[key], value);
            if old_value == std::u32::MAX {
                None
            } else {
                Some(old_value)
            }
        }
    }
}
```

which gives us a nice 33% speedup on insert:

test setup_nonzero_u32_map ... bench: 6 ns/iter (+/- 0)

Nice! Specialization for the win!
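As a sanity check on the insert semantics, here is a standalone usage sketch. The trimmed-down map copy and the demo values are mine, but the insert logic mirrors the sentinel-based map above: `u32::MAX` marks "empty", so no `Option<u32>` needs to be stored per slot.

```rust
use std::mem::replace;

// Trimmed-down copy of the sentinel-value map, for demonstration only.
pub struct NonMaxU32Map {
    v: Vec<u32>,
}

impl NonMaxU32Map {
    pub fn new() -> NonMaxU32Map {
        NonMaxU32Map { v: vec![] }
    }

    pub fn insert(&mut self, key: usize, value: u32) -> Option<u32> {
        let len = self.v.len();
        if len <= key {
            self.v.extend((0..key - len + 1).map(|_| std::u32::MAX));
            self.v[key] = value;
            None
        } else {
            let old = replace(&mut self.v[key], value);
            if old == std::u32::MAX { None } else { Some(old) }
        }
    }
}

fn main() {
    let mut map = NonMaxU32Map::new();
    assert_eq!(map.insert(3, 7), None);    // fresh slot: nothing replaced
    assert_eq!(map.insert(3, 9), Some(7)); // same slot: old value returned
    assert_eq!(map.insert(1, 1), None);    // gap slots were sentinel-filled
    println!("ok");
}
```

The obvious trade-off: `u32::MAX` itself can never be stored as a legal value, which is exactly the kind of constraint the data in the original benchmark satisfied.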

Bonus round: Just when I had finished writing this, along comes someone on the Rust users forum asking why while loops are faster than for-range loops, which sets off alarm bells in my head.

I’ll spare you the gory details, but the benchmark misused test::black_box to keep LLVM from inlining the whole range construction. Changing the benchmark to let LLVM do its thing (while being reasonably sure that it wasn’t optimizing away the code I wanted to measure) required adding another mutable variable.

With this, the for-loop even performed a little better than the while loop, because LLVM did actually unroll it a bit to make use of wider instructions. You can see the details on the users thread.
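For reference, the general pattern looks like the following sketch. It uses `std::hint::black_box`, the stable equivalent of the `test::black_box` discussed above; the `sum_to` function and its values are mine, for illustration. The point is to make the *inputs* and the *result* opaque to the optimizer while leaving the loop body itself untouched, so LLVM can still unroll and vectorize it.

```rust
use std::hint::black_box;

fn sum_to(n: u64) -> u64 {
    let mut total = 0;
    for i in 0..n {
        total += i;
    }
    total
}

fn main() {
    let n = black_box(1000u64); // opaque input: no constant folding of the sum
    let result = sum_to(n);
    black_box(result);          // opaque output: the loop must actually run
    println!("{}", result);     // 0 + 1 + ... + 999 = 499500
}
```

Wrapping the loop internals themselves in `black_box` (as the original benchmark did) instead pessimizes the code under test, which is how the while loop appeared to win.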