Improving SmallVec's speed by 60% and why that shouldn't matter to you

smallvec is a library by the Servo team for reducing the number of heap allocations made for dynamic arrays, in the case where most of those arrays are below a certain size. For example, if you’re creating a vector that normally has fewer than 100 items but could have more, you can reduce the load on the allocator by using a SmallVec like so:

```rust
fn print_iterator<T>(i: T)
where
    T: Iterator,
    T::Item: Debug,
{
    // Before...
    let vec: Vec<T::Item> = i.collect();

    // After
    let vec: SmallVec<[T::Item; 100]> = i.collect();

    println!("{:?}", vec);
}
```

The implementation is conceptually very simple, essentially looking like the following:

```rust
// Implemented for `[T; 1]`, `[T; 2]`, etc.
trait Array {
    type Item;
    fn ptr(&self) -> *const Self::Item;
    fn ptr_mut(&mut self) -> *mut Self::Item;
    fn size() -> usize;
}

enum SmallVec<T: Array> {
    Inline {
        len: usize,
        buffer: T,
    },
    Heap(Vec<T::Item>),
}
```

The real implementation is more complex for optimisation purposes, but it’s essentially what you see above. While the elements fit in the inline buffer (at most T::size() of them) they are stored inline, on the stack; if the vector grows beyond that, the elements are copied onto the heap, incurring the cost of an allocation plus a copy of the existing elements. The main benefit comes not from avoiding allocation, though, but from better cache efficiency when you store SmallVec values in an array.
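To make the spill behaviour concrete, here’s a minimal sketch using the published smallvec crate (the spilled method, which reports whether the elements have moved to the heap, is part of its public API):

```rust
use smallvec::SmallVec;

fn main() {
    // Inline storage for up to 4 elements.
    let mut v: SmallVec<[u32; 4]> = SmallVec::new();

    v.extend(0..4);
    assert!(!v.spilled()); // still inline, no allocation yet

    v.push(4); // a fifth element forces the spill to the heap
    assert!(v.spilled()); // now backed by a heap buffer
}
```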

Because malloc is fast, in many cases it’s actually slower to use SmallVec than plain Vec: the one-time cost of the initial allocation is dwarfed by the lifetime cost of SmallVec 's increased complexity. You can see that switching to Vec actually improves speed on many of SmallVec 's own benchmarks:

| name | smallvec.bench ns/iter | vec.bench ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| extend_from_slice | 52 | 65 | 13 | 25.00% | x 0.80 |
| extend_from_slice_small | 23 | 29 | 6 | 26.09% | x 0.79 |
| extend | 200 | 69 | -131 | -65.50% | x 2.90 |
| extend_small | 39 | 28 | -11 | -28.21% | x 1.39 |
| from_slice | 199 | 36 | -163 | -81.91% | x 5.53 |
| from_slice_small | 38 | 30 | -8 | -21.05% | x 1.27 |
| insert | 1,222 | 1,231 | 9 | 0.74% | x 0.99 |
| insert_small | 121 | 114 | -7 | -5.79% | x 1.06 |
| macro_from_elem | 218 | 40 | -178 | -81.65% | x 5.45 |
| macro_from_elem_small | 60 | 32 | -28 | -46.67% | x 1.88 |
| push | 357 | 369 | 12 | 3.36% | x 0.97 |
| push_small | 57 | 53 | -4 | -7.02% | x 1.08 |
| macro_from_list | 47 | 33 | -14 | -29.79% | x 1.42 |
| pushpop | 306 | 253 | -53 | -17.32% | x 1.21 |

Vec is only slower for one method - extend_from_slice . Weirdly, SmallVec wins there even in the case where extend_from_slice must allocate. From a cursory glance at the source code for std, it looks like SmallVec might get its edge from inlining, since Vec::extend_from_slice isn’t marked #[inline] .

Its advantage on extend_from_slice aside, SmallVec is still a tough sell: even in the cases where it stays on the stack - the cases where it’s supposed to have an edge - it’s still slower.

Wait, though, isn’t the only difference between the two the fact that SmallVec has to check whether it’s inline or heap-allocated? Why is from_elem (which tests the vec![val; N] macro and the equivalent smallvec![val; N] macro) more than 5 times slower for SmallVec , and why is extend almost 3 times as slow - slower even when it’s on the stack? It turns out that this library had some serious inefficiencies that are not inherent to the design. Let’s check out what SmallVec::from_elem and SmallVec::extend look like:

```rust
impl<A: Array> SmallVec<A> {
    pub fn from_elem(elem: A::Item, n: usize) -> Self {
        let mut v = SmallVec::with_capacity(n);
        v.insert_many(0, (0..n).map(|_| elem.clone()));
        v
    }

    // ...
}

impl<A: Array> Extend<A::Item> for SmallVec<A> {
    fn extend<I>(&mut self, iterable: I)
    where
        I: IntoIterator<Item = A::Item>,
    {
        let iter = iterable.into_iter();
        let (lower_size_bound, _) = iter.size_hint();

        let target_len = self.len + lower_size_bound;

        if target_len > self.capacity() {
            self.grow(target_len);
        }

        for elem in iter {
            self.push(elem);
        }
    }
}
```

Both seem pretty clear. For extend you reserve space for the expected number of elements and then loop over the iterator, pushing each element onto the SmallVec . For from_elem you create a SmallVec preallocated with the number of elements you need and then insert n clones of elem . Hang on, though - what does the source code of insert_many look like?

```rust
impl<A: Array> SmallVec<A> {
    // ...

    pub fn insert_many<I: IntoIterator<Item = A::Item>>(
        &mut self,
        index: usize,
        iterable: I,
    ) {
        let iter = iterable.into_iter();
        let (lower_size_bound, _) = iter.size_hint();
        assert!(lower_size_bound <= std::isize::MAX as usize); // Ensure offset is indexable
        assert!(index + lower_size_bound >= index); // Protect against overflow
        self.reserve(lower_size_bound);

        unsafe {
            let old_len = self.len;
            assert!(index <= old_len);
            let ptr = self.as_mut_ptr().offset(index as isize);
            ptr::copy(ptr, ptr.offset(lower_size_bound as isize), old_len - index);

            for (off, element) in iter.enumerate() {
                if off < lower_size_bound {
                    ptr::write(ptr.offset(off as isize), element);
                    self.len = self.len + 1;
                } else {
                    // Iterator provided more elements than the hint.
                    assert!(index + off >= index); // Protect against overflow.
                    self.insert(index + off, element);
                }
            }

            let num_added = self.len - old_len;
            if num_added < lower_size_bound {
                // Iterator provided fewer elements than the hint
                ptr::copy(
                    ptr.offset(lower_size_bound as isize),
                    ptr.offset(num_added as isize),
                    old_len - index,
                );
            }
        }
    }
}
```

Gah! That’s a lot of code. Let’s go through this bit-by-bit (I’ve removed assertions to make the logic clearer):

```rust
let (lower_size_bound, _) = iter.size_hint();
self.reserve(lower_size_bound);
```

This reserves enough space for the iterator’s self-reported lower bound on the number of elements it will produce. Hang on, though: when we come here from from_elem we already know we have enough space, since we just created the vector with with_capacity and passed an iterator of an exactly-known size. So this is useless - we don’t need to reserve any more space.

```rust
let old_len = self.len;
let ptr = self.as_mut_ptr().offset(index as isize);
ptr::copy(ptr, ptr.offset(lower_size_bound as isize), old_len - index);
```

This optimistically copies any existing elements towards the end of the array. It’s there because this function lets you insert the new elements at any point - not just at the end - so any existing elements after the insertion point have to be moved out of the way first. Again, these lines are useless for us: when we’re calling this from from_elem , we know that self.len and index are both 0, so this will always copy 0 bytes (a no-op). We still waste cycles figuring that out, though.
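For the general case that copy genuinely matters. Here’s a small illustration (my own example, using the crate’s public insert_many ) of an insertion in the middle, where the existing tail has to be shifted up first:

```rust
use smallvec::SmallVec;

fn main() {
    let mut v: SmallVec<[u32; 8]> = SmallVec::from_slice(&[1, 2, 5]);

    // The `5` is first shifted up by two slots (the ptr::copy above),
    // then 3 and 4 are written into the gap.
    v.insert_many(2, [3, 4].iter().cloned());

    assert_eq!(&v[..], &[1, 2, 3, 4, 5]);
}
```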

```rust
for (off, element) in iter.enumerate() {
    if off < lower_size_bound {
        ptr::write(ptr.offset(off as isize), element);
        self.len = self.len + 1;
    } else {
        self.insert(index + off, element);
    }
}
```

For every element we do a branch. This means that the compiler can’t optimise this into a memcpy (the fastest possible method of copying the bytes) because for each element we need to check whether or not we’ve reached the lower size bound. Technically the compiler could optimise this by turning it into two loops (known as loop fission) and then noticing that the first loop is equivalent to a memcpy , but relying on it to do so is inadvisable because it requires multiple optimisation passes to be applied in the correct order. Again, when we’re calling this function from from_elem we know that this can only ever go through the first branch and never the second, so this prevents optimisations and wastes time checking off < lower_size_bound for no benefit.
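To make the loop-fission point concrete, here’s a hand-fissioned sketch of that loop (my own illustration, not code from either library): the first loop trusts the hint and contains no branch, so it’s a straightforward candidate for vectorisation; the rare past-the-hint case moves to a second loop.

```rust
// Fast path: no branch in the loop body, just straight writes.
let mut iter = iter.enumerate();
for (off, element) in iter.by_ref().take(lower_size_bound) {
    ptr::write(ptr.offset(off as isize), element);
    self.len += 1;
}

// Slow path: only reached if the iterator yields more elements
// than its reported lower bound.
for (off, element) in iter {
    self.insert(index + off, element);
}
```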

```rust
let num_added = self.len - old_len;
if num_added < lower_size_bound {
    // Iterator provided fewer elements than the hint
    ptr::copy(
        ptr.offset(lower_size_bound as isize),
        ptr.offset(num_added as isize),
        old_len - index,
    );
}
```

This is another branch that’s always skipped when we come from from_elem , since we know the iterator produces exactly the right number of elements every time. More wasted cycles.
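Why does this branch exist at all? Because size_hint is only a hint: unsafe code like this must stay sound even if an iterator misreports its bounds. A contrived example (my own illustration) of the kind of iterator this branch defends against:

```rust
// Claims a lower bound of 100 elements but only ever yields 3.
// A size_hint like this violates the trait's documented protocol,
// but it must not be able to cause memory unsafety.
struct Liar(u32);

impl Iterator for Liar {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.0 == 0 {
            None
        } else {
            self.0 -= 1;
            Some(self.0)
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        (100, None)
    }
}
```

Feed Liar(3) to insert_many and num_added ends up well below lower_size_bound , so the final ptr::copy shuffles the tail back down to close the unused gap.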

I’m not going to go line-by-line into precisely how Vec does it, but you can check out the code here. The important thing is that for many types it’s about as fast as is reasonably possible. A lot of work has been put into optimising this and it shows. There is even a special method that uses the operating system’s ability to return zeroed memory “magically” if the element you pass to the function is equal to 0 (i.e. Vec requests a pointer to zeroed memory and the OS returns a pointer to memory that it knows should be zeroed, but doesn’t actually zero it until you try to access it). It only increases the vector’s length once, after all elements have been added. This is much better than our method of increasing the vector’s length by 1 each iteration of the loop because you only have to do one addition. The bigger the Vec , the bigger difference this will make. The biggest benefit is that it uses specialisation to use memcpy for any element that implements Copy . This means that when possible it will copy using SIMD.
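The zeroed-memory trick is worth spelling out. Here’s a simplified sketch of the idea (my own illustration using today’s std::alloc API, not the actual std code path):

```rust
use std::alloc::{alloc_zeroed, Layout};

// Roughly what `vec![0u8; n]` can do instead of writing n zeroes:
// ask the allocator for memory that is already zeroed. For large
// requests this bottoms out in pages that the OS lazily zeroes on
// first access, so nothing is actually written up front.
fn zeroed_bytes(n: usize) -> Vec<u8> {
    if n == 0 {
        return Vec::new();
    }
    unsafe {
        let layout = Layout::array::<u8>(n).unwrap();
        let ptr = alloc_zeroed(layout);
        assert!(!ptr.is_null(), "allocation failed");
        Vec::from_raw_parts(ptr, n, n)
    }
}
```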

The optimisations that require specialisation are impossible to implement in SmallVec without a nightly compiler, but it turns out that this doesn’t matter. You can simply reuse Vec::from_elem (which is what vec![elem; n] desugars to) when you know the data will end up on the heap, and do a simple loop-and-write when you know it will stay on the stack. Here’s what that looks like:

```rust
pub fn from_elem(elem: A::Item, n: usize) -> Self {
    if n > A::size() {
        vec![elem; n].into()
    } else {
        unsafe {
            let mut arr: A = ::std::mem::uninitialized();
            let ptr = arr.ptr_mut();

            for i in 0..n as isize {
                ::std::ptr::write(ptr.offset(i), elem.clone());
            }

            SmallVec {
                data: Inline { array: arr },
                len: n,
            }
        }
    }
}
```

We don’t need the OS’s magical ability to conjure up zeroed memory when we’re building things on the stack, since that trick works at the granularity of a page - 4KiB by default on Linux - and if you’re creating an inline array that large then the cost of allocation is dwarfed by the cost of initialisation anyway. LLVM can trivially optimise the for loop into a memcpy , and even a memset when A::Item = u8 (the comment at the header of the file lists this optimisation as a TODO, but if you look at the code you can see that it’s already been implemented). We get all the benefits of Vec 's complex implementation, with the only cost being an n > A::size() check - which is especially cheap because A::size() is a constant. You can see the generated assembly here. I haven’t annotated it, given the sheer amount of code the old version generated, but the difference in the amount of code alone should make it clear that this method is significantly more efficient.

After this, we see a huge improvement:

| name | smallvec.bench ns/iter | vec.bench ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| macro_from_elem | 66 | 50 | -16 | -24.24% | x 1.32 |

Vec is still faster, but now the difference is only 16ns instead of 178ns. I do wonder if there’s some way to speed up the conversion from Vec to SmallVec , because that’s the only place the remaining difference could be coming from, and it should be essentially instant.
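For what it’s worth, in terms of the simplified enum from the start of this post, that conversion should be nothing more than moving the Vec 's (pointer, length, capacity) triple into the heap variant - something like this sketch (not the library’s actual code):

```rust
impl<A: Array> From<Vec<A::Item>> for SmallVec<A> {
    fn from(vec: Vec<A::Item>) -> Self {
        // No elements are copied; we just take ownership of the
        // existing heap buffer.
        SmallVec::Heap(vec)
    }
}
```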

Now for the fun one. Let’s see SmallVec::extend again:

```rust
impl<A: Array> Extend<A::Item> for SmallVec<A> {
    fn extend<I>(&mut self, iterable: I)
    where
        I: IntoIterator<Item = A::Item>,
    {
        let iter = iterable.into_iter();
        let (lower_size_bound, _) = iter.size_hint();

        let target_len = self.len + lower_size_bound;

        if target_len > self.capacity() {
            self.grow(target_len);
        }

        for elem in iter {
            self.push(elem);
        }
    }
}
```

So what’s wrong with this? Well, let’s take a look at the source of push :

```rust
pub fn push(&mut self, value: A::Item) {
    let cap = self.capacity();
    if self.len == cap {
        self.grow(cmp::max(cap * 2, 1))
    }
    unsafe {
        let end = self.as_mut_ptr().offset(self.len as isize);
        ptr::write(end, value);
        let len = self.len;
        self.set_len(len + 1)
    }
}
```

It’s the same problem again: we know that we don’t need to grow for the first lower_size_bound iterations (we already reserved that much space), but we check anyway on every push. Plus we’re making LLVM’s job hard. In principle it could notice that we set the capacity up front and then compare it against self.len in each iteration, split the loop in two, omit the check for the first capacity - len iterations, and from there work out that the writes can be vectorised - but that again relies on a specific set of optimisations being applied in a specific order. We also store to self.len on every iteration of the loop; LLVM could load self.len into a register, increment it there, and store it back once at the end, but that is not a simple enough optimisation for us to expect it to be performed reliably. Besides, that optimisation cannot be performed at all for an iterator that may panic, since in that case self.len must be correct after each iteration: other code could observe it after the panic (even without catch_unwind , code in implementations of Drop::drop could observe it). We can do this optimisation ourselves and make the code easier for LLVM to understand:

```rust
fn extend<I>(&mut self, iterable: I)
where
    I: IntoIterator<Item = A::Item>,
{
    let mut iter = iterable.into_iter();
    let (lower_size_bound, _) = iter.size_hint();

    let target_len = self.len + lower_size_bound;

    if target_len > self.capacity() {
        self.grow(target_len);
    }

    // This section is new
    unsafe {
        let ptr = self.as_mut_ptr().offset(self.len() as isize);
        let len = self.len();
        let mut count = 0;
        for p in 0..lower_size_bound as isize {
            if let Some(out) = iter.next() {
                ::std::ptr::write(ptr.offset(p), out);
                count += 1;
            } else {
                break;
            }
        }
        self.set_len(len + count);
    }

    for elem in iter {
        self.push(elem);
    }
}
```

You can see that on the fast path (the path where iter.size_hint() tells the truth) we just write each element with ptr::write . We still branch inside the loop, but for most iterators - slices, vectors, and wrappers around either that preserve size_hint - the compiler can remove that branch entirely after monomorphisation. For those where it can’t, the else branch is taken at most once per run of the loop, which makes it extremely predictable even for CPUs with only basic branch predictors. We then do a “real” store to the len field only once, and count can easily be kept in a register. This optimisation isn’t defeated by an iterator that may panic, either: since we only store to memory after the loop, if the iterator panics mid-loop then the calling code just sees the value of self.len from before we entered it. We can do this transformation because we know that leaking data when panicking is fine, but LLVM cannot, because it would change the meaning of the program.
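As a quick illustration of which path an iterator takes (my own example, not from the library’s tests): slices report an exact size_hint and stay on the fast path, while an adaptor like filter lowers its reported lower bound to 0 and falls through to the push loop.

```rust
use smallvec::SmallVec;

fn main() {
    let mut v: SmallVec<[u32; 8]> = SmallVec::new();

    // Exact size_hint of (3, Some(3)): three straight ptr::writes.
    v.extend([1, 2, 3].iter().cloned());

    // `filter` reports a lower bound of 0, so every element here
    // goes through the fallback `push` loop instead.
    v.extend((4..8).filter(|x| x % 2 == 0));

    assert_eq!(&v[..], &[1, 2, 3, 4, 6]);
}
```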

We can see that the difference after this optimisation is large:

| name | smallvec.bench ns/iter | vec.bench ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| extend | 102 | 70 | -32 | -31.37% | x 1.46 |
| extend_small | 25 | 28 | 3 | 12.00% | x 0.89 |

We’re much closer to Vec 's speed than before, and at last SmallVec is faster for the case where it doesn’t have to allocate.