This article is also available in Russian.

What’s faster, incrementing a std::vector of uint8_t or of uint32_t ?

To make it concrete, let’s consider these two implementations:

void vector8_inc(std::vector<uint8_t>& v) {
    for (size_t i = 0; i < v.size(); i++) {
        v[i]++;
    }
}

void vector32_inc(std::vector<uint32_t>& v) {
    for (size_t i = 0; i < v.size(); i++) {
        v[i]++;
    }
}

Let’s Guess

It’s easy to answer this question with a benchmark, and we will get to that soon, but for now let’s take some guesses (also known as “reasoning from first principles” if you want to pretend there is a lot of science in it).

First, we might reasonably ask: How big are these vectors?

Let’s pick a number then. I’m going with 20,000 elements in each.

Then knowing the hardware we’re testing on, Intel Skylake, we can check the characteristics of 8-bit and 32-bit add immediate instructions. It turns out that their primary performance characteristics are identical: 1 per cycle throughput and 4 cycles latency through memory. In this case, we don’t expect latency to matter since each add is independent, so we might hope this runs at 1 cycle per element, if the rest of the loop overhead can run in parallel.

We might also note that 20,000 elements means a working set of 20 KB for the uint8_t vector, but 80 KB for the uint32_t vector. The former fits cleanly in the L1 cache of modern x86 machines, while the second doesn’t. So maybe the 8-bit version will have a leg up due to caching effects?

Finally, we might note that this problem seems like a textbook case for auto-vectorization: a loop with a known number of iterations doing an arithmetic operation on adjacent elements in memory. In that case, we might expect the 8-bit version to have a huge advantage over the 32-bit version, since every vector operation will process 4 times as many elements, and in general, on Intel, the byte element vector operations have the same performance as their 32-bit element counterparts.

Enough prologue. Let’s benchmark this.

The Benchmark

For 20,000 elements, I get the following timings (in cycles per element) with gcc 8 and clang 8 at various optimization levels:

Compiler   Element     -O1   -O2   -O3
gcc 8      uint8_t     2.0   2.0   2.0
gcc 8      uint32_t    2.3   1.3   0.2
clang 8    uint8_t     9.2   2.0   2.0
clang 8    uint32_t    9.2   0.2   0.2
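If you want to reproduce the comparison yourself, something along these lines will do. This is just a sketch of a harness (the helper name, iteration count, and use of std::chrono are my own, not from the measurements above, which used hardware-specific counters on Skylake); absolute numbers will vary with your machine and compiler.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

void vector8_inc(std::vector<uint8_t>& v) {
    for (size_t i = 0; i < v.size(); i++) v[i]++;
}

void vector32_inc(std::vector<uint32_t>& v) {
    for (size_t i = 0; i < v.size(); i++) v[i]++;
}

// Time repeated calls to f(v) and report nanoseconds per element.
template <typename V, typename F>
double bench(V& v, F f, int iters = 10000) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {
        f(v);
    }
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return ns / (double(iters) * v.size());
}
```

Dividing the ns/element result by your clock period gives approximate cycles per element, comparable to the table above.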

The conclusion is that except at -O1 , uint32_t is faster than uint8_t , sometimes dramatically so: for gcc at -O3 the speedup was 5.4x, and for clang at either -O2 or -O3 the speedup was 8.0x. Yes, incrementing 32-bit integers in a std::vector is up to eight times faster than incrementing 8-bit ones, on a popular compiler with common optimization settings.

As usual, we hope the assembly will tell us what is going on.

Here’s gcc 8 at -O2 where the 32-bit version was “only” 1.5x faster than the 8-bit one:

8-bit:

.L3:
    inc     BYTE PTR [rdx+rax]
    mov     rdx, QWORD PTR [rdi]
    inc     rax
    mov     rcx, QWORD PTR [rdi+8]
    sub     rcx, rdx
    cmp     rax, rcx
    jb      .L3

32-bit:

.L9:
    inc     DWORD PTR [rax]
    add     rax, 4
    cmp     rax, rdx
    jne     .L9

The 32-bit code looks like what we’d expect from a non-unrolled loop: an increment with a memory destination, and then three loop-control instructions: add rax, 4 to increment the induction variable, and a cmp/jne pair to check and branch on the loop termination condition. It looks fine – an unroll would help amortize the loop overhead, getting us closer to the 1 cycle per element store limit, but it’s good enough for open source work.

So what’s going on with the 8-bit version? In addition to the memory-destination inc , there are two other loads from memory and some extra arithmetic, like a sub that appeared out of nowhere.

Here’s an annotated version:

8-bit:

.L3:
    inc     BYTE PTR [rdx+rax]      ; increment memory at v[i]
    mov     rdx, QWORD PTR [rdi]    ; load v.begin
    inc     rax                     ; i++
    mov     rcx, QWORD PTR [rdi+8]  ; load v.end
    sub     rcx, rdx                ; end - begin (i.e., vector.size())
    cmp     rax, rcx                ; i < size()
    jb      .L3                     ; next iteration if i < size()

Here vector::begin and vector::end are the internal pointers that std::vector keeps to indicate the beginning and end of the elements stored in its internal storage - basically the same values used to implement vector::begin() and vector::end() (although the type differs). So all the extra instructions are a consequence of calculating vector.size() . Maybe that doesn’t seem too weird? Well, maybe it should, because the 32-bit version uses size() too, yet the size() related instructions don’t appear in that loop. Instead, the size() calculation happens once, outside the loop.

What’s the difference? The short answer is pointer aliasing. The long answer is up next.

The Long Answer

The vector v is passed to the function as a reference, which boils down to a pointer under the covers. The compiler needs to use the v::begin and v::end members of the vector to calculate size() , and in the source we’ve written size() is evaluated every iteration. The compiler isn’t necessarily a slave to the source however: it can hoist the result of the size() function above the loop, but only if it can prove that this doesn’t change the semantics of the program. In that regard, the only problematic part of the loop is the increment v[i]++ . This writes to some unknown memory location. The question is: can that write change the value of size() ?

In the case of writes to a std::vector<uint32_t> (which boil down to writes to uint32_t * ) the answer is no, it cannot change size(). Writes to uint32_t objects can only change uint32_t objects, and the pointers involved in the calculation of size() are not uint32_t objects.

On the other hand, for uint8_t , at least for common compilers, the answer is yes, it could in theory change the value of size() , because uint8_t is a typedef for unsigned char , and unsigned char (and char ) arrays are allowed to alias any type. That means that writes through uint8_t pointers are treated as potentially updating any memory of unknown provenance. So the compiler assumes that every increment v[i]++ potentially modifies size() and hence must recalculate it every iteration.
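The same effect can be reproduced outside of std::vector. Here is a minimal sketch of the aliasing difference (the function names and signatures are my own, for illustration): the byte store in the first function may legally modify *n, so the compiler must reload *n every iteration; in the second it cannot, so *n can be hoisted.

```cpp
#include <cstddef>
#include <cstdint>

// Because uint8_t is unsigned char on common implementations, the
// store p[i]++ may legally alias *n, so the compiler must reload *n
// on every iteration rather than hoisting it above the loop.
void inc_all8(uint8_t* p, const size_t* n) {
    for (size_t i = 0; i < *n; i++) {
        p[i]++;
    }
}

// A store through uint32_t* cannot modify a size_t object, so the
// compiler is free to load *n just once, before the loop.
void inc_all32(uint32_t* p, const size_t* n) {
    for (size_t i = 0; i < *n; i++) {
        p[i]++;
    }
}
```

Note that the const on the size_t* doesn’t help: const promises only that these functions won’t write through n, not that the pointed-to value can’t change by other means.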

Now, you and I know that writes to the memory pointed to by a std::vector are never going to modify its own size() - that would mean the vector object had somehow been allocated inside its own heap storage, an almost impossible chicken-and-egg scenario. Unfortunately, the compiler doesn’t know that!

What about the rest?

So that explains the small performance difference between uint8_t and uint32_t for gcc at -O2 . How about the huge difference, up to 8x, that we see for clang at -O2 and -O3 , or for gcc at -O3 ?

That’s easy - clang can autovectorize this loop in the uint32_t case:

.LBB1_6:
    vmovdqu ymm1, ymmword ptr [rax + 4*rdi]
    vmovdqu ymm2, ymmword ptr [rax + 4*rdi + 32]
    vmovdqu ymm3, ymmword ptr [rax + 4*rdi + 64]
    vmovdqu ymm4, ymmword ptr [rax + 4*rdi + 96]
    vpsubd  ymm1, ymm1, ymm0
    vpsubd  ymm2, ymm2, ymm0
    vpsubd  ymm3, ymm3, ymm0
    vpsubd  ymm4, ymm4, ymm0
    vmovdqu ymmword ptr [rax + 4*rdi], ymm1
    vmovdqu ymmword ptr [rax + 4*rdi + 32], ymm2
    vmovdqu ymmword ptr [rax + 4*rdi + 64], ymm3
    vmovdqu ymmword ptr [rax + 4*rdi + 96], ymm4
    vmovdqu ymm1, ymmword ptr [rax + 4*rdi + 128]
    vmovdqu ymm2, ymmword ptr [rax + 4*rdi + 160]
    vmovdqu ymm3, ymmword ptr [rax + 4*rdi + 192]
    vmovdqu ymm4, ymmword ptr [rax + 4*rdi + 224]
    vpsubd  ymm1, ymm1, ymm0
    vpsubd  ymm2, ymm2, ymm0
    vpsubd  ymm3, ymm3, ymm0
    vpsubd  ymm4, ymm4, ymm0
    vmovdqu ymmword ptr [rax + 4*rdi + 128], ymm1
    vmovdqu ymmword ptr [rax + 4*rdi + 160], ymm2
    vmovdqu ymmword ptr [rax + 4*rdi + 192], ymm3
    vmovdqu ymmword ptr [rax + 4*rdi + 224], ymm4
    add     rdi, 64
    add     rsi, 2
    jne     .LBB1_6

The loop has been unrolled 8x, and this is basically as fast as you’ll get: approaching one vector (8 elements) per cycle in the L1 cache, limited by one store per cycle.

No vectorization happens in the uint8_t case, because the need to recalculate size() for the loop-termination check after every single element completely inhibits it. So the ultimate cause is the same as in the first case we looked at, but the impact is much bigger.

Auto-vectorization explains all the very fast results: gcc only autovectorizes at -O3 but clang does it at -O2 and -O3 by default. The -O3 result for gcc is somewhat slower than clang because it does not unroll the autovectorized loop.

Fixing it

So now that we know what the problem is, how do we fix it?

First, we’ll try one that doesn’t work – using a more idiomatic iterator-based loop, like this:

for (auto i = v.begin(); i != v.end(); ++i) {
    (*i)++;
}

It generates somewhat better code, compared to the size() version, at gcc -O2 :

.L17:
    add     BYTE PTR [rax], 1
    add     rax, 1
    cmp     QWORD PTR [rdi+8], rax
    jne     .L17

The two extra reads have turned into one, because we just compare against the end pointer stored in the vector, rather than recalculating size() from both the begin and end pointers. It actually ties the uint32_t version in instruction count, since the extra load was folded into the comparison. However, the underlying problem is still there and auto-vectorization is still inhibited, so there remains a very large gap between uint8_t and uint32_t - more than 5x for both gcc and clang in builds where auto-vectorization is allowed.

Let’s try again. Again we will fail - or rather, succeed in finding yet another way that doesn’t work.

This is the one where we calculate size() only once, before the loop, and stash it in a local variable:

for (size_t i = 0, s = v.size(); i < s; i++) {
    v[i]++;
}

It seems like this should work, right? The problem was size() , and now we are saying “freeze size() into a local s at the start of the loop” - and the compiler knows that locals cannot overlap with anything. Basically, we manually hoist the value the compiler couldn’t. It does generate better code (compared to the original version):

.L9:
    mov     rdx, QWORD PTR [rdi]
    add     BYTE PTR [rdx+rax], 1
    add     rax, 1
    cmp     rax, rcx
    jne     .L9

There is only one extra read and no sub . So what is that extra read ( mov rdx, QWORD PTR [rdi] ) doing, if it isn’t part of calculating the size? It’s loading the data() pointer from v !

Effectively, the implementation of v[i] is *(v.data() + i) , and the member underlying data() (which is in fact just the begin pointer) has the same potential aliasing problem as size() . I just didn’t notice it in the original example, because there we got that read “for free”: we were loading the begin pointer anyway to compute the size.
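The index-based loop can be rescued too, by hoisting both values. Here is a sketch (the function name is my own): writes through a uint8_t* cannot be assumed to modify local variables, so once both the data pointer and the size live in locals, the loop no longer reads the vector object itself on each iteration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hoist both the data() pointer and the size into locals; the byte
// stores cannot alias local variables, so neither value needs to be
// reloaded inside the loop and vectorization is no longer blocked.
void vector8_inc_hoisted(std::vector<uint8_t>& v) {
    uint8_t* data = v.data();
    size_t size = v.size();
    for (size_t i = 0; i < size; i++) {
        data[i]++;
    }
}
```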

We’re almost there, I promise. We just need to remove the dependence on anything stored inside the std::vector in the loop. This is easiest if we modify the idiomatic iterator based approach like this:

for (auto i = v.begin(), e = v.end(); i != e; ++i) {
    (*i)++;
}

The results are dramatic (here we compare only uint8_t loops, with and without saving the end iterator before the loop):

Compiler   Loop Check     -O1    -O2    -O3
gcc 8      i != v.end()   1.3    1.3    1.3
gcc 8      i != e         1.3    1.3    0.06
gcc speedup               1x     1x     20x
clang 8    i != v.end()   29.9   1.3    1.3
clang 8    i != e         20.3   0.06   0.06
clang speedup             1.5x   20x    20x

After this small change, the speedup is a factor of twenty for the cases where vectorization is able to kick in. We didn’t just tie the uint32_t case either: for gcc at -O3 and clang at -O2 and -O3 , the speedup of the uint8_t code over uint32_t is almost exactly 4x, as we originally expected from vectorization: after all, 4 times as many elements are handled per vector operation, and we need 4 times less bandwidth to whatever level of cache is involved.

Some of you who’ve gotten this far have probably been screaming, nearly since the start:

What about the range-based for loop introduced in C++11?

Good news! It works fine. It is syntactic sugar for almost exactly the iterator-based version above, with the end pointer captured once before the loop, and so it performs equivalently.
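For concreteness, here is the range-based version (the function name is my own):

```cpp
#include <cstdint>
#include <vector>

// Range-based for: the desugaring captures begin() and end() once,
// before the loop, so it optimizes like the hoisted i != e version.
void vector8_inc_range(std::vector<uint8_t>& v) {
    for (auto& e : v) {
        e++;
    }
}
```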

At the opposite end of the scale from modern C++, if I had just summoned my inner caveman and written a C-like function, we would have also been fine:

void array_inc(uint8_t* a, size_t size) {
    for (size_t i = 0; i < size; i++) {
        a[i]++;
    }
}

Here, the array pointer a and the size are by-value function parameters, so they cannot be modified by any writes through a , just like local variables. It performs equivalently to the other successful options.

Finally, for compilers that offer it, you can use __restrict on the vector declaration, like this:

void vector8_inc_restrict(std::vector<uint8_t>& __restrict v) {
    for (size_t i = 0; i < v.size(); i++) {
        v[i]++;
    }
}

The __restrict keyword is not standard C++, but it has been part of C since C99 (as restrict ). For compilers that implement it as an extension in C++, you can reasonably expect it to follow the C semantics. Of course, C doesn’t have references at all, so you can mentally substitute a pointer to the vector for the reference.

Note that restrict is not transitive: the __restrict qualifier on the std::vector reference doesn’t apply restrict to the heap storage backing v.data() ; it applies only to the members of the vector object itself. In this case that is enough, because (as we saw above with local variables) proving to the compiler that the vector’s begin/end members themselves don’t overlap anything is sufficient. The distinction is still relevant: it means that writes through v.data() may still cause aliasing problems for other objects in your function.

Disappointment

That brings us to our last, and perhaps most disappointing note: all of the ways we’ve fixed the function above really only apply directly to this specific case of a vector interfering with itself. We’ve fixed it by hoisting or isolating the state backing the size() or end() calls in the vector, and not by teaching the compiler that writes to the vector data don’t modify anything. That doesn’t scale well as our function grows.

The aliasing loophole is still there, and those writes still alias “anything” - it’s just that there isn’t anything else interesting to alias in this function … yet. As soon as you add more to the function, it bites us again. You’ll be fighting the compiler to the end as long as you have writes to uint8_t arrays in your core loops.
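As a hypothetical illustration (this function and its parameters are mine, not a real example from the article): even with the size passed by value, a loop that writes bytes while reading through another pointer runs into the same problem, because the byte stores may alias the pointed-to value.

```cpp
#include <cstddef>
#include <cstdint>

// The byte stores may alias *scale, so the compiler must assume
// *scale changes on every iteration and reload it each time, again
// inhibiting hoisting and vectorization.
void scale_bytes(uint8_t* data, size_t size, const uint32_t* scale) {
    for (size_t i = 0; i < size; i++) {
        data[i] = static_cast<uint8_t>(data[i] * *scale);
    }
}
```

Passing scale by value (uint32_t scale) would avoid the reload, for the same reason the by-value size parameter worked above.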

Feedback of any type is welcome. I don’t have a comments system yet, so as usual I’ll outsource discussion to this HackerNews thread.

If you liked this post, check out the homepage for others you might enjoy.



