Recently, Daniel Lemire tackled the topic of selecting N distinct numbers at random. In the case we want sorted output, an obvious solution presents itself: sorting randomly chosen values and de-duplicating the list, which is easy since identical values are now adjacent.

While Daniel suggests a clever method of avoiding a sort entirely, I’m also interested in they why for the underlying performace of the sort method: it takes more than 100 ns per element, which means 100s of CPU clock cycles and usually even more instructions than that (on a superscalar processor)! As a sanity check, a quick benchmark ( perf record ./bench && perf report ) shows that more than 90% of the time spent in this approach is in the sorting routine, qsort - so we are right to focus on this step, rather than say the de-duplication step or the initial random number generation. This naturally, this raises the question: how fast is qsort when it comes to sorting integers and can we do better?

All of the code for this post is available on GitHub, so if you’d like to follow along with the code open in an editor, go right ahead (warning: there are obviously some spoilers if you dig through the code first).

Benchmarking Qsort

First, let’s take a look at what qsort is doing, to see if there is any delicous low-hanging performance fruit. We use perf record ./bench qsort to capture profiling data, and perf report --stdio to print a summary:

# Samples: 101K of event 'cycles:ppp' # Event count (approx.): 65312285835 # # Overhead Command Shared Object Symbol # ........ ....... ................. .............................................. # 64.90% bench libc-2.23.so [.] msort_with_tmp.part.0 21.45% bench bench [.] compare_uint64_t 8.65% bench libc-2.23.so [.] __memcpy_sse2 0.87% bench libc-2.23.so [.] __memcpy_avx_unaligned 0.83% bench bench [.] main 0.41% bench [kernel.kallsyms] [k] clear_page_erms 0.34% bench [kernel.kallsyms] [k] native_irq_return_iret 0.31% bench bench [.] bench_one

The assembly for the biggest offender, msort_with_tmp looks like this :

Percent | Address | Disassembly -------------------------------------------------- 30.55 : 39200: mov rax,QWORD PTR [r15] 0.61 : 39203: sub rbp,0x1 0.52 : 39207: add r15,0x8 7.30 : 3920b: mov QWORD PTR [rbx],rax 0.39 : 3920e: add rbx,0x8 0.07 : 39212: test r12,r12 0.09 : 39215: je 390e0 ; merge finished 1.11 : 3921b: test rbp,rbp 0.01 : 3921e: je 390e0 ; merge finished 5.24 : 39224: mov rdx,QWORD PTR [rsp+0x8] 0.42 : 39229: mov rsi,r15 0.19 : 3922c: mov rdi,r13 6.08 : 3922f: call r14 0.59 : 39232: test eax,eax 3.52 : 39234: jg 39200 32.69 : 39236: mov rax,QWORD PTR [r13+0x0] 1.31 : 3923a: sub r12,0x1 1.01 : 3923e: add r13,0x8 1.09 : 39242: jmp 3920b <bsearch@@GLIBC_2.2.5+0x205b>

Depending on your level of assembly reading skill, it may not be obvious, but this is basically a classic merge routine: it is merging two lists by comparing the top elements of each list (pointed to by r13 and r15 ), and then storing the smaller element (the line QWORD PTR [rbx],rax ) and loading the next element from that list. There are also two checks for termination ( test r12,r12 and test rbp,rbp ). This hot loop corresponds directly to this code from glibc (from the file msort.c ) :

while ( n1 > 0 && n2 > 0 ) { if (( * cmp ) ( b1 , b2 , arg ) <= 0 ) { * ( uint64_t * ) tmp = * ( uint64_t * ) b1 ; b1 += sizeof ( uint64_t ); -- n1 ; } else { * ( uint64_t * ) tmp = * ( uint64_t * ) b2 ; b2 += sizeof ( uint64_t ); -- n2 ; } tmp += sizeof ( uint64_t ); }

This loop suffers heavily from branch mispredictions, since the “which element is larger” branch is highly unpredictable (at least for random-looking input data). Indeed, we see roughly 128 million mispredicts while sorting ~11 million elements: close to 12 mispredicts per element.

We also note the presence of the indirect call at the call r14 line. This corresponds to the (*cmp) (b1, b2, arg) expression in the source: it is calling the user provided comparator function through a function pointer. Since the qsort() code is compiled ahead of time and is found inside the shared libc binary, there is no chance that the comparator, passed as a function pointer, can be inlined.

The comparator function I provide looks like:

int compare_uint64_t ( const void * l_ , const void * r_ ) { uint64_t l = * ( const uint64_t * ) l_ ; uint64_t r = * ( const uint64_t * ) r_ ; if ( l < r ) return - 1 ; if ( l > r ) return 1 ; return 0 ; }

which on gcc compiles to branch-free code:

mov rax,QWORD PTR [rsi] mov edx,0xffffffff cmp QWORD PTR [rdi],rax seta al movzx eax,al cmovb eax,edx ret

Note that the comparator has to redundantly load from memory the two locations to compare, something the merge loop already did (the merge loop reads them because it is responsible for moving the elements).

How much better could things get if we inline the comparator into the merge loop? That’s what we do in qsort-inlined , and here’s the main loop which now includes the comparator function :

0.07 : 401dc8: test rbp,rbp 0.66 : 401dcb: je 401e0c <void msort_with_tmp<CompareU64>(msort_param const*, void*, unsigned long, CompareU64)+0xbc> 3.51 : 401dcd: mov rax,QWORD PTR [r9] 5.00 : 401dd0: lea rdx,[rbx+0x8] 1.62 : 401dd4: mov rcx,QWORD PTR [rbx] 0.24 : 401dd7: lea r8,[r9+0x8] 6.96 : 401ddb: cmp rax,rcx 20.83 : 401dde: cmovbe r9,r8 8.88 : 401de2: cmova rbx,rdx 0.27 : 401de6: cmp rcx,rax 6.23 : 401de9: sbb r8,r8 0.74 : 401dec: cmp rcx,rax 4.93 : 401def: sbb rdx,rdx 0.24 : 401df2: not r8 6.69 : 401df5: add rbp,rdx 0.44 : 401df8: cmp rax,rcx 5.34 : 401dfb: cmova rax,rcx 5.96 : 401dff: add rdi,0x8 7.48 : 401e03: mov QWORD PTR [rdi-0x8],rax 0.00 : 401e07: add r15,r8 0.71 : 401e0a: jne 401dc8 <void msort_with_tmp<CompareU64>(msort_param const*, void*, unsigned long, CompareU64)+0x78>

A key differences is that the loop is now basically branch free. There are still two conditional jumps, but they are both just checking for the termination condition (that one of the lists to merge is exhausted), so we expect this loop to be free of branch mispredictions other than the final iteration. Indeed, we measure with perf stat that the misprediction rate has dropped from to close to 12 mispredicts per element to around 0.75 per element. The loop has only two loads and one store, so the memory access redundancy between the merge code and the comparator has been eliminated. Finally, the comparator does a three-way compare (returning distrinct results for < , > and == ), but the merge code only needs a two-way compare ( <= or > ) - inlining the comparator manages to remove extra code associated with distinguishing the < and == cases.

What’s the payoff? It’s pretty big:

The speedup hovers right around 1.77x. Note that this is much larger than simply eliminating all the time spent in the separate comparator function in the original version (about 17% of the time implying a speedup of 1.2x if all the function time disapeared). This is a good example of how inlining isn’t just about removing function call overhead but enabling further knock on optimizations which can have a much larger effect than just removing the overhead associated with function calls.

What about C++?

Short of copying the existing glibc (note: LGPL licenced) sorting code to allow inlining, what else can we do to speed things up? I’m writing in C++, so how about the C++ sort functions available in the <algorithm> header? Unlike C’s qsort which is generic by virtue of taking a function pointer and information about the object size, the C++ sort functions use templates to achieve genericity and so are implemented directly in header files. Since the sort code and the comparator are being compiler together, we expect the comparator to be easily inlined, and perhaps other optimizations may occur.

Without further ado, let’s just throw std::sort , std::stable_sort and std::partial_sort into the mix:

The C++ sort functions, other than perhaps std::partial_sort , put in a good showing. It is interesting that std::stable_sort which has stricly more requirements on its implementation than std::sort (i.e., any stable sort is also suitable for std::sort ) ends up faster. I re-wrote this paragaph several times, since sometimes after a reboot stable_sort was slower and sometimes it was faster (as shown above). When it was “fast” it had less than 2% branch mispredictions, and when it was slow it was at 15%. So perhaps there was some type of aliasing issue in the branch predictor which depends on the physical addresses assigned, which can vary from run to run, I’m not sure. See for an old note from when std::stable_sort was slower.

Can we do better?

So that’s as fast as it gets, right? We aren’t going to beat std::sort or std::stable_sort without a huge amount of effort, I think? After all, these are presumably highly optimized sorting routines written by the standard library implementors. Sure, we might expect to be able to beat qsort() , but that’s mostly because of built-in disadvantages that qsort has, lacking the ability to inline the comparator, etc.

Radix Sort Attempt 1

Well, one thing we can try is a non-comparison sort. We know we have integer keys, so why stick to comparing numbers pairwise - maybe we can use something like radix sort to stick them directly in their final location.

We can pretty much copy the description from the wikipedia article into C++ code that looks like this:

const size_t RADIX_BITS = 8 ; const size_t RADIX_SIZE = ( size_t ) 1 << RADIX_BITS ; const size_t RADIX_LEVELS = ( 63 / RADIX_BITS ) + 1 ; const uint64_t RADIX_MASK = RADIX_SIZE - 1 ; using queuetype = std :: vector < uint64_t > ; void radix_sort1 ( uint64_t * a , size_t count ) { for ( size_t pass = 0 ; pass < RADIX_LEVELS ; pass ++ ) { uint64_t shift = pass * RADIX_BITS ; std :: array < queuetype , RADIX_SIZE > queues ; // copy each element into the appropriate queue based on the current RADIX_BITS sized // "digit" within it for ( size_t i = 0 ; i < count ; i ++ ) { size_t value = a [ i ]; size_t index = ( value >> shift ) & RADIX_MASK ; queues [ index ]. push_back ( value ); } // copy all the queues back over top of the original array in order uint64_t * aptr = a ; for ( auto & queue : queues ) { aptr = std :: copy ( queue . begin (), queue . end (), aptr ); } } }

That’s about as simple as it gets. We decide to use one byte (i.e., radix-256) as the size of our “digit” (although it’s easy to change by adjusting the RADIX_BITS constant) and so we make 8 passes over our uint64_t array from the least to most significant byte. At each pass we assign the current value to one of 256 “queues” (vectors in this case) based on the value of the current byte, and once all elements have been processed we copy each queue in order back to the original array. We’re done - the list is sorted.

How does it perform against the usual suspects?

Well it’s not terrible, and while it certainly has some issues at low element counts, it actaully squeezes into first place at 1,000,000 elements and is competitive at 100,000 and 10,000,000. Not bad for a dozen lines of code.[^rewrite]

A quick check of time ./bench Radix1 shows something interesting:

real 0m1.099s user 0m0.552s sys 0m0.548s

We spent about the same amount of time in the kernel ( sys time) as in user space. The other algorithms spend only a few lonely % in the kernel, and almost most of that is in the setup code, not in the actual sort.

A deeper look with perf record && perf report shows:

# Samples: 4K of event 'cycles:ppp' # Event count (approx.): 2858148287 # # Overhead Command Shared Object Symbol # ........ ....... ................... ................................................ # 29.02% bench bench [.] radix_sort1 26.16% bench libc-2.23.so [.] __memmove_avx_unaligned 4.93% bench [kernel.kallsyms] [k] clear_page_erms 4.46% bench [kernel.kallsyms] [k] native_irq_return_iret 3.61% bench [kernel.kallsyms] [k] error_entry 3.04% bench [kernel.kallsyms] [k] swapgs_restore_regs_and_return_to_usermode 2.99% bench [kernel.kallsyms] [k] sync_regs 1.94% bench bench [.] main 1.91% bench [kernel.kallsyms] [k] get_page_from_freelist 1.64% bench libc-2.23.so [.] __memcpy_avx_unaligned 1.59% bench [kernel.kallsyms] [k] release_pages 1.40% bench [kernel.kallsyms] [k] __handle_mm_fault 1.37% bench [kernel.kallsyms] [k] _raw_spin_lock 1.01% bench [kernel.kallsyms] [k] __pagevec_lru_add_fn 0.88% bench [kernel.kallsyms] [k] handle_mm_fault 0.86% bench [kernel.kallsyms] [k] __alloc_pages_nodemask 0.83% bench [kernel.kallsyms] [k] unmap_page_range 0.75% bench [kernel.kallsyms] [k] try_charge 0.74% bench [kernel.kallsyms] [k] get_mem_cgroup_from_mm 0.70% bench bench [.] bench_one 0.63% bench [kernel.kallsyms] [k] __do_page_fault 0.63% bench [kernel.kallsyms] [k] __mod_zone_page_state 0.49% bench [kernel.kallsyms] [k] free_pcppages_bulk 0.45% bench [kernel.kallsyms] [k] page_add_new_anon_rmap 0.43% bench [kernel.kallsyms] [k] up_read 0.40% bench [kernel.kallsyms] [k] page_remove_rmap 0.36% bench [kernel.kallsyms] [k] __mod_node_page_state

I’m not even going to try to explain what __pagevec_lru_add_fn does, but the basic idea here is that we are spending a lot of time in the kernel, and we are doing that because we are allocating and freeing a lot of memory. Every pass we push_back every element into one of 256 vectors, which will be constantly growing to accomodate new elements, and then finally all the now-giant vectors are freed at the end of every allocation. That’s a lot of stress on the memory allocation paths in the kernel.

Radix Sort Attempt 2

Let’s try the first-thing-you-do-when-vector-is-involved-and-performance-matters; that is, let us reserve() memory for each vector before we start adding elements. Just throw this at the start of each pass:

for ( auto & queue : queues ) { queue . reserve ( count / RADIX_SIZE * 1.2 ); }

Here, 1.2 is an arbitrary fudge factor to account for the fact that some vectors will get more than the average number of elements. The exact value doesn’t matter much as long as it’s not too small (0.9 is a bad value, almost evey vector needs a final doubling). This gives use radix_sort2 and let’s jump straight to the results (I’ve removed a couple of the less interesting sorts to reduce clutter):

I guess it’s a bit better? It does better for small array sizes, probably because the overhead of constantly resizing the small vectors is more significant there, but it is actually a bit slower for the middle sizes. System time is lower but still quite high:

real 0m0.904s user 0m0.523s sys 0m0.380s

What we really want is to stop throwing away the memory we allocated every pass. Let’s move the queues outside of the loop and just clear them every iteration:

void radix_sort3 ( uint64_t * a , size_t count ) { using queuetype = std :: vector < uint64_t > ; std :: array < queuetype , RADIX_SIZE > queues ; // we keep the reservation code (now outside the loop), // although it matters less now since the resizing will // generally only happen in the first iteration for ( auto & queue : queues ) { queue . reserve ( count / RADIX_SIZE * 1.2 ); } for ( size_t pass = 0 ; pass < RADIX_LEVELS ; pass ++ ) { // ... as before // copy all the queues back over top of the original array in order uint64_t * aptr = a ; for ( auto & queue : queues ) { aptr = std :: copy ( queue . begin (), queue . end (), aptr ); queue . clear (); // <--- this is new, clear the queues } } }

Yes another graph with Radix3 included this time:

That looks a lot better! This radix sort is always faster than our earlier attempts and the fastest overall for sizes 10,000 and above. It still falls behind the std:: algorithms for the 1,000 element size, where the O(n) vs O(n*log(n)) difference doesn’t play as much of a role. Despite this minor victory, and while system is reduced, we are still spending 30% of our time in the kernel:

real 0m0.612s user 0m0.428s sys 0m0.184s

Pre-sizing the Queues

Sorting should be about my code, not the kernel - so let’s get rid of this kernel time for good.

To do that, we’ll move away from std::vector entirely and just allocate one large temporary region for all of our queues. Although we know the total final size of all the queues (it’s the same size as the input array), we don’t know how big any particular queue will be. This means we don’t know exactly how to divide up the region. A well-known solution to this problem is to first count the number of number of values that will fall into each queue so they can be sized appropriately (also known as taking the histogram of the data). As a bonus, we can count the frequencies for all radix passes in a single trip over the data, so we expect this part to be much cheaper than the radix sort proper which needs a separate pass for each “digit”.

Knowing the size of each queue allows us to pack all the values exactly within a single temporary region. The copy at the end of each stage is just a single linear copy. The code is longer now since we need to implement the frequency counting gives us radix_sort4. Results:

It’s a significant speedup over Radix 3, especially at small sizes (speedup about 3x) but still good at large sizes (about 1.3x for 10m elements). The speedup over poor qsort ranges from 3.7x to 5.45x, increasing at larger sizes. Even compared to the best contented from the standard library, std::stable_sort , the speedup averages about 2x.

Are We Done Yet?

So are we done yet? Can we squeeze out some more performance?

One little trick is to note that the temporary “queue area” and the original array are now of the same size and type, so rather than always performing the radix passes from the original array to the temporary area (which requires a copy back to the original array each time), we can instead copy back and forth between these two areas, alternating the “from” and “to” areas each time. This saves a copy each pass.

The results in radix_sort5 and it provides a small but measurable benefit:

It’s also interesting how small the improvement is. This change actually cuts the memory bandwidth requirements of the algorithm almost exactly in half: rather than reading and writing each element twice during each pass (once during the sort and once in the final copy), we read them only once (it’s not exactly half because the single histogramming pass adds another read). Yet the overall speedup is small, in the range of 1.05x to 1.2x. From this we can conclude that we are not approaching the memory bandwidth limits in the radix passes.

There is a catch here: at the end of the sort, if we have done an odd number of passes, the final sorted results will be in the temporary area, not in the original array, so we need to copy back to the original array - but 1 extra copy is better than 8! In any case, with RADIX_BITS == 8 as we’ve chosen, there are an even number of copies, so this code never executes in our benchmark.

Pointless Work is Pointless

Another observation we can make is that for this input (and many inputs in the real world), many of the radix passes do nothing. All the input values are less than 40,000,000,000. In 64-bit hex that looks like 0x00000009502F9000 - the top 28 bits are always zero. Any radix pass that uses these all-zero bits is pointless: every element will be copied to the first queue entry, one by one: essentially it’s a slow, convoluted memcpy .

We can simply skip these “trivial” passes by examining the frequency count: if all counts are zero except a single entry, the pass does nothing. This gives us radix_sort6, which ends up cutting out 3 of the 8 radix passes leading to performance like this (I’ve changed the scale to emphasize the faster algorithms as they were getting crowed down at the bottom):

In relative terms this provides a significant speedup ranging from 1.2x to 1.5x over radix_sort5 . The theoretical speedup from skipping 3 of the 8 passes is 1.6x, but we don’t achieve that because there is work outside of the core passes (counting the frequencies, for example) and also because the 3 trivial passes were actually slightly faster than the non-trivial ones because of better caching behavior.

Unpointless Prefetch

So how much more juice can we squeeze from this performance orange? Will this post ever come to an end? Has anyone even made it this far?

As it turns out we’re not done yet, and the next change is perhaps the easiest one yet, a one-liner. First, let us observe that the core radix sort loop does a linear read through the elements (very prefetch and cacheline locality friendly), and then a scattered store to one of 256 locations depending on the value. We might expect that those scattered stores are problematic, since we don’t expect the prefetcher to track 256 different streams. On the other hand, stores are somewhat “fire and forget”, because after we execute the store, they can just sit around in the store buffer for as long as needed while the associated cache lines are fetched: they don’t participate in any dependency chains. So maybe they aren’t causing a problem?

Let’s check that theory using the resource_stalls.sb event, which tells us how many cycles we stalled store buffer was full, using this magical invocation:

for i in {3..8}; do s=$((10**$i)); rm -f bench.o; echo "SIZE=$s, KB=$(($s*8/1000))"; make EFLAGS=-DSIZE=$s; perf stat -e cycles,instructions,resource_stalls.any,resource_stalls.sb ./bench Radix6; done

This tests a variety of different sizes and here’s typical output when the array to sort has 1 million elements (8 MB):

SIZE=1000000, KB=8000 ... Performance counter stats for './bench Radix6': 5,544,710,461 cycles 8,219,301,287 instructions # 1.48 insn per cycle 2,917,340,919 resource_stalls.any 2,454,399,834 resource_stalls.sb

out of 5.54 billion cycles, we are stalled because the store buffer is full in 2.45 billion of them. So … a lot.

One fix for this is a one-liner in the main radix-sort loop:

for (size_t i = 0; i < count; i++) { size_t value = from[i]; size_t index = (value >> shift) & RADIX_MASK; *queue_ptrs[index]++ = value; __builtin_prefetch(queue_ptrs[index] + 1); // <-- the magic happens here }

That’s it. We prefetch the next plus one position in the queue after writing to the queue. 87.5% of the time this does nothing, since the next position is either in the same cache line (6 out of 8 times) or we already prefetched it the last time we wrote to this queue (1 out of 8 times).

The other 12.5% of the time it helps, producing results like this:

The speedup is zero at the smallest size (1000 elements, aka 8 KB) which fits in L1, but ranges between 1.31x and 1.45x as soon as the data set exceeds L1. In principle, I wouldn’t expect prefetch to help here: we expect it to help for loads if we can start the load early, but for stores, with a full store buffer the CPU can already pick from 50+ stores to start prefetching. That is, the CPU already knows what stores are coming becaue the store buffer is full of them. However, in practice, theory and practice are different and in particular Intel CPUs seem to struggle when loads that hit in L1 are interleaved with loads that don’t, especially with recent microcode.

That interleaved scenario will happen all the time with this type of scattered write pattern, and as noted in the earlier post, prefetch is one way of mitigating this. For the 8 MB working set we now have the following perf stat results:

SIZE=1000000, KB=8000 ... Performance counter stats for './bench Radix7': 4,342,298,986 cycles 8,719,315,140 instructions # 2.01 insn per cycle 1,555,574,009 resource_stalls.any 623,342,154 resource_stalls.sb 1.675302704 seconds time elapsed

About a four-fold reduction in store-buffer stalls. Reductions are even more dramatic for other sizes: the 100,000 element (8 KB) size has an even larger reduction, from being stalled 43% of the time down to 5% after prefetching is added.

What’s Next?

What’s next is that I have to eat. So while we’re not done here yet, the remaining part of this trip down the radix sort hole will have to wait for part 2.

If you liked this post, check out the homepage for others you might enjoy.

Boxing photo by Hermes Rivera on Unsplash .



