Challenge your performance intuition with nanosecond sorting

In the physical world, nanotechnology reliably brings us surprises. Structured at the nano-scale, the most common materials, such as carbon, iron, or copper, gain new unforeseen properties. What we know about them at the macro-scale no longer applies.

Surprisingly, a very similar effect happens in computation. If the operation you want to speed up already runs in a few nanoseconds, your usual reasoning about algorithmic complexity probably doesn't apply.

The most effective algorithms become mediocre while the useless ones rise from oblivion to shine and amaze. One of these algorithms is the index sort. It has unconditional O(n²) complexity, which makes it probably the least performant sorting algorithm you've never heard of. It's useless at the macro-scale. But at the nano-scale, it excels.

Index sort

The idea of index sort is this: for every item in the unsorted array, you can find its index in the sorted array by comparing it to all the others.

Example

X (unsorted): 3 2 4 2

    Xi | Comparisons          | Index
     3 |    .   > 2   > 4   > 2 | 1 + 0 + 1 = 2
     2 | >= 3     .   > 4   > 2 | 0 + 0 + 0 = 0
     4 | >= 3  >= 2     .   > 2 | 1 + 1 + 1 = 3
     2 | >= 3  >= 2  >= 4     . | 0 + 1 + 0 = 1

Xs (sorted): 2 2 3 4 (Xs[2] = 3, Xs[0] = 2, Xs[3] = 4, Xs[1] = 2)
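In code, the counting above can be sketched as follows (a minimal illustration for the four-element example, not the benchmarked version; the name index_sort4 is made up for this sketch):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// For every element, its position in the sorted array is the number of
// elements it "beats": strict '>' against elements that come after it,
// '>=' against elements that come before it, so equal values still get
// distinct indices.
std::array<int, 4> index_sort4(const std::array<int, 4>& a) {
    std::array<int, 4> sorted{};
    for (std::size_t i = 0; i < 4; ++i) {
        std::size_t k = 0;
        for (std::size_t j = 0; j < 4; ++j) {
            if (j > i) k += a[i] > a[j];
            else if (j < i) k += a[i] >= a[j];
        }
        sorted[k] = a[i];
    }
    return sorted;
}
```

So index_sort4({3, 2, 4, 2}) yields {2, 2, 3, 4}, matching the table.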


It's not algorithmically elegant. But it is deterministic, meaning you can avoid branching completely. On small arrays, this should be a noticeable advantage.

And now for the challenge

Let's compare how different sorts work on triplets — arrays of three elements. You might think this is a rare task, but when you work with 3D graphics it comes up surprisingly often. Last year alone I sorted triplets for triangle indices, for cubic equation roots, and for the dimensions of axis-aligned bounding boxes.

Tasks like that are rarely bottlenecks, but if you can speed up your algorithm by even 12% with no measurable effort, it's probably worth doing.

Anyway, here is my benchmark.

    #include <array>
    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <random>

    constexpr size_t samples = 10'000'000;
    std::array<std::array<int, 3>, samples> g_data;

    static void ResetData() {
        std::mt19937 rng(0);
        std::uniform_int_distribution<int> random_number(1, 100);
        for (auto& numbers : g_data) {
            numbers[0] = random_number(rng);
            numbers[1] = random_number(rng);
            numbers[2] = random_number(rng);
        }
    }

    int CheckDataForMissorts() {
        int missorts = 0;
        for (auto& numbers : g_data) {
            if (numbers[1] < numbers[0] || numbers[2] < numbers[1])
                missorts++;
        }
        return missorts;
    }

    #define MEASURE(CODE_TO_MEASURE, NAME_TO_PRINT) \
    { \
        ResetData(); \
        auto start = std::chrono::system_clock::now(); \
        for (auto& t : g_data) { \
            CODE_TO_MEASURE \
        } \
        auto end = std::chrono::system_clock::now(); \
        std::chrono::duration<double> time = end - start; \
        std::cout << time.count() << " - " << NAME_TO_PRINT; \
        std::cout << "\n"; \
        std::cout << "missorts: " << CheckDataForMissorts(); \
        std::cout << "\n\n"; \
    }

It generates a lot of triplets, runs a piece of code a lot of times, and reports the total running time.
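For completeness, here is how a round is invoked (a self-contained sketch; the sample count is cut down from the real benchmark so it finishes instantly, and the name string is mine):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <chrono>
#include <iostream>
#include <random>

// Same structure as the benchmark above, with fewer samples.
constexpr size_t samples = 100'000;
std::array<std::array<int, 3>, samples> g_data;

static void ResetData() {
    std::mt19937 rng(0);
    std::uniform_int_distribution<int> random_number(1, 100);
    for (auto& numbers : g_data)
        for (auto& n : numbers)
            n = random_number(rng);
}

int CheckDataForMissorts() {
    int missorts = 0;
    for (auto& numbers : g_data)
        if (numbers[1] < numbers[0] || numbers[2] < numbers[1])
            missorts++;
    return missorts;
}

#define MEASURE(CODE_TO_MEASURE, NAME_TO_PRINT) \
{ \
    ResetData(); \
    auto start = std::chrono::system_clock::now(); \
    for (auto& t : g_data) { \
        CODE_TO_MEASURE \
    } \
    auto end = std::chrono::system_clock::now(); \
    std::chrono::duration<double> time = end - start; \
    std::cout << time.count() << " - " << NAME_TO_PRINT << "\n"; \
    std::cout << "missorts: " << CheckDataForMissorts() << "\n\n"; \
}
```

A round is then a single line, e.g. MEASURE(std::sort(t.begin(), t.end());, "std::sort").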

All the pieces of code are compiled with GCC 5.4.0: g++ -std=c++14 -O3, and measured on Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz.

Everything (spoilers included) is available on GitHub.

Using your intuition and best judgment, please estimate their relative performance before revealing the answer.

Round 1.

The main question, the one that has to be answered before anything else: why not just use std::sort?

std::sort:

    std::sort(t.begin(), t.end());

Index sort, unrolled for a triplet:

    const auto a = t[0];
    const auto b = t[1];
    const auto c = t[2];
    t[int(a > b) + int(a > c)] = a;
    t[int(b >= a) + int(b > c)] = b;
    t[int(c >= a) + int(c >= b)] = c;

Reveal the truth

Index sort is five times faster in this context. But in a different context, it might not do nearly as well. For instance, on ARMv7 (v7l) it's only 2 times faster, and with Clang (Debian clang 3.5.0) the gap is even less drastic. Clang's sort generally seems slightly better on small arrays; on my testing Intel i7, it's also 20% faster than the GCC version. Still, not enough to make it interesting, so let's move along with the index sort.

Round 2.

Sorting triplets is fun but can we generalise it with no additional overhead?

The unrolled triplet version:

    const auto a = t[0];
    const auto b = t[1];
    const auto c = t[2];
    t[int(a > b) + int(a > c)] = a;
    t[int(b >= a) + int(b > c)] = b;
    t[int(c >= a) + int(c >= b)] = c;

The generalised version:

    template <size_t N>
    void index_sort(std::array<int, N>& t) {
        std::array<int, N> a = t;
        for(auto i = 0u; i < N; ++i) {
            auto k = 0u;
            for(auto j = 0u; j < N; ++j) {
                if(j > i) k += int(a[i] > a[j]);
                else if(j < i) k += int(a[i] >= a[j]);
            }
            t[k] = a[i];
        }
    }

Reveal the truth

Yes, we can. And even with... negative overhead? Godbolt doesn't show any conceptual advantage.

The unrolled version:

    mov eax, DWORD PTR [rdi]
    mov ecx, DWORD PTR [rdi+4]
    xor r8d, r8d
    mov edx, DWORD PTR [rdi+8]
    cmp eax, ecx
    setg r8b
    xor esi, esi
    cmp eax, edx
    setg sil
    add esi, r8d
    xor r8d, r8d
    cmp eax, ecx
    movsx rsi, esi
    setle r8b
    mov DWORD PTR [rdi+rsi*4], eax
    xor esi, esi
    cmp ecx, edx
    setg sil
    add esi, r8d
    movsx rsi, esi
    mov DWORD PTR [rdi+rsi*4], ecx
    xor esi, esi
    cmp eax, edx
    setle sil
    xor eax, eax
    cmp ecx, edx
    setle al
    add eax, esi
    cdqe
    mov DWORD PTR [rdi+rax*4], edx
    ret

The generalised version:

    mov rax, QWORD PTR [rdi]
    mov ecx, DWORD PTR [rdi+8]
    xor esi, esi
    mov edx, eax
    shr rax, 32
    cmp edx, eax
    setg sil
    xor r8d, r8d
    cmp edx, ecx
    setg r8b
    add esi, r8d
    mov DWORD PTR [rdi+rsi*4], edx
    xor esi, esi
    cmp edx, eax
    setle sil
    xor r8d, r8d
    cmp ecx, eax
    setl r8b
    add esi, r8d
    cmp edx, ecx
    setle dl
    cmp ecx, eax
    mov DWORD PTR [rdi+rsi*4], eax
    setge al
    movzx edx, dl
    movzx eax, al
    add eax, edx
    mov DWORD PTR [rdi+rax*4], ecx
    ret

The code is not drastically different, and it has no reason to be. My theory is that this gain is accidental. Most surprisingly, it reproduces well across compilers and CPUs.

Round 3.

For a triplet, there is another std::sort substitute, and it's just three conditional swaps.

The index sort:

    template <size_t N>
    void index_sort(std::array<int, N>& t) {
        std::array<int, N> a = t;
        for(auto i = 0u; i < N; ++i) {
            auto k = 0u;
            for(auto j = 0u; j < N; ++j) {
                if(j > i) k += int(a[i] > a[j]);
                else if(j < i) k += int(a[i] >= a[j]);
            }
            t[k] = a[i];
        }
    }

The swap sort:

    void swap_sort(std::array<int, 3>& t) {
        auto sort_ = [](auto& a, auto& b) {
            if (a > b) std::swap(a, b);
        };
        sort_(t[0], t[1]);
        sort_(t[1], t[2]);
        sort_(t[0], t[1]);
    }

Reveal the truth

The conditional swaps still work faster than std::sort but not as fast as the index sort. However, that's on modern Intel. On ARMv7, branching doesn't make all that much of a difference, and the measurements are almost identical.

The code is pretty straightforward.

The index sort:

    mov rax, QWORD PTR [rdi]
    mov ecx, DWORD PTR [rdi+8]
    xor esi, esi
    mov edx, eax
    shr rax, 32
    cmp edx, eax
    setg sil
    xor r8d, r8d
    cmp edx, ecx
    setg r8b
    add esi, r8d
    mov DWORD PTR [rdi+rsi*4], edx
    xor esi, esi
    cmp edx, eax
    setle sil
    xor r8d, r8d
    cmp ecx, eax
    setl r8b
    add esi, r8d
    cmp edx, ecx
    setle dl
    cmp ecx, eax
    mov DWORD PTR [rdi+rsi*4], eax
    setge al
    movzx edx, dl
    movzx eax, al
    add eax, edx
    mov DWORD PTR [rdi+rax*4], ecx
    ret

The swap sort:

    mov eax, DWORD PTR [rdi]
    mov edx, DWORD PTR [rdi+4]
    cmp eax, edx
    jle .L2
    mov ecx, DWORD PTR [rdi+8]
    mov DWORD PTR [rdi], edx
    mov DWORD PTR [rdi+4], eax
    cmp ecx, eax
    jge .L1
    mov esi, edx
    mov edx, eax
    mov eax, esi
    .L6:
    cmp eax, ecx
    mov DWORD PTR [rdi+4], ecx
    mov DWORD PTR [rdi+8], edx
    jle .L1
    mov DWORD PTR [rdi], ecx
    mov DWORD PTR [rdi+4], eax
    ret
    .L2:
    mov ecx, DWORD PTR [rdi+8]
    cmp edx, ecx
    jg .L6
    .L1:
    rep ret

It's three pairs of compare and jump.

Round 4.

What if we do std::min/std::max instead of conditional swaps?

The conditional swaps:

    void swap_sort(std::array<int, 3>& t) {
        auto sort_ = [](auto& a, auto& b) {
            if (a > b) std::swap(a, b);
        };
        sort_(t[0], t[1]);
        sort_(t[1], t[2]);
        sort_(t[0], t[1]);
    }

The min/max version:

    void swap_sort_no_ifs (std::array<int, 3>& t) {
        auto sort_ = [](auto& a, auto& b) {
            const auto temp = std::min(a, b);
            b = std::max(a, b);
            a = temp;
        };
        sort_(t[0], t[1]);
        sort_(t[1], t[2]);
        sort_(t[0], t[1]);
    }

Reveal the truth

With GCC, there is virtually no difference.

GCC, conditional swaps:

    mov eax, DWORD PTR [rdi]
    mov edx, DWORD PTR [rdi+4]
    cmp eax, edx
    jle .L2
    mov ecx, DWORD PTR [rdi+8]
    mov DWORD PTR [rdi], edx
    mov DWORD PTR [rdi+4], eax
    cmp ecx, eax
    jge .L1
    mov esi, edx
    mov edx, eax
    mov eax, esi
    .L6:
    cmp eax, ecx
    mov DWORD PTR [rdi+4], ecx
    mov DWORD PTR [rdi+8], edx
    jle .L1
    mov DWORD PTR [rdi], ecx
    mov DWORD PTR [rdi+4], eax
    ret
    .L2:
    mov ecx, DWORD PTR [rdi+8]
    cmp edx, ecx
    jg .L6
    .L1:
    rep ret

GCC, min/max:

    mov edx, DWORD PTR [rdi+4]
    mov eax, DWORD PTR [rdi]
    cmp edx, eax
    jl .L8
    mov esi, eax
    jle .L2
    mov ecx, DWORD PTR [rdi+8]
    cmp ecx, edx
    jl .L10
    .L15:
    mov eax, edx
    jle .L4
    .L5:
    cmp eax, esi
    mov DWORD PTR [rdi+8], ecx
    jl .L12
    mov edx, esi
    jle .L6
    mov DWORD PTR [rdi+4], eax
    mov DWORD PTR [rdi], edx
    ret
    .L12:
    mov edx, eax
    .L6:
    mov eax, esi
    mov DWORD PTR [rdi], edx
    mov DWORD PTR [rdi+4], eax
    ret
    .L8:
    mov esi, edx
    .L2:
    mov ecx, DWORD PTR [rdi+8]
    mov edx, eax
    cmp ecx, edx
    jge .L15
    .L10:
    mov eax, ecx
    .L4:
    mov ecx, edx
    jmp .L5

Again, the code is accidentally different, but it's essentially the same thing. However, with Clang the difference is obvious.

Clang, conditional swaps:

    mov ecx, dword ptr [rdi]
    mov edx, dword ptr [rdi + 4]
    cmp ecx, edx
    jle .LBB0_1
    mov dword ptr [rdi], edx
    mov dword ptr [rdi + 4], ecx
    mov eax, edx
    jmp .LBB0_3
    .LBB0_1:
    mov eax, ecx
    mov ecx, edx
    .LBB0_3:
    mov edx, dword ptr [rdi + 8]
    cmp ecx, edx
    jle .LBB0_4
    mov dword ptr [rdi + 4], edx
    mov dword ptr [rdi + 8], ecx
    jmp .LBB0_6
    .LBB0_4:
    mov edx, ecx
    .LBB0_6:
    cmp eax, edx
    jle .LBB0_8
    mov dword ptr [rdi], edx
    mov dword ptr [rdi + 4], eax
    .LBB0_8:
    ret

Clang, min/max:

    mov eax, dword ptr [rdi]
    mov ecx, dword ptr [rdi + 4]
    cmp ecx, eax
    mov edx, eax
    cmovle edx, ecx
    cmp eax, ecx
    cmovl eax, ecx
    mov ecx, dword ptr [rdi + 8]
    cmp ecx, eax
    mov esi, eax
    cmovle esi, ecx
    cmp eax, ecx
    cmovl eax, ecx
    mov dword ptr [rdi + 8], eax
    cmp esi, edx
    mov eax, edx
    cmovle eax, esi
    cmp edx, esi
    cmovge esi, edx
    mov dword ptr [rdi + 4], esi
    mov dword ptr [rdi], eax
    ret

The min/max version is branchless! Clang uses conditional moves instead of conditional jumps, and this makes the min/max version five times faster on the very same CPU.

Round 5.

You don't have to rely on the compiler to make the deterministic conditional swap. There is an arithmetic trick that lets you sort a pair of numbers.

The half-sum of a pair minus the half-difference is the lower number; the half-sum plus the half-difference is the greater. Of course, they are the same when the half-difference is 0.
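As a sanity check, here is the trick spelled out on its own (a minimal sketch with a hypothetical sort_pair helper; it assumes the sum and the difference don't overflow):

```cpp
#include <cassert>
#include <cstdlib>

// min = half-sum minus half-difference, max = half-sum plus half-difference.
// Written as (sum ± |diff|) / 2 so everything stays in integers; sum - |diff|
// is 2*min and sum + |diff| is 2*max, so both divisions are exact.
void sort_pair(int& a, int& b) {
    const int sum  = a + b;
    const int diff = std::abs(a - b);
    a = (sum - diff) / 2;  // becomes the smaller of the two
    b = (sum + diff) / 2;  // becomes the greater of the two
}
```

For instance, for the pair (5, 2): the sum is 7 and the difference is 3, so a becomes (7 - 3) / 2 = 2 and b becomes (7 + 3) / 2 = 5.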

This limits the application of this sorting. With integers, you can accidentally overflow, and with floating-point numbers, you can lose precision when the numbers have different exponents. But let's just presume this works for us. Let's see which is faster.

The min/max version:

    void swap_sort_no_ifs (std::array<int, 3>& t) {
        auto sort_ = [](auto& a, auto& b) {
            const auto temp = std::min(a, b);
            b = std::max(a, b);
            a = temp;
        };
        sort_(t[0], t[1]);
        sort_(t[1], t[2]);
        sort_(t[0], t[1]);
    }

The arithmetic hack:

    void swap_sort_arithmetic_hack (std::array<int, 3>& t) {
        auto sort_ = [](auto& a, auto& b) {
            auto sum = a+b;
            auto diff = std::abs(a-b);
            a = (sum - diff) / 2;
            b = (sum + diff) / 2;
        };
        sort_(t[0], t[1]);
        sort_(t[1], t[2]);
        sort_(t[0], t[1]);
    }

Reveal the truth

The code is a little excessive, but it is deterministic. It is branchless.

The min/max version:

    mov edx, DWORD PTR [rdi+4]
    mov eax, DWORD PTR [rdi]
    cmp edx, eax
    jl .L8
    mov esi, eax
    jle .L2
    mov ecx, DWORD PTR [rdi+8]
    cmp ecx, edx
    jl .L10
    .L15:
    mov eax, edx
    jle .L4
    .L5:
    cmp eax, esi
    mov DWORD PTR [rdi+8], ecx
    jl .L12
    mov edx, esi
    jle .L6
    mov DWORD PTR [rdi+4], eax
    mov DWORD PTR [rdi], edx
    ret
    .L12:
    mov edx, eax
    .L6:
    mov eax, esi
    mov DWORD PTR [rdi], edx
    mov DWORD PTR [rdi+4], eax
    ret
    .L8:
    mov esi, edx
    .L2:
    mov ecx, DWORD PTR [rdi+8]
    mov edx, eax
    cmp ecx, edx
    jge .L15
    .L10:
    mov eax, ecx
    .L4:
    mov ecx, edx
    jmp .L5

The arithmetic hack:

    mov edx, DWORD PTR [rdi+4]
    mov eax, DWORD PTR [rdi]
    lea ecx, [rax+rdx]
    sub eax, edx
    cdq
    xor eax, edx
    sub eax, edx
    mov edx, ecx
    sub edx, eax
    add eax, ecx
    mov ecx, eax
    mov esi, edx
    shr ecx, 31
    shr esi, 31
    add eax, ecx
    mov ecx, DWORD PTR [rdi+8]
    add edx, esi
    sar eax
    sar edx
    lea r8d, [rcx+rax]
    sub eax, ecx
    mov ecx, eax
    sar ecx, 31
    xor eax, ecx
    sub eax, ecx
    mov ecx, r8d
    sub ecx, eax
    add eax, r8d
    mov esi, ecx
    shr esi, 31
    add esi, ecx
    mov ecx, eax
    shr ecx, 31
    sar esi
    add eax, ecx
    lea ecx, [rsi+rdx]
    sar eax
    mov DWORD PTR [rdi+8], eax
    mov eax, edx
    sub eax, esi
    cdq
    xor eax, edx
    sub eax, edx
    mov edx, ecx
    sub edx, eax
    mov esi, edx
    shr esi, 31
    add edx, esi
    sar edx
    add eax, ecx
    mov DWORD PTR [rdi], edx
    mov edx, eax
    shr edx, 31
    add eax, edx
    sar eax
    mov DWORD PTR [rdi+4], eax
    ret

Since we're already knee-deep in limitations, why don't we try a truly integer-specific hack?

Round 6.

I'll be honest, I stole this from somewhere on the Internet, and I don't know exactly how it works. Apparently, it works just fine on half the integer range, so it's also limited.
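That said, the mechanics can be unpacked (an annotated standalone sketch; the function names are made up): (x - y) >> 31 smears the sign bit of x - y across the whole word, producing an all-ones mask when x < y and zero otherwise, so the AND selects either x - y or 0.

```cpp
#include <cassert>
#include <climits>

// Branchless min/max of two ints via a sign mask. Right-shifting a negative
// int is implementation-defined before C++20, but mainstream compilers do an
// arithmetic shift. Like the benchmarked version, this misbehaves when x - y
// overflows, hence the half-range limitation.
int branchless_min(int x, int y) {
    const int mask = (x - y) >> (sizeof(int) * CHAR_BIT - 1);  // -1 if x < y, else 0
    return y + ((x - y) & mask);  // adds x - y only when x < y
}

int branchless_max(int x, int y) {
    const int mask = (x - y) >> (sizeof(int) * CHAR_BIT - 1);
    return x - ((x - y) & mask);  // subtracts x - y only when x < y, leaving y
}
```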

But would it be faster?

The arithmetic hack:

    void swap_sort_arithmetic_hack (std::array<int, 3>& t) {
        auto sort_ = [](auto& a, auto& b) {
            auto sum = a+b;
            auto diff = std::abs(a-b);
            a = (sum - diff) / 2;
            b = (sum + diff) / 2;
        };
        sort_(t[0], t[1]);
        sort_(t[1], t[2]);
        sort_(t[0], t[1]);
    }

The bit hack:

    void swap_sort_bit_hack (std::array<int, 3>& t) {
        auto sort_ = [](auto& x, auto& y) {
            constexpr auto shift = (sizeof(int) * CHAR_BIT - 1);
            const auto temp = y + ((x - y) & ((x - y) >> shift));
            y = x - ((x - y) & ((x - y) >> shift));
            x = temp;
        };
        sort_(t[0], t[1]);
        sort_(t[1], t[2]);
        sort_(t[0], t[1]);
    }

Reveal the truth

It is even faster than before, and it generates even better code.

The arithmetic hack:

    mov edx, DWORD PTR [rdi+4]
    mov eax, DWORD PTR [rdi]
    lea ecx, [rax+rdx]
    sub eax, edx
    cdq
    xor eax, edx
    sub eax, edx
    mov edx, ecx
    sub edx, eax
    add eax, ecx
    mov ecx, eax
    mov esi, edx
    shr ecx, 31
    shr esi, 31
    add eax, ecx
    mov ecx, DWORD PTR [rdi+8]
    add edx, esi
    sar eax
    sar edx
    lea r8d, [rcx+rax]
    sub eax, ecx
    mov ecx, eax
    sar ecx, 31
    xor eax, ecx
    sub eax, ecx
    mov ecx, r8d
    sub ecx, eax
    add eax, r8d
    mov esi, ecx
    shr esi, 31
    add esi, ecx
    mov ecx, eax
    shr ecx, 31
    sar esi
    add eax, ecx
    lea ecx, [rsi+rdx]
    sar eax
    mov DWORD PTR [rdi+8], eax
    mov eax, edx
    sub eax, esi
    cdq
    xor eax, edx
    sub eax, edx
    mov edx, ecx
    sub edx, eax
    mov esi, edx
    shr esi, 31
    add edx, esi
    sar edx
    add eax, ecx
    mov DWORD PTR [rdi], edx
    mov edx, eax
    shr edx, 31
    add eax, edx
    sar eax
    mov DWORD PTR [rdi+4], eax
    ret

The bit hack:

    mov ecx, DWORD PTR [rdi]
    mov edx, DWORD PTR [rdi+4]
    mov eax, ecx
    sub eax, edx
    mov esi, eax
    sar esi, 31
    and eax, esi
    add edx, eax
    sub ecx, eax
    mov eax, DWORD PTR [rdi+8]
    mov esi, ecx
    sub esi, eax
    mov r8d, esi
    sar r8d, 31
    and esi, r8d
    sub ecx, esi
    add eax, esi
    mov DWORD PTR [rdi+8], ecx
    mov ecx, edx
    sub ecx, eax
    mov esi, ecx
    sar esi, 31
    and ecx, esi
    sub edx, ecx
    add eax, ecx
    mov DWORD PTR [rdi+4], edx
    mov DWORD PTR [rdi], eax
    ret

Yes. It is GCC-specific though. On Clang, the best code still comes from the min/max swaps.

Bonus round.

Since we know the algorithm is supposed to work on small arrays, maybe we can help the compiler by picking the types better. I've always been curious to try uint_fast8_t somewhere.

The original index sort:

    template <size_t N>
    void index_sort(std::array<int, N>& t) {
        std::array<int, N> a = t;
        for(auto i = 0u; i < N; ++i) {
            auto k = 0u;
            for(auto j = 0u; j < N; ++j) {
                if(j > i) k += int(a[i] > a[j]);
                else if(j < i) k += int(a[i] >= a[j]);
            }
            t[k] = a[i];
        }
    }

The uint_fast8_t version:

    template <size_t N>
    void index_sort_fast_t(std::array<int, N>& t) {
        std::array<int, N> a = t;
        for(uint_fast8_t i = 0u; i < N; ++i) {
            uint_fast8_t k = 0u;
            for(uint_fast8_t j = 0u; j < N; ++j) {
                if(j > i) k += uint_fast8_t(a[i] > a[j]);
                else if(j < i) k += uint_fast8_t(a[i] >= a[j]);
            }
            t[k] = a[i];
        }
    }

Reveal the truth

There is no meaningful difference.

The original version:

    mov rax, QWORD PTR [rdi]
    mov ecx, DWORD PTR [rdi+8]
    xor esi, esi
    mov edx, eax
    shr rax, 32
    cmp edx, eax
    setg sil
    xor r8d, r8d
    cmp edx, ecx
    setg r8b
    add esi, r8d
    mov DWORD PTR [rdi+rsi*4], edx
    xor esi, esi
    cmp edx, eax
    setle sil
    xor r8d, r8d
    cmp ecx, eax
    setl r8b
    add esi, r8d
    cmp edx, ecx
    setle dl
    cmp ecx, eax
    mov DWORD PTR [rdi+rsi*4], eax
    setge al
    movzx edx, dl
    movzx eax, al
    add eax, edx
    mov DWORD PTR [rdi+rax*4], ecx
    ret

The uint_fast8_t version:

    mov rax, QWORD PTR [rdi]
    mov ecx, DWORD PTR [rdi+8]
    mov edx, eax
    shr rax, 32
    cmp edx, eax
    setg sil
    cmp edx, ecx
    setg r8b
    add esi, r8d
    cmp edx, eax
    movzx esi, sil
    mov DWORD PTR [rdi+rsi*4], edx
    setle sil
    cmp ecx, eax
    setl r8b
    add esi, r8d
    cmp edx, ecx
    setle dl
    movzx esi, sil
    cmp ecx, eax
    mov DWORD PTR [rdi+rsi*4], eax
    setge al
    add eax, edx
    movzx eax, al
    mov DWORD PTR [rdi+rax*4], ecx
    ret

The code is accidentally different, but the difference is too small to matter. This is consistent across compilers and CPUs. No magic gains here.