Update: Continued in this post.

I recently came across the need for an incremental sorting algorithm, and started to wonder how to do it optimally.

The incremental sorting problem is described here, and is an “online” version of partial sort. That is, if you have already sorted the first k elements, you should be able to quickly sort the first k + 1 elements, and so on.

Use Cases

Incremental sorts can be useful for a number of cases:

You want sorted items, but you don’t know how many elements you’ll need. This often happens when you are filtering the resulting sequence, and you don’t know in advance how many items will be filtered out.

You are streaming the sequence, so even though you want the whole sequence, you want the first elements as quickly as possible.
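As a sketch of the first use case, here is one way (purely illustrative, and not one of the implementations measured below) to consume a sequence in sorted order only as far as a filter requires, using a heap:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Return the first `wanted` values that satisfy `pred`, in ascending order,
// without fully sorting the input. A min-heap gives O(n) setup and then
// O(log n) per element actually consumed.
std::vector<int> first_sorted_matches(std::vector<int> values,
                                      std::function<bool(int)> pred,
                                      std::size_t wanted) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap(
        std::greater<int>(), std::move(values));
    std::vector<int> out;
    while (!heap.empty() && out.size() < wanted) {
        int v = heap.top();
        heap.pop();
        if (pred(v)) out.push_back(v);  // stop as soon as we have enough
    }
    return out;
}
```

The point is the access pattern: we only pay for the part of the sorted order we actually look at.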

We’ll see how branch misprediction and other constant factors can make the naive asymptotically optimal version far slower than a practical implementation.

Measuring Performance

Measuring the performance of an incremental sort is a bit more involved than for a full sort, because for a given n we care about how fast we reach each k-th element. To keep it simple, I’ll keep n fixed at 10 million, and measure the time taken to reach each element. The input range is a vector of random integers.
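For concreteness, the measurement setup can be sketched roughly like this (the function names and the fixed seed are my own; any sorter with the begin()/end() interface used below fits):

```cpp
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Deterministic random input, so all sorters see the same data.
std::vector<int> make_input(std::size_t n) {
    std::mt19937 gen(42);  // fixed seed for comparable runs
    std::vector<int> v(n);
    for (auto& x : v) x = static_cast<int>(gen());
    return v;
}

// Milliseconds to construct the sorter and consume the first k elements.
template <typename Sorter>
double ms_to_reach(std::vector<int> data, std::size_t k) {
    auto t0 = std::chrono::steady_clock::now();
    Sorter sorter(data.begin(), data.end());
    auto it = sorter.begin();
    long long sink = 0;  // keep the loads from being optimized out
    for (std::size_t i = 0; i < k; ++i, ++it) sink += *it;
    auto t1 = std::chrono::steady_clock::now();
    volatile long long keep = sink;
    (void)keep;
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```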

Pre-Sorting

Perhaps the simplest possible solution is pre-sorting the entire range, and then just using plain iterators to iterate over the sorted range.

```cpp
template <typename I>
class pre_sorter {
public:
    pre_sorter(I i1, I i2) : first(i1), last(i2) {
        std::sort(first, last);
    }

    I begin() { return first; }
    I end() { return last; }

private:
    I first;
    I last;
};
```

Unsurprisingly, the performance of this algorithm is virtually independent of k, as all the work is done at the start. This is obviously not a good solution, but it’s a nice reference point, especially when k approaches n.

Pre-Sorting (ms)

| k | Pre-Sorter |
| --- | --- |
| 8 | 1083.18 |
| 16 | 1083.18 |
| 32 | 1083.18 |
| 64 | 1083.18 |
| 128 | 1083.18 |
| 256 | 1083.18 |
| 512 | 1083.18 |
| 1024 | 1083.18 |
| 2048 | 1083.18 |
| 4096 | 1083.18 |
| 8192 | 1083.19 |
| 16384 | 1083.19 |
| 32768 | 1083.22 |
| 65536 | 1083.24 |
| 131072 | 1083.28 |
| 262144 | 1083.36 |
| 524288 | 1083.48 |
| 1048576 | 1083.71 |
| 2097152 | 1084 |
| 4194304 | 1084.61 |
| 8388608 | 1085.75 |
| 10000000 | 1086.14 |

Partial Sorting

To avoid doing all the sorting work up front, the next logical step is trying to use something like std::partial_sort to do the work. The idea is to start by partially sorting a small, constant-sized part of the range and remember how much of the range is sorted. When you run out of sorted elements, do another partial sort.

Since a partial sort must look at all the remaining elements of the range, we can’t keep partially sorting small ranges; that would lead to quadratic complexity. Instead, we multiply the number of elements to sort by some constant each time, giving us a logarithmic number of partial sorts.

```cpp
template <typename I>
class partial_sorter {
public:
    partial_sorter(I i1, I i2) : first(i1), last(i2) {}

    class iterator {
    public:
        ...
        iterator& operator++() {
            ensure_sorted_at_current();
            ++current;
            return *this;
        }

        value_type& operator*() {
            ensure_sorted_at_current();
            return *current;
        }

    private:
        void ensure_sorted_at_current() {
            if (current == sort_end) {
                sort_end = sort_size < end - sort_end ? sort_end + sort_size : end;
                std::partial_sort(current, sort_end, end);
                sort_size *= 2;
            }
        }

        I current;
        I sort_end;
        I end;
        int sort_size = 100;
    };
    ...
};
```

This method is a huge improvement for small k, but it performs worse once k approaches 1 million.

Partial Sorting (ms)

| k | Partial Sorter | Pre-Sorter |
| --- | --- | --- |
| 8 | 7.8649 | 1083.18 |
| 16 | 7.86566 | 1083.18 |
| 32 | 7.86566 | 1083.18 |
| 64 | 7.86566 | 1083.18 |
| 128 | 15.5341 | 1083.18 |
| 256 | 15.5348 | 1083.18 |
| 512 | 23.1146 | 1083.18 |
| 1024 | 31.8237 | 1083.18 |
| 2048 | 41.0193 | 1083.18 |
| 4096 | 52.1542 | 1083.18 |
| 8192 | 68.4948 | 1083.19 |
| 16384 | 93.337 | 1083.19 |
| 32768 | 135.449 | 1083.22 |
| 65536 | 202.845 | 1083.24 |
| 131072 | 308.542 | 1083.28 |
| 262144 | 496.541 | 1083.36 |
| 524288 | 832.594 | 1083.48 |
| 1048576 | 1459.07 | 1083.71 |
| 2097152 | 2387.17 | 1084 |
| 4194304 | 3569.34 | 1084.61 |
| 8388608 | 4102.78 | 1085.75 |
| 10000000 | 4105.16 | 1086.14 |

Since the std::partial_sort algorithm costs O(n log k), the O(log n) calls to std::partial_sort have a total worst-case run time of O(n log² n). This means that in total we are doing about a factor of log n more work than we should; time to fix this.
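Summing the geometrically growing partial sorts (with the initial chunk size of 100 treated as a constant) makes the worst case concrete:

```latex
\sum_{i=0}^{O(\log n)} O\!\left(n \log\!\left(100 \cdot 2^{i}\right)\right)
= O\!\left(n \sum_{i=0}^{\log n} (i + \log 100)\right)
= O\!\left(n \log^{2} n\right)
```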

Incremental Quicksort

In a paper from 2006, Paredes and Navarro describe an algorithm which uses an incremental version of quicksort to achieve the optimal bound of O(n + k log k) for incremental sorting.

This method keeps a stack of partition points for the range. Each partition point is an iterator to a position in the range where everything before the position is smaller than everything after. The stack is sorted such that the top of the stack is the position closest to the start of the range.

This stack has the wonderful property that when you are looking for the next element in the sorted range, and your current position is not equal to the top of the stack, the next element must be between the current position and the top of the stack.
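As a tiny standalone illustration of that invariant (my own example, not part of the sorter):

```cpp
#include <algorithm>
#include <vector>

// After std::partition, everything before the returned iterator is smaller
// than everything after it -- exactly what each stack entry records.
bool partition_invariant_holds() {
    std::vector<int> v{9, 3, 7, 1, 8, 2, 6};
    const int pivot = 6;
    auto mid = std::partition(v.begin(), v.end(),
                              [=](int x) { return x < pivot; });
    // mid is a partition point: max of the left part < min of the right part.
    return *std::max_element(v.begin(), mid) < *std::min_element(mid, v.end());
}
```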

If the distance to the next partition point is large, you cut it in half with std::partition , and push the new partition point on the stack. If the distance is small, we can sort the small range with std::sort , and pop the element from the stack. A simple version is shown here:

```cpp
template <typename I>
class simple_quick_sorter {
public:
    simple_quick_sorter(I i1, I i2) : first(i1), last(i2) {}

    class iterator {
    public:
        ...
        iterator& operator++() {
            ensure_sorted_at_current();
            ++current;
            return *this;
        }

        value_type& operator*() {
            ensure_sorted_at_current();
            return *current;
        }

    private:
        void ensure_sorted_at_current() {
            if (current == sort_end) {
                while (stack.back() - sort_end > sort_limit) {
                    auto range_size = stack.back() - current;
                    value_type pivot = *(current + (mt() % range_size));
                    auto it = std::partition(current, stack.back(),
                        [=](const value_type& v) { return v < pivot; });
                    while (it == current) {
                        pivot = *(current + (mt() % range_size));
                        it = std::partition(current, stack.back(),
                            [=](const value_type& v) { return v < pivot; });
                    }
                    stack.push_back(it);
                }
                std::sort(sort_end, stack.back());
                sort_end = stack.back();
                stack.pop_back();
            }
        }

        I begin;
        I current;
        I sort_end;
        std::vector<I> stack;
        static const int sort_limit = 100;
        std::mt19937 mt{std::random_device{}()};
    };
    ...
};
```

Looking at the results for this algorithm, it’s clear that the total amount of work done when k = n is now optimal, because the time to sort the whole range is now as fast as a single call to std::sort. This has the benefit that you don’t need to worry about switching over to std::sort if you know that you’ll need a large part of the result; you can just use the incremental version anyway.

Simple Quick Sorting (ms)

| k | Partial Sorter | Pre-Sorter | Simple Quick Sorter |
| --- | --- | --- | --- |
| 8 | 7.8649 | 1083.18 | 53.6251 |
| 16 | 7.86566 | 1083.18 | 53.6259 |
| 32 | 7.86566 | 1083.18 | 53.6263 |
| 64 | 7.86566 | 1083.18 | 53.627 |
| 128 | 15.5341 | 1083.18 | 53.6305 |
| 256 | 15.5348 | 1083.18 | 53.6377 |
| 512 | 23.1146 | 1083.18 | 53.646 |
| 1024 | 31.8237 | 1083.18 | 53.6704 |
| 2048 | 41.0193 | 1083.18 | 53.7103 |
| 4096 | 52.1542 | 1083.18 | 54.2732 |
| 8192 | 68.4948 | 1083.19 | 54.7213 |
| 16384 | 93.337 | 1083.19 | 55.2367 |
| 32768 | 135.449 | 1083.22 | 56.2064 |
| 65536 | 202.845 | 1083.24 | 57.8833 |
| 131072 | 308.542 | 1083.28 | 65.7102 |
| 262144 | 496.541 | 1083.36 | 74.766 |
| 524288 | 832.594 | 1083.48 | 92.7102 |
| 1048576 | 1459.07 | 1083.71 | 142.352 |
| 2097152 | 2387.17 | 1084 | 219.68 |
| 4194304 | 3569.34 | 1084.61 | 428.108 |
| 8388608 | 4102.78 | 1085.75 | 796.165 |
| 10000000 | 4105.16 | 1086.14 | 919.948 |

The downside of this algorithm, though, is that it does spend considerably more time than std::partial_sort coming up with the first few elements. If I need between 1 and 5000 elements, partial sorting is still the way to go by the looks of it. Why is that?

Number of comparisons

Even though std::partial_sort is only guaranteed to do O(n log k) comparisons, a typical implementation will do closer to n comparisons for small k and random data. This is because the implementation typically makes a max-heap of the first k elements, and then walks through the rest of the range, inserting elements if they are smaller than the current max.

When k is small, the elements in the heap will quickly become much smaller on average than most elements in the rest of the range, so insertions into the heap become less frequent as the range is processed. That means that for most of the elements in the range, the algorithm will just do a single comparison to verify that the element should not be in the heap.
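A sketch of that strategy (an assumption about what typical standard-library implementations do, not actual library code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Heap-based partial sort: after this, the first k elements of v are the
// k smallest, in ascending order.
void heap_partial_sort(std::vector<int>& v, std::size_t k) {
    if (k == 0 || v.empty()) return;
    auto middle = v.begin() + std::min(k, v.size());
    std::make_heap(v.begin(), middle);          // max-heap of the first k
    for (auto it = middle; it != v.end(); ++it) {
        if (*it < v.front()) {                  // usually one well-predicted compare
            std::pop_heap(v.begin(), middle);   // move current max to middle - 1
            std::iter_swap(middle - 1, it);     // swap in the smaller element
            std::push_heap(v.begin(), middle);  // restore the heap
        }
    }
    std::sort_heap(v.begin(), middle);          // finally, sort the k smallest
}
```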

The incremental quicksort implementation, on the other hand, does quite a few more comparisons to find the first element. The first partition operation does n comparisons to cut the range in two, the next partition takes n/2 comparisons to cut the first half down to a quarter, and so on. In total, it does about 2n comparisons to find the first element, twice that of a good partial sort.
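That count is just a geometric series:

```latex
n + \frac{n}{2} + \frac{n}{4} + \cdots = n \sum_{i=0}^{\infty} 2^{-i} = 2n
```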

The difference in performance is much more than the difference in comparisons can explain, however, which brings us to the next point.

Branch Mispredictions

Doing few comparisons is not the only benefit of partial sort; it also has the benefit that most of the comparisons give the same result. This gives a very fast inner loop, because it avoids branch mispredictions.

For a partition operation that cuts the range in half, the number of branch mispredictions is at its peak: about every other comparison will be a misprediction. To verify that this was a big contributing factor, I tested std::partition with pivot elements that cut the range at various percentiles, to see the effect. I also compared with two hand-written loops (one with normal branching and one branchless) to see how the effects add up.
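The two hand-written loops can be sketched like this (my reconstruction of the idea, counting rather than partitioning to keep it short; the branchless version replaces the control dependency with a data dependency):

```cpp
#include <cstddef>
#include <vector>

// Branching version: the `if` is mispredicted often when the split is
// near 50/50.
std::size_t count_below_branching(const std::vector<int>& v, int pivot) {
    std::size_t n = 0;
    for (int x : v) {
        if (x < pivot)
            ++n;
    }
    return n;
}

// Branchless version: the comparison result feeds an addition, so the
// compiler can avoid a conditional jump entirely.
std::size_t count_below_branchless(const std::vector<int>& v, int pivot) {
    std::size_t n = 0;
    for (int x : v)
        n += (x < pivot);
    return n;
}
```

The branching loop wins when the comparison is predictable (very skewed splits) and loses badly near the middle, which is the pattern the measurements below show.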

We see that partition speed is heavily impacted by branch mispredictions; they seem to account for up to a 2x difference in speed.

Partition speed by pivot percentile (ms)

| % | std::partition | branching | branchless |
| --- | --- | --- | --- |
| 0 | 11.7968 | 16.0224 | 36.7653 |
| 1 | 15.7772 | 16.8826 | 33.9671 |
| 2 | 14.6577 | 17.0579 | 34.0169 |
| 3 | 17.5502 | 20.1389 | 35.739 |
| 4 | 14.3418 | 17.9083 | 31.4708 |
| 5 | 14.5596 | 18.1637 | 26.7742 |
| 6 | 15.3439 | 17.9451 | 23.1884 |
| 7 | 17.2111 | 20.018 | 27.8515 |
| 8 | 18.2778 | 21.125 | 24.2756 |
| 9 | 18.0675 | 23.2834 | 26.2831 |
| 10 | 20.2005 | 21.3348 | 23.0222 |
| 11 | 19.2368 | 22.0339 | 22.3802 |
| 12 | 18.808 | 21.2295 | 24.2208 |
| 13 | 19.1338 | 21.9753 | 22.3669 |
| 14 | 20.8068 | 26.4336 | 23.3153 |
| 15 | 24.336 | 27.0468 | 21.5044 |
| 16 | 27.4645 | 29.8541 | 23.4868 |
| 17 | 28.8201 | 31.1758 | 21.2261 |
| 18 | 28.8148 | 30.1289 | 21.0718 |
| 19 | 31.28 | 30.2661 | 20.9075 |
| 20 | 29.6518 | 31.3142 | 21.3401 |
| 21 | 29.8271 | 30.7174 | 22.6782 |
| 22 | 34.0443 | 34.7833 | 23.1051 |
| 23 | 35.8298 | 34.5951 | 22.7919 |
| 24 | 32.7097 | 35.3106 | 20.9938 |
| 25 | 31.8623 | 34.2561 | 21.1079 |
| 26 | 32.1733 | 32.9834 | 20.6646 |
| 27 | 35.5523 | 37.6043 | 21.0353 |
| 28 | 33.6041 | 35.7314 | 22.0145 |
| 29 | 35.6302 | 37.6708 | 22.6771 |
| 30 | 35.772 | 36.4316 | 20.9463 |
| 31 | 36.0035 | 36.7326 | 20.8304 |
| 32 | 36.194 | 36.2179 | 22.7599 |
| 33 | 39.0462 | 38.6444 | 31.429 |
| 34 | 39.7669 | 42.4499 | 23.0538 |
| 35 | 43.1285 | 46.0829 | 26.6244 |
| 36 | 37.2333 | 39.8859 | 20.7182 |
| 37 | 37.09 | 38.6409 | 20.9155 |
| 38 | 37.7157 | 39.1157 | 20.6312 |
| 39 | 38.1182 | 40.1368 | 20.7224 |
| 40 | 39.8334 | 40.4454 | 20.4137 |
| 41 | 40.6355 | 46.655 | 23.8106 |
| 42 | 41.9736 | 43.0969 | 24.2189 |
| 43 | 40.7712 | 40.3037 | 20.5175 |
| 44 | 42.2876 | 41.7527 | 20.1051 |
| 45 | 42.1777 | 41.9067 | 20.5757 |
| 46 | 40.6701 | 43.536 | 20.2902 |
| 47 | 41.7398 | 45.5051 | 24.6454 |
| 48 | 43.1851 | 42.8141 | 20.4784 |
| 49 | 41.3509 | 41.6771 | 22.1609 |
| 50 | 42.4328 | 40.539 | 20.1609 |
| 51 | 41.5824 | 43.2767 | 20.4134 |
| 52 | 40.3546 | 40.9339 | 26.5815 |
| 53 | 44.1628 | 46.8291 | 26.9776 |
| 54 | 40.9811 | 40.3725 | 20.4259 |
| 55 | 42.1538 | 40.4812 | 19.9891 |
| 56 | 39.5023 | 40.6363 | 20.3544 |
| 57 | 39.0777 | 40.1889 | 20.3947 |
| 58 | 37.921 | 39.2206 | 20.2054 |
| 59 | 41.3262 | 39.4255 | 20.5152 |
| 60 | 40.3523 | 38.2475 | 20.1613 |
| 61 | 39.4046 | 39.5966 | 19.9644 |
| 62 | 37.3576 | 39.0678 | 21.4956 |
| 63 | 36.8756 | 36.6924 | 20.3605 |
| 64 | 38.4406 | 36.2909 | 21.4831 |
| 65 | 36.8231 | 41.622 | 20.695 |
| 66 | 35.9872 | 35.2699 | 23.0834 |
| 67 | 36.0716 | 36.7053 | 20.3126 |
| 68 | 37.0869 | 35.4938 | 20.5791 |
| 69 | 34.4689 | 40.1645 | 21.3112 |
| 70 | 33.6399 | 34.2032 | 24.033 |
| 71 | 34.838 | 37.0558 | 23.7346 |
| 72 | 30.5919 | 29.7374 | 26.7609 |
| 73 | 33.2331 | 35.1296 | 23.0093 |
| 74 | 29.699 | 29.2873 | 24.3592 |
| 75 | 28.6084 | 28.5753 | 20.2883 |
| 76 | 29.4534 | 28.2107 | 20.1112 |
| 77 | 27.3216 | 26.8232 | 21.4511 |
| 78 | 28.1415 | 28.4012 | 25.6908 |
| 79 | 27.8272 | 30.2566 | 21.0444 |
| 80 | 25.8417 | 26.1306 | 22.6178 |
| 81 | 25.1115 | 24.6629 | 20.0351 |
| 82 | 25.311 | 26.4792 | 20.3449 |
| 83 | 25.6752 | 24.4459 | 20.0788 |
| 84 | 25.4106 | 25.7623 | 20.1503 |
| 85 | 23.3389 | 36.3533 | 27.6375 |
| 86 | 24.1425 | 29.1322 | 22.8158 |
| 87 | 20.8068 | 22.4148 | 20.2659 |
| 88 | 19.4649 | 19.4448 | 21.3067 |
| 89 | 21.5838 | 18.8339 | 19.9542 |
| 90 | 17.4586 | 17.92 | 22.1331 |
| 91 | 17.3696 | 17.5084 | 19.6497 |
| 92 | 14.3699 | 15.4465 | 24.1265 |
| 93 | 13.7956 | 15.6993 | 22.4881 |
| 94 | 14.2673 | 18.4283 | 21.9248 |
| 95 | 12.5422 | 13.2299 | 19.7614 |
| 96 | 11.3406 | 13.7781 | 20.9623 |
| 97 | 7.63467 | 9.61557 | 19.5337 |
| 98 | 6.10576 | 10.8198 | 19.3897 |
| 99 | 5.13869 | 8.40483 | 19.2806 |

Skewed Incremental Quick Sort

After looking at the branch misprediction measurements, I realized that a balanced partition operation could not be the basis for an optimal incremental sort, but a skewed one could. If we skew the first partitions towards the beginning of the range, we get both fewer comparisons and fewer branch mispredictions. The skew comes with a cost, however, as there is more work to do in total if we consistently get very skewed partitions. To counteract this effect, only the first partitions should be skewed. As we iterate further into the range, the partitions should be made less skewed, so we get about the right amount of work.

In the implementation of skewed incremental quicksort, I skewed the partitions by sampling a large number of pivots, sorting them, and choosing the pivot corresponding to the desired amount of skew.

```cpp
value_type get_pivot(int num_pivots, int pivot_idx) {
    assert(num_pivots <= 100);
    std::array<value_type, 100> pivots;
    for (int i = 0; i < num_pivots; ++i) {
        pivots[i] = *(current + (mt() % (stack.back() - current)));
    }
    std::sort(pivots.begin(), pivots.begin() + num_pivots);
    return pivots[pivot_idx];
}

value_type get_pivot() {
    auto range_size = stack.back() - current;
    if (range_size < 1000) {
        return get_pivot(3, 1);
    }
    double ratio = (current - begin) / static_cast<double>(stack.back() - begin);
    if (range_size < 10'000) {
        int idx = ratio * 10;
        idx = std::min(idx, 5);
        idx = std::max(idx, 1);
        return get_pivot(10, idx);
    }
    int idx = ratio * 10;
    idx = std::min(idx, 50);
    idx = std::max(idx, 1);
    return get_pivot(100, idx);
}
```

Skewed Quick Sorting (ms)

| k | Partial Sorter | Pre-Sorter | Simple Quick Sorter | Skewed Quick Sorter |
| --- | --- | --- | --- | --- |
| 8 | 7.8649 | 1083.18 | 53.6251 | 6.19743 |
| 16 | 7.86566 | 1083.18 | 53.6259 | 6.19819 |
| 32 | 7.86566 | 1083.18 | 53.6263 | 6.19819 |
| 64 | 7.86566 | 1083.18 | 53.627 | 6.19857 |
| 128 | 15.5341 | 1083.18 | 53.6305 | 6.20123 |
| 256 | 15.5348 | 1083.18 | 53.6377 | 6.20617 |
| 512 | 23.1146 | 1083.18 | 53.646 | 6.21415 |
| 1024 | 31.8237 | 1083.18 | 53.6704 | 6.23734 |
| 2048 | 41.0193 | 1083.18 | 53.7103 | 6.3947 |
| 4096 | 52.1542 | 1083.18 | 54.2732 | 6.48554 |
| 8192 | 68.4948 | 1083.19 | 54.7213 | 6.84245 |
| 16384 | 93.337 | 1083.19 | 55.2367 | 7.52472 |
| 32768 | 135.449 | 1083.22 | 56.2064 | 8.93866 |
| 65536 | 202.845 | 1083.24 | 57.8833 | 11.1668 |
| 131072 | 308.542 | 1083.28 | 65.7102 | 24.4606 |
| 262144 | 496.541 | 1083.36 | 74.766 | 34.4201 |
| 524288 | 832.594 | 1083.48 | 92.7102 | 53.3226 |
| 1048576 | 1459.07 | 1083.71 | 142.352 | 110.043 |
| 2097152 | 2387.17 | 1084 | 219.68 | 216.516 |
| 4194304 | 3569.34 | 1084.61 | 428.108 | 421.007 |
| 8388608 | 4102.78 | 1085.75 | 796.165 | 792.765 |
| 10000000 | 4105.16 | 1086.14 | 919.948 | 927.276 |

Summary

By combining the incremental quicksort algorithm with smart pivot selection, we can get an incremental sorting algorithm that is as fast as a partial sort for small k, and as fast as a full sort when k approaches n.

This is my first real blog post, so comments and suggestions are welcome!
