Merge sort is a wonderful, widely used sorting algorithm, with consistent data-independent performance. When not in-place merge sorting, that is when the source and destination array are not the same, performance is O(nlgn). When in-place, so the source and destination array are the same, performance is slightly slower: O(nlg2n). Because not-in-place sorting algorithms are faster, implementations frequently allocate memory for the destination array, copy the sorted result back into the source array, and proceed to deallocate the destination array. STL uses this type of a strategy to construct a fast in-place sort from a faster not-in-place sort whenever memory is available. Of course, when memory is not available, in-place sort is necessary, and STL also provides such an implementation.

However, the STL sort is a non-parallel implementation and runs only on a single core of today's multicore processors, utilizing just a fraction of the available computational resources. Several parallel sorting algorithms have been developed: Parallel Radix Sort, Parallel Counting Sort, and Parallel Merge Sort.

At the core of merge sort is the merge algorithm. Fast not-in-place Parallel Merge Sort was developed using a Parallel Merge that is not-in-place. Also, a Parallel In-Place Merge has been developed. Now, it's time to use this merge to construct an in-place merge sort that is parallel, and then see how well it performs.

Implementation

Listing One shows a simple implementation of In-Place Merge Sort, which calls itself recursively twice  once to sort the left half and once to sort the right half, followed by an in-place merge. This in-place merge merges the two sorted left and right half results into a single sorted result. The implementation is generic, being able to sort any data type that can use comparisons.

Listing One

template< class _Type > inline void merge_sort_in_place( _Type* a, int l, int r ) { if ( r <= l ) return; int m = (( r + l ) / 2 ); merge_sort_in_place( a, l, m ); // recursive call left half merge_sort_in_place( a, m + 1, r ); // recursive call right half merge_in_place_L1( a, l, m, r ); // merge the results }

Listing Two shows a parallel implementation, which makes the two recursive calls in parallel, followed by a parallel in-place merge. It uses Intel's open-source Threading Building Blocks (TBB) or Microsoft's Parallel Pattern Library (PPL). The use of C++ lambda functions simplifies the implementation significantly. This implementation is also generic.

Listing Two

template< class _Type > inline void p_merge_sort_in_place( _Type* a, int l, int r ) { if ( r <= l ) return; int m = (( r + l ) / 2 ); //tbb::parallel_invoke( Concurrency::parallel_invoke( [&] { p_merge_sort_in_place( a, l, m ); }, // parallel recursive call left half [&] { p_merge_sort_in_place( a, m + 1, r ); } // parallel recursive call right half ); p_merge_in_place_2( a, l, m, r ); // parallel in-place merge }

The non-parallel and parallel versions are strikingly similar. Both are fully functional algorithms and are compact. In-place merge can be found in my Parallel In-Place Merge article, Listing One for non-parallel and Listing Three for parallel. Having these in-place merge implementations makes these compact merge sort implementations possible. If you want to contrast in-place merge with a not-in-place merge sort, take a look at Listing One in Parallel Merge Sort.

Let's take a look at performance of these algorithms.

Performance

Table 1 and Figure 1 compare performance of the non-parallel in-place merge sort with the parallel implementation, both in their initial pure forms, as well as STL sort and parallel not-in-place merge sort. Sorting of arrays of 32-bit unsigned integers were benchmarked on Intel i7 3630QM quad-core CPU with hyperthreading running at 3.2 GHz, using 16 GB of system memory.

Time Per Run Number of items 10 100 1K 10K 100K 1M 10M 100M 1B In-Place Merge Sort 3.55E-07 9.00E-06 1.52E-04 2.21E-03 2.96E-02 3.7E-01 4.5E+00 5.4E+01 6.4E+02 Parallel In-Place Merge Sort 2.6E-06 1.1E-05 6.6E-05 5.7E-04 5.52E-03 5.35E-02 6.42E-01 7.62E+00 8.77E+01 STL sort 1.07E-07 2.62E-06 4.17E-05 5.80E-04 7.33E-03 8.9E-02 1.0E+00 1.2E+01 1.4E+02 Parallel Merge Sort hybrid 2.5E-07 3.6E-06 2.0E-05 1.8E-04 1.38E-03 1.42E-02 1.64E-01 1.81E+00 2.13E+01 Merge Sort / Parallel Merge Sort 0.1 0.8 2.3 3.9 5.4 7.0 7.1 7.1 7.3 STL sort / Parallel In-Place 0.0 0.2 0.6 1.0 1.3 1.7 1.6 1.6 1.6 Parallel In-Place / Parallel Not-In-Place 10.4 3.0 3.3 3.2 4.0 3.8 3.9 4.2 4.1

Table 1.



Figure 1.

Performance measurements show that the non-parallel, in-place merge sort is the slowest algorithm. Parallel implementation accelerates performance by over 7X and is faster than STL sort by 1.6X. The not-in-place parallel merge sort is the fastest merge sorting algorithm, beating the parallel in-place version by more than 4X. It also beat STL by more than 6X.

However, the parallel not-in-place merge sort is a hybrid implementation, using a non-parallel algorithm to terminate parallel recursion early, whereas the parallel in-place merge sort is not a hybrid. Let's see if the parallel in-place merge sort can be optimized.