I'm not sure this is worth noting, given @lehins' excellent answer, but...

Why your pQuickSort doesn't work

There are two big problems with your `pQuickSort`. The first is that you're using `System.Random`, which is bog slow and interacts strangely with a parallel sort (see below). The second is that your `par u l` sparks a computation to evaluate:

```haskell
u = [x] ++ pQuicksort (n `div` 2) upper
```

to WHNF, namely `u = x : UNEVALUATED_THUNK`, so your sparks aren't doing any real work.
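One way to make the spark do real work is to force the sparked branch all the way with `force` from the `deepseq` package. The surrounding structure here is only my guess at what the original `pQuicksort` looked like, so treat this as a sketch, not the fixed original (`par` and `pseq` are imported from `GHC.Conc` so the example needs nothing beyond GHC's boot libraries; they normally come from `Control.Parallel` in the `parallel` package):

```haskell
import Control.DeepSeq (NFData, force)
import GHC.Conc (par, pseq)

-- Sketch of a fix: wrapping the sparked branch in `force` makes the
-- spark evaluate the whole sorted sublist, not just its first cons cell.
pQuicksort :: (NFData a, Ord a) => Int -> [a] -> [a]
pQuicksort _ []     = []
pQuicksort n (x:xs)
  | n <= 1    = sequentialSort (x:xs)
  | otherwise =
      let l = force $ pQuicksort (n `div` 2) (filter (< x) xs)
          u = force $ x : pQuicksort (n `div` 2) (filter (>= x) xs)
      in  u `par` (l `pseq` l ++ u)

-- Plain sequential fallback for the lower recursion layers.
sequentialSort :: Ord a => [a] -> [a]
sequentialSort []     = []
sequentialSort (x:xs) =
  sequentialSort (filter (< x) xs) ++ x : sequentialSort (filter (>= x) xs)
```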

Observing an improvement with a simple pseudo-quicksort

In fact, it's not difficult to observe a performance improvement when parallelizing a naive, not-in-place, pseudo-quicksort. As mentioned, an important consideration is to avoid using `System.Random`. With a fast LCG, we can benchmark the actual sort time, rather than some weird mixture of sort and random number generation. The following pseudo-quicksort:

```haskell
import Data.List

qsort :: Ord a => [a] -> [a]
qsort (x:xs) = let (a,b) = partition (<=x) xs
               in qsort a ++ x:qsort b
qsort [] = []

randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32

main :: IO ()
main = do
  let randints = randomList 5000000
  print . sum $ qsort randints
```

when compiled with GHC 8.6.4 and `-O2`, runs in about 9.7 seconds. The following "parallelized" version:

```haskell
qsort :: Ord a => [a] -> [a]
qsort (x:xs) = let (a,b) = partition (<=x) xs
                   a' = qsort a
                   b' = qsort b
               in (b' `par` a') ++ x:b'
qsort [] = []
```

compiled with `ghc -O2 -threaded` runs in about 11.0 seconds on one capability. Add `+RTS -N4`, and it runs in 7.1 seconds.

Ta da! An improvement.

(In contrast, the version with `System.Random` runs in about 13 seconds for the non-parallel version, about 12 seconds for the parallel version on one capability (probably just because of some minor strictness improvement), and slows down substantially for each additional capability added; the timings are erratic, too, though I'm not quite sure why.)

Splitting up partition

One obvious problem with this version is that, even with `a' = qsort a` and `b' = qsort b` running in parallel, they're tied to the same sequential `partition` call. By dividing this up into two filters:

```haskell
qsort :: Ord a => [a] -> [a]
qsort (x:xs) = let a = qsort $ filter (<=x) xs
                   b = qsort $ filter (>x) xs
               in b `par` a ++ x:b
qsort [] = []
```

we speed things up to about 5.5 seconds with `-N4`. To be fair, even the non-parallel version is actually slightly faster with two filters in place of the `partition` call, at least when sorting `Int`s. There are probably some additional optimizations that are possible with the filters compared to the `partition` that make the extra comparisons worth it.
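As a quick sanity check (my addition, not part of the original answer), the two filters compute exactly the same split as a single `partition` call; the difference is purely operational, since `partition` builds both results in one traversal while the filters make two simpler passes:

```haskell
import Data.List (partition)

-- Sanity-check sketch: the two-filter split used above produces the
-- same pair as Data.List.partition with the same predicate.
splitWithFilters :: Ord a => a -> [a] -> ([a], [a])
splitWithFilters x xs = (filter (<= x) xs, filter (> x) xs)

splitWithPartition :: Ord a => a -> [a] -> ([a], [a])
splitWithPartition x = partition (<= x)
```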

Reducing the number of sparks

Now, what you tried to do in `pQuickSort` above was to limit the parallel computations to the topmost set of recursions. Let's use the following `psort` to experiment with this:

```haskell
psort :: Ord a => Int -> [a] -> [a]
psort n (x:xs) = let a = psort (n-1) $ filter (<=x) xs
                     b = psort (n-1) $ filter (>x) xs
                 in if n > 0 then b `par` a ++ x:b else a ++ x:b
psort _ [] = []
```

This will parallelize the top `n` layers of the recursion. My particular LCG example with a seed of 1 (i.e., `iterate lcg 1`) recurses up to 54 layers, so `psort 55` should give the same performance as the fully parallel version except for the overhead of keeping track of layers. When I run it, I get a time of about 5.8 seconds with `-N4`, so the overhead is quite small.
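The recursion depth for a given input can be checked with a small helper (a hypothetical addition of mine, mirroring the filter-based recursion above rather than anything from the original answer):

```haskell
-- Hypothetical helper: maximum recursion depth the filter-based
-- pseudo-quicksort reaches on a given list. Each level peels off the
-- pivot and recurses into the two filtered halves.
sortDepth :: Ord a => [a] -> Int
sortDepth []     = 0
sortDepth (x:xs) = 1 + max (sortDepth (filter (<= x) xs))
                           (sortDepth (filter (> x) xs))
```

An already-sorted list gives the worst case (depth equal to the list length), while a balanced input gives roughly log2 of the length.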

Now, look what happens as we reduce the number of layers:

| Layers | 55  | 40  | 30  | 20  | 10  | 5   | 3   | 1    |
|--------|-----|-----|-----|-----|-----|-----|-----|------|
| time   | 5.5 | 5.6 | 5.7 | 5.4 | 7.0 | 8.9 | 9.8 | 10.2 |

Note that, at the deepest layers, there's little to be gained from parallel computation. This is mostly because the average depth of the tree is probably only around 25 layers or so, so there's only a handful of computations at 50 layers, many with weird, lop-sided partitions, and they're certainly too small to be worth parallelizing. On the flip side, there doesn't seem to be any penalty for those extra `par` calls.

Meanwhile, there are increasing gains all the way down to at least 20 layers, so artificially limiting the total number of sparks to 16 (i.e., parallelizing only the top 4 or 5 layers) is a big loss.