This post is a follow up of Parallel Merge Sort in Java. In this previous post, we saw a possible implementation of the merge sort algorithm using Java ForkJoinPool .

In a nutshell, the solution was based on:

A thread pool to limit the number of threads

A work-stealing strategy in order to optimize the distribution of the workload

Interestingly enough, the solution in Go leveraging parallelism is based on the same high-level concepts.

Sequential Implementation

The sequential implementation of the merge sort algorithm is the following:

If you are not familiar with Go, please note that s is a slice (basically, a dynamically-sized version of an array). Each sub-call to mergesort is made by creating another slice which is just a view of the initial slice (and not a copy). The first call is done on the first half of the slice whereas the second call is done on the second half. At the end, the merge function merges both halves and make sure that s becomes sorted.

Let’s run a benchmark to give us a starting point. The test is running on an Intel Core i7–7700 at 3.60GHz with 8 virtual CPU cores (because of Hyper-threading technology) on a slice composed of 1 million elements.

benchSequential avgt 129.8 ms/op

Parallel Implementation

Pseudocode

In pseudocode, the parallel solution could be something like that:

mergesort(s []int) {

if len(s) > 1 {

middle := len(s) / 2 // Create two sub-tasks

createTask -> mergesort(s[:middle])

createTask -> mergesort(s[middle:]) merge(s, middle)

}

}

Instead of creating one thread per mergesort call, the main idea is to handle parallelism through a thread pool and to distribute the workload. A task is wrapping the intention of applying mergesort on a sub-part of the slice. Every task becomes then available for the threads of the pool.

Concurrency/Parallelism in Go

In Go, a developer can’t instantiate by himself a thread. The main primitive for concurrency/parallelism is the goroutine.

A goroutine can be pictured as an application-level thread. Instead of being context-switched on and off a CPU core, a goroutine is context-switched on and off an OS thread. The magic is handled by the Go scheduler which runs at user space, just like our application.

The main task of any scheduler is to distribute a workload. In Go, the scheduler distributes the goroutines over multiple threads. There are two important points to mention though.

First, it is worth understanding that compared to the OS scheduler which is a preemptive system (interruption every x ms), the Go scheduler is cooperative. It means a goroutine will be context-switched off an OS thread only on purpose (for example in the case of a blocking I/O call). This will prevent multiple context-switches which may occur before to complete a given job.

The second point is about the Go scheduling strategy. The two main strategies are the following:

Work-sharing: When a processor generates new threads, it attempts to migrate some of them to the other processors with the hopes of them being utilized by the idle/underutilized processors. Work-stealing: An underutilized processor actively looks for other processor’s threads and “steal” some. JBD in Go’s work-stealing scheduler

Go’s scheduler is based on the second strategy, work-stealing. The main benefit of this strategy is to reduce the frequency of goroutines migration (from one thread/CPU core to another). The migration is only considered when a thread is idle (no more goroutines to handle or every goroutine is blocked due to an I/O call for example) and this is going to have a positive impact performance-wise.

As a summary:

Cooperative system: a goroutine is context-switched off an OS thread purposely

The Go runtime is based on a work-stealing strategy to optimize the distribution of the goroutines

First version

To implement a parallel version of the merge sort algorithm, we need to delegate the management of the two halves in two separated goroutines. Running a goroutine through go keyword is a pure asynchronous task so we need to wait for the completion of both goroutines before to run merge .

Let’s write a v1:

The wait operation is managed by var wg sync.WaitGroup . Before to trigger the goroutines, we configure it to wait for 2 wg.Done() calls.

Let’s run a benchmark of this implementation:

benchSequential avgt 129.8 ms/op

benchParallelv1 avgt 438.8 ms/op

Our v1 is about 3 times slower than the sequential implementation. If you have read the previous post, you may have already noticed that we are probably facing the very same problem that we already described.

If we have an initial slice of 1024 elements mergesort will be applied to 512 elements, then 256, then 128 and so on until reaching 1 element. It means even for running mergesort on a single element, we are going to create a dedicated goroutine for that.

It’s true that goroutines are more lightweight and faster to start than threads. Yet, creating a goroutine for a small slice and having to wait for the Go runtime to schedule it is way more expensive than calling the sequential implementation in the same goroutine.

Let’s run go trace to check our assumption on a one millisecond time range: