Illustration created for “A Journey With Go”, made from the original Go Gopher, created by Renee French.

Moving a goroutine from one OS thread to another has a cost and can slow down an application if it happens too often. Over time, however, the Go scheduler has addressed this issue: it now provides affinity between goroutines and threads when they work concurrently. Let's go back a few years to understand this improvement.

Original Issue

In the early days of Go — versions 1.0 and 1.1 — the language suffered degraded performance when running concurrent code with more OS threads, i.e., a higher value of GOMAXPROCS. Let's start with an example from the documentation that calculates prime numbers:

Here is the benchmark with Go 1.0.3 when calculating the first hundred thousand prime numbers with multiple values of GOMAXPROCS:

name     time/op
Sieve    19.2s ± 0%
Sieve-2  19.3s ± 0%
Sieve-4  20.4s ± 0%
Sieve-8  20.4s ± 0%

To understand these results, we need to understand how the scheduler was designed at that time. In the first version of Go, the scheduler had only one global queue, from which all threads could push and get goroutines. Here is an example of an application running with a maximum of two OS threads — M on the following schema — defined by setting GOMAXPROCS to two:

The first version of scheduler had only one global queue

Having only one queue gives no guarantee that a goroutine will resume on the same thread: the first thread that is ready picks up a waiting goroutine and runs it. Goroutines therefore move from thread to thread, which is costly in terms of performance. Here is an example with a blocking channel:

Goroutine #7 blocks on the channel and is waiting for a message. Once the message is received, the goroutine is pushed to the global queue:

Then, the channel pushes messages, and goroutine #X will run on an available thread while goroutine #8 blocks on the channel:

The goroutine #7 now runs on the available thread:

The goroutines now run on different threads. Having a single global queue also forces the scheduler to have a single global mutex covering all goroutine scheduling operations. Here is the CPU profile captured with pprof with GOMAXPROCS set to eight:

Total: 8679 samples
3700  42.6%  42.6%  3700  42.6%  runtime.procyield
1055  12.2%  54.8%  1055  12.2%  runtime.xchg
 753   8.7%  63.5%  1590  18.3%  runtime.chanrecv
 677   7.8%  71.3%   677   7.8%  dequeue
 438   5.0%  76.3%   438   5.0%  runtime.futex
 367   4.2%  80.5%  5924  68.3%  main.filter
 234   2.7%  83.2%  5005  57.7%  runtime.lock
 230   2.7%  85.9%  3933  45.3%  runtime.chansend
 214   2.5%  88.4%   214   2.5%  runtime.osyield
 150   1.7%  90.1%   150   1.7%  runtime.cas

procyield, xchg, futex, and lock are all related to the global mutex of the Go scheduler. We can clearly see that the application spends most of its time on locking.

These issues prevent Go from taking advantage of multiple processors; they were addressed in Go 1.1 with a new scheduler.

Affinity in concurrency

Go 1.1 came with a new scheduler implementation and the creation of local goroutine queues. This improvement avoids locking the entire scheduler when local goroutines are available and lets them keep working on the same OS thread.

Since threads can block on system calls and the number of blocked threads is not limited, Go introduced the notion of processors. A processor P represents a running OS thread and manages a local queue of goroutines. Here is the new schema:

Here is the new benchmark with the new scheduler in Go 1.1.2:

name     time/op
Sieve    18.7s ± 0%
Sieve-2  8.26s ± 0%
Sieve-4  3.30s ± 0%
Sieve-8  2.64s ± 0%

Go now really takes advantage of all the available CPUs. The CPU profile has also changed:

Total: 630 samples
163  25.9%  25.9%  163  25.9%  runtime.xchg
113  17.9%  43.8%  610  96.8%  main.filter
 93  14.8%  58.6%  265  42.1%  runtime.chanrecv
 87  13.8%  72.4%  206  32.7%  runtime.chansend
 72  11.4%  83.8%   72  11.4%  dequeue
 19   3.0%  86.8%   19   3.0%  runtime.memcopy64
 17   2.7%  89.5%  225  35.7%  runtime.chansend1
 16   2.5%  92.1%  280  44.4%  runtime.chanrecv2
 12   1.9%  94.0%  141  22.4%  runtime.lock
  9   1.4%  95.4%   98  15.6%  runqput

Most of the operations associated with the lock have been removed; the ones marked chanXXXX relate only to the channels. However, while the scheduler has improved the affinity between goroutines and threads, there are still cases where this affinity can be reduced.

Affinity limitation

To understand the limits of this affinity, we have to understand what goes to the local and global queues. The local queue is used for all operations except system calls, including blocking operations on channels and selects, waiting on timers, and locks. However, two features can restrict the affinity between a goroutine and a thread:

Work-stealing. When a processor P does not have enough work in its local queue, it will steal goroutines from other Ps if the global queue and the network poller are empty. Stolen goroutines then run on another thread.

System calls. When a syscall occurs (e.g. file operations, HTTP calls, database operations, etc.), Go moves the running OS thread into a blocking mode, letting a new thread process the local queue on the current P.

However, by better managing the priority of the local queue, these two limitations can be avoided. Go 1.5 gives more priority to goroutines communicating back and forth on a channel, thus optimizing affinity with their assigned thread.

Ordering to enhance affinity

A goroutine communicating back and forth on a channel blocks frequently, i.e., it is frequently re-queued in the local queue, as seen previously. However, since the local queue is a FIFO, unblocked goroutines have no guarantee of running as soon as possible if another goroutine is hogging the thread. Here is an example with a goroutine that is now runnable and was previously blocked on a channel:

Goroutine #9 resumes after being blocked on the channel. However, it has to wait for #2, #5, and #4 before running. In this example, goroutine #5 will hog its thread, delaying goroutine #9 and putting it at risk of being stolen by another processor. Since Go 1.5, goroutines returning from a blocking channel run in priority thanks to a special attribute of their P:

Goroutine #9 is now marked as the next one runnable. This new prioritization allows the goroutine to run quickly before being blocked on the channel again; the other goroutines then get their running time. This change had an overall positive effect on the Go standard library, improving the performance of some packages.