Illustration created for “A Journey With Go”, made from the original Go Gopher, created by Renee French.

ℹ️ This article is based on Go 1.13.

Goroutines are light; they need only 2 KB of stack memory to start running. They are also cheap to switch; replacing one goroutine with another does not require many operations. Before digging into the switch itself, let’s review how switching works at a higher level.

Before continuing this article, I strongly suggest reading my article “Go: Goroutine, OS Thread and CPU Management” to understand the notions explained here.

Cases

Go schedules the goroutines onto the threads based on two kinds of breakpoints:

When a goroutine blocks: system call, mutex, or channel. The blocked goroutine is put to sleep or placed in a queue, which allows Go to schedule and run a waiting goroutine.

During a function call, at the prolog, if the goroutine has to grow its stack. This breakpoint allows Go to schedule another goroutine and avoid the running one hogging the CPU.

In both cases, g0, the goroutine that runs the scheduler, replaces the current goroutine with another one that is ready to run. Then, the chosen goroutine replaces g0 and runs on the thread.
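The first kind of breakpoint can be observed with a small program. This is only a sketch: the function name handOff and the messages are made up for illustration, and limiting the program to one P with runtime.GOMAXPROCS(1) is an assumption used to make sure only one goroutine runs at a time. The worker can only deliver its message once main has left the thread, which typically happens when main parks on the channel receive.

```go
package main

import (
	"fmt"
	"runtime"
)

// handOff (hypothetical name) records the steps seen by the calling goroutine
// while it waits for a worker goroutine on an unbuffered channel.
func handOff() []string {
	// Assumption for the demo: a single P, so goroutines never run in parallel.
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	ready := make(chan string) // unbuffered
	go func() {
		ready <- "worker ran once main left the thread"
	}()

	var order []string
	order = append(order, "main is about to block")
	// The receive cannot complete until the worker has run, and with one P
	// the worker can only run after the scheduler switched away from main.
	order = append(order, <-ready)
	order = append(order, "main resumed")
	return order
}

func main() {
	for _, step := range handOff() {
		fmt.Println(step)
	}
}
```

Only the calling goroutine appends to the slice, so the sketch has no data race; the worker communicates exclusively through the channel.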

For more information about g0, I suggest you read my article “Go: g0, Special Goroutine.”

Switching from a running goroutine to another one involves two switches:

The running g to g0 :

g0 to the next g to run:

In Go, a goroutine switch is really light. To save the state of a goroutine, it only needs two things:

The line where the goroutine stopped before being unscheduled. The current instruction to run is recorded in a program counter (PC). The goroutine will later resume at the same point.

The stack of the goroutine, in order to restore the local variables when it runs again.

Let’s see how it works in practice.

Program counter

For the sake of the example, I will use goroutines that communicate through a channel, one that produces data and some that consume them. Here is the code:

The consumers will basically print the even numbers from 0 to 99. We will focus on the first goroutine — the producer — that adds numbers to the buffer. When the buffer gets full, it will block when sending a message. At this point, Go has to switch to g0 and schedule another goroutine.

As seen previously, Go first needs to save the current instruction in order to restore the goroutine at the same instruction. The program counter ( PC ) is saved in an internal structure of the goroutine. Here is an example with the previous code:

The instructions and their addresses can be found with the command go tool objdump. Here are the instructions of producer:

The program goes instruction by instruction before blocking on the channel in the function runtime.chansend1. Go saves the current program counter to an internal property of the current goroutine. In our example, Go saves the program counter with the address 0x4268d0, which is inside the runtime, in the function runtime.chansend1:
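A program counter is simply an address inside a function, and it can be inspected from regular Go code. The sketch below uses runtime.Caller and runtime.FuncForPC from the standard library; the helper whereAmI is hypothetical, and this producer is a stripped-down stand-in for the one discussed above. It does not read the value saved by the scheduler, it only shows what such an address maps to. Inlining is disabled on both functions so that the address stays inside producer.

```go
package main

import (
	"fmt"
	"runtime"
)

// whereAmI (hypothetical helper) returns a program counter located in its
// caller, along with the name of the function that contains this address.
//
//go:noinline
func whereAmI() (uintptr, string) {
	pc, _, _, ok := runtime.Caller(1) // 1 = the caller of whereAmI
	if !ok {
		return 0, ""
	}
	fn := runtime.FuncForPC(pc)
	if fn == nil {
		return pc, ""
	}
	return pc, fn.Name()
}

// producer stands in for the function of the same name in the example program.
//
//go:noinline
func producer() (uintptr, string) {
	return whereAmI()
}

func main() {
	pc, name := producer()
	// The address varies between builds and platforms.
	fmt.Printf("pc=%#x is inside %s\n", pc, name)
}
```

The printed address can then be looked up in the output of go tool objdump for the same binary.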

Then, when g0 wakes the goroutine up, it will resume at the same instruction, looping on the values and pushing into the channel. Let’s move now to the stack management during the goroutine switch.

Stack

Before being blocked, the running goroutine has its original stack. This stack contains temporary memory like the variable i :

Then, when it blocks on the channel, the thread switches from the goroutine to g0, which runs on its own, bigger stack:

Before the switch, the stack is saved so that it can be restored when the goroutine runs again:

We now have a complete view of the different operations involved in a goroutine switch. Let’s see now how it impacts performance.

We should note that some architectures, such as arm, need to save one more register: LR, the link register.

Operations

To measure the time a switch can take, we will use the program seen previously. However, it will not give a perfect view of the performance, since the result depends on the time it takes to find the next goroutine to schedule. The way the goroutine is switched also affects the performance: a switch from a function prolog has more operations to perform than a switch from a goroutine blocking on a channel.

Let’s summarize the operations we are going to measure:

The current g blocks on the channel and switches to g0:

- PC is saved along with the stack pointer in an internal structure

- g0 is set as the running goroutine

- g0’s stack replaces the current stack

g0 looks for a new goroutine to run.

g0 has to switch with the selected goroutine:

- PC and stack pointer are extracted from its internal structure

- The program jumps to the extracted PC address

Here are some results:

The switches from g to g0 or g0 to g are the fastest phases. They contain a small, fixed number of instructions, unlike the scheduler, which checks many sources to find the next goroutine to run. This phase could take even more time, depending on the running program.

This benchmark gives an order of magnitude estimate of the performance. It should be taken with a pinch of salt, since there is no standard tool to measure this. The performance also depends on the architecture, the machine (I’m running it on my Mac with a 2.9 GHz Dual-Core Intel Core i5), and the running program.
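A rough way to get a feel for these costs on your own machine is to bounce a token between two goroutines. This is only a sketch and not the method used for the results above: the function pingPong is hypothetical, the single P set with runtime.GOMAXPROCS(1) is an assumption, and the measurement covers the whole path (channel operations, both switches, and the scheduler) without separating the three phases.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// pingPong (hypothetical name) performs n round trips between two goroutines
// over unbuffered channels and returns the total elapsed time.
func pingPong(n int) time.Duration {
	// Assumption: one P, so each hand-over requires a goroutine switch
	// on the same thread rather than work on another core.
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	ping, pong := make(chan struct{}), make(chan struct{})
	go func() {
		for range ping {
			pong <- struct{}{}
		}
	}()

	start := time.Now()
	for i := 0; i < n; i++ {
		ping <- struct{}{} // hand the token to the other goroutine
		<-pong             // wait until it comes back
	}
	elapsed := time.Since(start)

	close(ping) // lets the helper goroutine exit
	return elapsed
}

func main() {
	const rounds = 100000
	d := pingPong(rounds)
	fmt.Printf("%d round trips, about %v per round trip\n", rounds, d/time.Duration(rounds))
}
```

Each round trip involves at least two goroutine switches, so the per-round-trip figure is an upper bound on the cost of a pair of switches rather than a precise measurement of one.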