Illustration created for “A Journey With Go”, made from the original Go Gopher, created by Renee French.

Goroutines are lightweight and can often look like the solution for improving our applications. Unfortunately, misusing goroutines can decrease the performance of your application, since a goroutine context switch has a cost.

Project context and benchmarks

In my company, PropertyFinder, a real estate portal in the UAE, my squad works on a microservice responsible for finding potential opportunities for our clients, the real estate brokers. Let's summarize a part of this Go microservice as an algorithm:

Variables:

    lead struct

Start:
    // a profile of this lead is built based on its interests
    MakeProfile(lead)
    // we get all listings that fit this kind of profile
    listings <- GetListings()
    For each chunk of 1000 in listings:
        Start goroutine:
            For each listing in chunk:
                score <- CalculateMatching(listing, lead)
                Add the score to the bulk object
            Bulk insert the 1000 scores of the chunk
Stop
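The algorithm above can be sketched in Go. This is a minimal, hypothetical version: the `Lead` and `Listing` types, `CalculateMatching`, and the bulk insert are stand-ins for the service's real code, which is not shown in the article.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Hypothetical types standing in for the service's real data model.
type Lead struct{ ID int }
type Listing struct{ ID int }

// CalculateMatching stands in for the purely CPU-bound scoring step.
func CalculateMatching(l Listing, lead Lead) float64 {
	return float64((l.ID*31+lead.ID)%100) / 100
}

// processListings scores listings in chunks, one goroutine per chunk,
// and returns the total number of scores "bulk inserted".
func processListings(lead Lead, listings []Listing, chunkSize int) int {
	var inserted int64
	var wg sync.WaitGroup
	for start := 0; start < len(listings); start += chunkSize {
		end := start + chunkSize
		if end > len(listings) {
			end = len(listings)
		}
		chunk := listings[start:end]
		wg.Add(1)
		go func(chunk []Listing) {
			defer wg.Done()
			scores := make([]float64, 0, len(chunk))
			for _, l := range chunk {
				scores = append(scores, CalculateMatching(l, lead))
			}
			// stands in for the bulk insert network call
			atomic.AddInt64(&inserted, int64(len(scores)))
		}(chunk)
	}
	wg.Wait()
	return int(inserted)
}

func main() {
	listings := make([]Listing, 10000)
	for i := range listings {
		listings[i] = Listing{ID: i}
	}
	fmt.Println(processListings(Lead{ID: 1}, listings, 1000))
}
```

Each goroutine owns its chunk and its batch of scores, so no synchronization is needed beyond the `WaitGroup` and the counter that replaces the network call here.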

Since listings can reach 10k or more here, we decided to create one goroutine per chunk of 1000. Here is our benchmark for calculating and recording 10k matchings:

name                                       time/op
LeadMatchingGenerationFor10000Matches-4    626ms ± 6%
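A benchmark like this can be written with the standard `testing` package. The sketch below is hypothetical; `generateMatches` merely stands in for the real scoring and bulk-insert pipeline under test.

```go
package main

import (
	"fmt"
	"testing"
)

// generateMatches stands in for the real scoring + bulk-insert pipeline.
func generateMatches(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i
	}
	return sum
}

func BenchmarkLeadMatchingGenerationFor10000Matches(b *testing.B) {
	for i := 0; i < b.N; i++ {
		generateMatches(10000)
	}
}

func main() {
	// testing.Benchmark lets us run the benchmark outside `go test`.
	r := testing.Benchmark(BenchmarkLeadMatchingGenerationFor10000Matches)
	fmt.Println(r.N > 0)
}
```

In the real project this would live in a `_test.go` file and be run with `go test -bench=LeadMatchingGenerationFor10000Matches`.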

Let's try to add more goroutines for the calculation part. Here is the change we could make to our code:

// we get all listings that fit this kind of profile
listings <- GetListings()
For each chunk of 1000 in listings:
    Start goroutine:
        For each listing in chunk:
            Start goroutine:
                score <- CalculateMatching(listing, lead)
                Add the score to the bulk object
        Bulk insert the 1000 scores
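In Go, the inner change amounts to spawning one goroutine per listing inside each chunk. Again a hypothetical sketch, with the same stand-in types and a channel to collect the scores before the single bulk insert:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for the service's real types and scoring.
type Lead struct{ ID int }
type Listing struct{ ID int }

func CalculateMatching(l Listing, lead Lead) float64 {
	return float64((l.ID*31+lead.ID)%100) / 100
}

// processChunk starts one goroutine per listing; scores are collected
// through a buffered channel before the single bulk insert.
func processChunk(lead Lead, chunk []Listing) int {
	scores := make(chan float64, len(chunk))
	var wg sync.WaitGroup
	for _, l := range chunk {
		wg.Add(1)
		go func(l Listing) { // one goroutine per scoring calculation
			defer wg.Done()
			scores <- CalculateMatching(l, lead)
		}(l)
	}
	wg.Wait()
	close(scores)
	bulk := make([]float64, 0, len(chunk))
	for s := range scores {
		bulk = append(bulk, s)
	}
	// stands in for the bulk insert of the chunk's scores
	return len(bulk)
}

func main() {
	chunk := make([]Listing, 1000)
	for i := range chunk {
		chunk[i] = Listing{ID: i}
	}
	fmt.Println(processChunk(Lead{ID: 1}, chunk))
}
```

Note how much coordination machinery this adds per chunk compared to the first version; the benchmark below shows it buys nothing for CPU-bound scoring.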

And let’s run the benchmark again:

name                                       time/op
LeadMatchingGenerationFor10000Matches-4    698ms ± 4%

It is now 11% slower, but that was expected. Indeed, the score calculation is pure math, so the new goroutines give the Go scheduler no pauses (e.g. system calls) to take advantage of.

If you are not familiar with the Go scheduler and goroutine context switching, I strongly recommend you read William Kennedy's article about context switching.

Goroutine latencies due to context-switch

We will now analyze how goroutines are run by the Go scheduler. Let's start with our first algorithm, with one goroutine per chunk of 1000, and run the benchmark with the Go scheduler's trace enabled:

GODEBUG=schedtrace=1 go test ./... -run=^$ -bench=LeadMatchingGenerationFor10000Matches -benchtime=1ns

The value schedtrace=1 prints the Go scheduler's events every 1ms. Here is the part of the trace covering the goroutines created above, along with the idle processors:

gomaxprocs=2 idleprocs=1 runqueue=0 [0 0]

gomaxprocs=2 idleprocs=1 runqueue=0 [0 0]

gomaxprocs=2 idleprocs=1 runqueue=0 [0 0]

gomaxprocs=2 idleprocs=0 runqueue=1 [0 0]

gomaxprocs=2 idleprocs=2 runqueue=0 [0 0]

gomaxprocs=2 idleprocs=2 runqueue=0 [0 0]

gomaxprocs displays the number of available processors, idleprocs the number of idle processors, and runqueue the number of goroutines waiting to be dispatched in the global queue; the brackets show the number of goroutines waiting to run per processor ( [0 0] ). You can find more information about their meaning in the Go wiki page dedicated to performance.

We clearly see there is no overwhelming usage of goroutines. We also see that our processors are not always busy, and we could wonder whether we should add more goroutines to take advantage of those free resources. Let's run the same profiling on the second algorithm, with one goroutine per scoring calculation:

gomaxprocs=2 idleprocs=0 runqueue=645 [116 186]

gomaxprocs=2 idleprocs=0 runqueue=514 [77 104]

gomaxprocs=2 idleprocs=0 runqueue=382 [57 64]

gomaxprocs=2 idleprocs=0 runqueue=124 [57 88]

gomaxprocs=2 idleprocs=0 runqueue=0 [28 17]

gomaxprocs=2 idleprocs=1 runqueue=0 [0 0]

Now we can see goroutines stacking up in the global and local queues, which keeps our processors busy. However, our processors quickly go idle again. The execution tracer explains why:

[Tracer screenshot: goroutine profiling with the tracer, showing goroutine 209 waiting for a server response]

As we can see, most of our goroutines are waiting for a response from the server at the moment of the bulk record. This is where we should focus our improvements and take advantage of this waiting time, and it explains why we created a goroutine to record each batch of 1000 documents.

We also now understand that adding goroutines to the calculation only overwhelms the application with no gain. Since the calculation contains no pauses that would let another goroutine run while the current one waits, the time spent switching to another goroutine is simply wasted.
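A common pattern for CPU-bound work that follows from this observation (an alternative sketch, not taken from the service itself) is to cap the number of worker goroutines at GOMAXPROCS, so each processor stays busy without paying for extra context switches:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// scoreAll fans CPU-bound work out to at most GOMAXPROCS workers,
// avoiding the context-switch overhead of one goroutine per item.
func scoreAll(items []int) []int {
	workers := runtime.GOMAXPROCS(0) // number of available processors
	out := make([]int, len(items))
	jobs := make(chan int) // indexes to score
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out[i] = items[i] * items[i] // placeholder for the scoring math
			}
		}()
	}
	for i := range items {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	fmt.Println(scoreAll([]int{1, 2, 3, 4})) // [1 4 9 16]
}
```

Each worker writes to a distinct index of `out`, so the shared slice needs no locking; only the job channel coordinates the goroutines.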

Going deeper with goroutines

The Go scheduler and concurrency are important things to understand in order to improve the way you create your goroutines. One of the best online resources is probably William Kennedy's series about scheduling in Go. I strongly recommend you read it.