The first sensible thing to do when looking at a performance issue is to get a representative dataset that reproduces it, fire up htop and see what's up. The desired state for our program would be 100% CPU utilisation (mostly in user space) across all 32 cores from start to finish. As we can see, that is very much not the case here. We fire full blast for a few seconds, and then all cores but one stop. We pause for a second or so, and then the loop restarts. Peculiar. Poor goroutine scheduler. At this point, multiple things could have been the problem: contention on a mutex and a tight loop not being preempted by the scheduler were just two of my original suspects. To pprof, we ride!
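For a batch-style program like this, capturing a profile is straightforward with the runtime/pprof package. A minimal sketch, with wiring and file names of my own choosing rather than from the real program:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Profile the entire run. Afterwards, render the hotspots with:
	//   go tool pprof -svg ./ourprogram cpu.prof > cpu.svg
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	run() // stands in for the real record-processing loop
}

func run() {}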

The profiling tools in Go (pprof being only one of them) are really fantastic. It's extremely easy to pinpoint hotspots in your program with the CPU profiler, and to identify needless allocations via the heap profiler. Using the web SVG view, I spotted a few easy optimisations that we could make:

A function in the hot loop was calling strings.Join to join a variadic number of string arguments to build a map key. However, the argument list always had length 2 or length 3. I knew from reading my colleague Phil Pearl's blog on string concatenation that a compiler optimisation existed for string concatenation on a single line. Adding a simple length check and building the key manually saved unnecessary allocations. Psst. I hear that in Go 1.8, there is a special compiler optimisation for strings.Join as well.
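The fix, in sketch form (the separator and function names here are mine, not the real ones):

import "strings"

// Before: every call allocates the variadic []string slice before
// strings.Join even starts work, on top of the allocation for the result.
func mapKeyOld(parts ...string) string {
	return strings.Join(parts, "|")
}

// After: a single-expression concatenation compiles to one runtime call
// that makes at most one allocation, and the compiler can use a stack
// buffer when the result doesn't escape.
func mapKey2(a, b string) string {
	return a + "|" + b
}

func mapKey3(a, b, c string) string {
	return a + "|" + b + "|" + c
}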

Another function in the hot loop was returning a pointer to a value that we retrieved from a large map. This was causing the variable to be allocated on the heap every time. Removing the pointers from the structure saved another round of allocations.
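A hedged sketch of that pattern (the struct and map are stand-ins for the real ones):

type record struct{ count, total int64 }

var index map[string]record

// Before: a map element isn't addressable, so the lookup copies into v,
// and returning &v forces that copy to escape to the heap on every call.
func getOld(key string) *record {
	v := index[key]
	return &v
}

// After: returning the value itself lets the copy stay on the stack.
func get(key string) (record, bool) {
	v, ok := index[key]
	return v, ok
}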

These optimisations reduced the total number of allocations we made, reducing pressure on the GC. However, on their own they wouldn't fix the issue of poor CPU utilisation and scheduler thrashing. I proceeded to set the following GODEBUG variables to flush out issues with the GC and goroutine scheduling:

gctrace=1 — setting this will cause your program to log a line whenever a GC occurs, with helpful statistics showing the time taken by the various phases of the GC, the heap size before and after, and more

schedtrace=500 — setting this will cause your program to log a summary line of the scheduler's state every 500ms. It contains useful information about the state of the threads that your goroutines are multiplexed onto, whether they are idle, and how many goroutines are waiting to run.
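Both can be enabled at once; the traces are written to stderr (the binary name here is a stand-in):

GODEBUG=gctrace=1,schedtrace=500 ./ourprogram 2>trace.log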

The combined scheduler and GC trace below (some lines have been edited out for clarity) showed that the majority of my non-runtime threads sat idle for a second or two just before a GC occurred, marrying up perfectly with the CPU graphs. A smoking gun.



SCHED 307314ms: gomaxprocs=32 idleprocs=0 threads=75 spinningthreads=0 idlethreads=38 runqueue=3 [1 2 1 1 1 1 0 2 4 1 0 1 0 4 0 0 3 0 8 1 0 0 5 2 0 0 2 0 1 1 1 3]
SCHED 307816ms: gomaxprocs=32 idleprocs=0 threads=75 spinningthreads=0 idlethreads=38 runqueue=10 [1 1 0 1 7 1 0 2 1 0 0 0 0 0 0 3 6 0 3 0 1 2 0 4 1 2 3 0 1 0 0 0]
SCHED 308319ms: gomaxprocs=32 idleprocs=0 threads=75 spinningthreads=0 idlethreads=38 runqueue=16 [1 0 0 1 1 1 4 0 4 1 0 0 2 2 0 2 0 3 0 1 0 0 0 0 7 3 2 0 1 1 7 0]
SCHED 308823ms: gomaxprocs=32 idleprocs=27 threads=75 spinningthreads=0 idlethreads=65 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 313372ms: gomaxprocs=32 idleprocs=31 threads=75 spinningthreads=0 idlethreads=69 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
gc 30 @308.778s 0%: 0.89+5809+2.8 ms clock, 28+9849/7842/385+92 ms cpu, 11441->11607->5897 MB, 11734 MB goal, 32 P
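Reading that gc line: the clock numbers 0.89+5809+2.8 ms are the stop-the-world sweep termination, concurrent mark, and stop-the-world mark termination phases, so GC 30 spent almost six seconds marking. The cpu numbers break the mark phase into 9849/7842/385 ms of mutator assist, background worker, and idle-time marking, which suggests our worker goroutines were being pulled into GC assists rather than doing useful work. And 11441->11607->5897 MB is the heap size at GC start, the heap size at GC end, and the live data remaining afterwards.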

Go manages memory for you via garbage collection. It prioritises consistently low GC pause latency, at the expense of higher CPU usage. To be honest, I didn't know a huge amount about the inner workings of the Go garbage collector. In production, GC takes up so little time that it isn't worth caring about, so I'd never concerned myself with it. Reading the docs, I found that Go does give you one knob to tune for garbage collection: GOGC. As the documentation puts it, 'the GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage'. Richard Hudson describes the tradeoff perfectly: 'If you want to lower the total time spent in GC, increase GOGC. If you want to trade more GC time for less memory, lower GOGC'.
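To make the ratio concrete: after a collection, the next heap goal is roughly live heap × (1 + GOGC/100). In the trace above, 5897 MB of live data gave an 11734 MB goal, which lines up with the default GOGC=100; at GOGC=500 the same live set would let the heap grow to roughly 35 GB before the next collection, so the GC runs a fraction as often. It's just an environment variable (binary name again a stand-in):

GOGC=500 ./ourprogram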

Luckily, we have plenty of RAM to trade on this machine. Setting GOGC to 500 had an instant effect, increasing throughput significantly whilst using around 70GB of RAM at peak. The reduction in allocations, coupled with the increased GOGC, brought the time per record down from 300μs to 110μs, a huge improvement! Building with Go 1.8RC3 instead of Go 1.6 took our time down a further 25μs to 80μs per record. I should have tried that at the start, but hey.