Benchmarking Go programs

Benchmarking is an important step in writing software. This may extend beyond the scope of your Go program, but attention should also be given to the Go program itself. If you’re writing APIs or if you’re implementing a parser for a specific file format - benchmarking will tell you how performant your implementation is.

Writing tests in Go is relatively simple - you create a file with the suffix _test.go , and when you run go test it will run the tests contained within. But there’s a specific feature to Go, benchmarks. You write the benchmarks in the same way as you would write tests, but change the function signature from this:

func TestScannerTextBlock(t *testing.T) {

to this:

func BenchmarkScannerTextBlock(b *testing.B) {

As you see, the Testing functions have a “Test” prefix for the function name. In comparison, the Benchmark functions have a “Benchmark” prefix. What changes is also the argument from *testing.T to *testing.B. The types provide functionality to manage tests and benchmarks.

A feature complete example of a benchmark would be this:

func BenchmarkStringJoin1(b *testing.B) { b.ReportAllocs() input := []string{"Hello", "World"} for i := 0; i < b.N; i++ { result := strings.Join(input, " ") if result != "Hello World" { b.Error("Unexpected result: " + result) } } }

And you could run your benchmark like this:

# go test -run=xxx -bench=. -cpuprofile profile_cpu.out

The above command will skip tests (unless you’ve named them ‘xxx’), run all benchmarks, and create a CPU profile for the benchmarking run.

There are a few things to note in that example:

1. Testing in the benchmark

You can write a sanity check test inside the benchmark, which can error out of the benchmark if the results might be invalid. While this falls into the testing territory, there might be errors that can be caught this way from objects that maintain a state that changes over many requests / function calls.

Ir you detect an error, just call the b.Error function and the benchmarks will stop.

2. Benchmark intensity

The usual story when benchmarking code in other programming languages would be that you would log the time taken for a loop, where you would set a loop limit by hand. Go benchmarks are smarter in that respect - they provide a b.N integer that tells the benchmark how many loops it should do, so it roughly takes one second (by default).

-test.benchtime duration approximate run time for each benchmark (default 1s)

3. Reporting Allocations

When running a benchmark, Go can log and report allocations which you make. This logging and reporting is triggered with a call to b.ReportAllocs() . The allocation information may signify a portion of your code that can be optimized in a way so it takes less memory. After running the benchmark, some information regarding B/op and allocs/op will be printed. The complete output might be similar to this:

BenchmarkStringJoin1-4 300000 4351 ns/op 32 B/op 2 allocs/op

This tells us that each iteration of the benchmark allocated on average 32 bytes, with two allocations. With this information we can set out to improve the test, hopefully allocating less memory.

func BenchmarkStringJoin2(b *testing.B) { b.ReportAllocs() input := []string{"Hello", "World"} join := func(strs []string, delim string) string { if len(strs) == 2 { return strs[0] + delim + strs[1]; } return ""; }; for i := 0; i < b.N; i++ { result := join(input, " ") if result != "Hello World" { b.Error("Unexpected result: " + result) } } }

The new benchmark produces the following results:

BenchmarkStringJoin2-4 500000 2440 ns/op 16 B/op 1 allocs/op

So, we managed to halve the allocations and speed up the function at the same time. Looking at the implementation of strings.Join , I think that maybe the optimisation comes from the compiler itself. The closure join which we introduced must be partially optimized and only the allocation for result must remain. The compiler should produce something like this:

func BenchmarkStringJoin3(b *testing.B) { b.ReportAllocs() input := []string{"Hello", "World"} for i := 0; i < b.N; i++ { result := input[0] + " " + input[1]; if result != "Hello World" { b.Error("Unexpected result: " + result) } } }

Where the result would count as one allocation. But in fact, the compiler optimizes the above code even further:

BenchmarkStringJoin3-4 30000000 37.6 ns/op 0 B/op 0 allocs/op

This makes it a bit difficult to point at a specific allocation in the code, and pin point the difference. For example, above you could say that declaring the input variable counts as an allocation - but if you move it into the for loop, the results will vary. The allocs/op for the Join1 and Join2 tests would stay the same, but the Test2, will increase in memory usage:

BenchmarkStringJoin2-4 500000 2440 ns/op 16 B/op 1 allocs/op BenchmarkStringJoin2B-4 300000 5004 ns/op 48 B/op 2 allocs/op

Compiler optimizations are great, but it’s a bit tricky to deduce where the allocations might be coming from. In fact, with added optimisations in the compiler itself, the above benchmarks might improve with time, without changing a single line of source code.

For those not faint of heart, you can run the following command and take a look at what the compiler produces - pure unadulterated assembly.

# go tool objdump app.test > app.asm

4. CPU profile with pprof

While you have some basic information about the time it took for each function to be executed, you might want to drill down when benchmarking more complex functions. This way you might figure out, which function being called takes the most time.

The benchmark command above produces an app.test executable, along with a profile_cpu.out . You can use the pprof tool from the go toolchain to analyze this output:

# go tool pprof app.test profile_cpu.out Entering interactive mode (type "help" for commands) (pprof) top10 8220ms of 10360ms total (79.34%) Dropped 63 nodes (cum <= 51.80ms) Showing top 10 nodes out of 54 (cum >= 160ms) flat flat% sum% cum cum% 2410ms 23.26% 23.26% 4960ms 47.88% runtime.concatstrings 2180ms 21.04% 44.31% 2680ms 25.87% runtime.mallocgc 1200ms 11.58% 55.89% 1200ms 11.58% runtime.memmove 530ms 5.12% 61.00% 530ms 5.12% runtime.memeqbody 530ms 5.12% 66.12% 2540ms 24.52% runtime.rawstringtmp 470ms 4.54% 70.66% 2420ms 23.36% strings.Join 390ms 3.76% 74.42% 2330ms 22.49% app.BenchmarkStringJoin3B 180ms 1.74% 76.16% 1970ms 19.02% runtime.rawstring 170ms 1.64% 77.80% 5130ms 49.52% runtime.concatstring3 160ms 1.54% 79.34% 160ms 1.54% runtime.eqstring

This doesn’t exactly show the complete call graph relationships, so you would know what functions strings.Join is calling for example. Or why only BenchmarkStringJoin3B shows up in the top 10. One way to get that is to generate it as a PDF or SVG with the same pprof tool (you will need dot installed):

# go tool pprof -svg profile_cpu.out > profile_cpu.svg # go tool pprof -pdf profile_cpu.out > profile_cpu.pdf

This will produce a callgraph PDF like this screenshot below:

As we declared several benchmarks, the output is slightly inconvenient if you’re interested only in one. You can run a specific benchmark and only get the output of that run. As Join2B was the worst benchmark, we want to drill into that one specifically:

go test -run=xxx -bench=BenchmarkStringJoin2B$ -cpuprofile profile_2b.out go test -run=xxx -bench=BenchmarkStringJoin2$ -cpuprofile profile_2.out go tool pprof -svg profile_2b.out > profile_2b.svg go tool pprof -svg profile_2.out > profile_2.svg

As the screenshot shows, the 2B benchmark is slower due to extra allocations in the benchmark function. In comparison, the benchmark with the allocation outside the loop allocates memory under runtime.concatstrings . It might still not be a great example of drilling down into the benchmark, it might be better to use flame graphs.

A flame graph gives you the option to drill down top level functions by depth. This means that you can get a full flame graph like this and then drill down.

In order to drill down, you just click on the individual functions on the bottom, that are interesting to you. This will change the graph to display only descendant functions that have been invoked inside.

Generating the graph is as simple as generating the svg/pdf with pprof, but it does require some extra tooling, namely uber/go-torch, which uses a FlameGraph library by Brendan Gregg. A short bash script that checks out all the dependencies and generates the flame graph is bundled here:

#!/bin/bash # install flamegraph scripts if [ ! -d "/opt/flamegraph" ]; then echo "Installing flamegraph (git clone)" git clone --depth=1 https://github.com/brendangregg/FlameGraph.git /opt/flamegraph fi # install go-torch using docker if [ ! -f "bin/go-torch" ]; then echo "Installing go-torch via docker" docker run --net=party --rm=true -it -v $(pwd)/bin:/go/bin golang go get github.com/uber/go-torch # or if you have go installed locally: go get github.com/uber/go-torch fi PATH="$PATH:/opt/flamegraph" bin/go-torch -b profile_cpu.out -f profile_cpu.torch.svg

You can also check out the complete code samples on this github repo. While the above gives you an idea of how benchmarking works in Go, and how to usefully implement it, I’ll follow up with a more complex example. A good way to stay updated is to enter your e-mail below, or follow me on Twitter.

While I have you here...

It would be great if you buy one of my books:

I promise you'll learn a lot more if you buy one. Buying a copy supports me writing more about similar topics. Say thank you and buy my books.

Feel free to send me an email if you want to book my time for consultancy/freelance services. I'm great at APIs, Go, Docker, VueJS and scaling services, among many other things.