Need some help with analyzing whether algorithm A performs faster than B? Or can I know whether method A performs better than method B? Or is approach A to send JSON response to the client performs better than approach B? How to find answers for everyday performance related questions?

These questions can be answered easily if you are happened to program in Golang. Golang comes bundled with facilities like benchmarking and profiling which helps achieve performance oriented development easier than ever before.

Lets take a look at this simple code snippet which has http handler to parse the query parameter -> convert them to numbers -> add them -> send html result back to the client using a template. What follows the code is also its benchmark and the simple template used.

Running the benchmark

$ go test -run=x -bench=.

Now lets perform CPU profiling to identify the CPU hotspots

go test -run=^$ -bench=. -cpuprofile=cpu.out

This will create an executable that ends with .test and the profile information will be stored in profile.cpu

$ls

bench.test cpu.out simple_add.go simple_add_test.go template.html

lets look at the type of bench.test

$file bench.test

bench.test: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), not stripped

Before we begin our optimization lets run the benchmarks again and save its results into a file for later comparison.

$ go test -run=xxx -bench=. | tee bench0

Now lets analyse the CPU profile

go tool pprof bench.test cpu.out

It opens up an interactive cpu profiler

You can check the top20 consuming functions

(pprof) top20

1340ms of 2230ms total (60.09%)

Dropped 78 nodes (cum <= 11.15ms)

Showing top 20 nodes out of 154 (cum >= 220ms)

flat flat% sum% cum cum%

320ms 14.35% 14.35% 320ms 14.35% runtime/internal/atomic.Xchg

120ms 5.38% 19.73% 120ms 5.38% runtime/internal/atomic.Xadd

110ms 4.93% 24.66% 850ms 38.12% runtime.findrunnable

70ms 3.14% 27.80% 350ms 15.70% runtime.mallocgc

70ms 3.14% 30.94% 360ms 16.14% runtime.mapassign1

70ms 3.14% 34.08% 70ms 3.14% runtime.usleep

50ms 2.24% 36.32% 50ms 2.24% runtime.acquirep1

50ms 2.24% 38.57% 50ms 2.24% runtime.heapBitsSetType

50ms 2.24% 40.81% 80ms 3.59% runtime.heapBitsSweepSpan

50ms 2.24% 43.05% 90ms 4.04% runtime.scanobject

50ms 2.24% 45.29% 50ms 2.24% runtime.stringiter2

50ms 2.24% 47.53% 60ms 2.69% syscall.Syscall

40ms 1.79% 49.33% 40ms 1.79% runtime.memclr

40ms 1.79% 51.12% 40ms 1.79% runtime.memmove

40ms 1.79% 52.91% 230ms 10.31% runtime.newarray

40ms 1.79% 54.71% 100ms 4.48% text/template.goodName

30ms 1.35% 56.05% 30ms 1.35% runtime.(*mspan).sweep.func1

30ms 1.35% 57.40% 70ms 3.14% runtime.evacuate

30ms 1.35% 58.74% 30ms 1.35% runtime.releasep

30ms 1.35% 60.09% 220ms 9.87% runtime.unlock

Lets see the CPU consumption by cumulative time

(pprof) top --cum

0.13s of 2.23s total ( 5.83%)

Dropped 78 nodes (cum <= 0.01s)

Showing top 10 nodes out of 154 (cum >= 0.68s)

flat flat% sum% cum cum%

0 0% 0% 1.33s 59.64% runtime.goexit

0 0% 0% 1.11s 49.78% _/home/hackintosh/bench.BenchmarkHandleStructAdd

0 0% 0% 1.11s 49.78% testing.(*B).launch

0 0% 0% 1.11s 49.78% testing.(*B).runN

0.01s 0.45% 0.45% 1.10s 49.33% _/home/hackintosh/bench.handleStructAdd

0 0% 0.45% 0.89s 39.91% runtime.mcall

0.01s 0.45% 0.9% 0.87s 39.01% runtime.schedule

0.11s 4.93% 5.83% 0.85s 38.12% runtime.findrunnable

0 0% 5.83% 0.84s 37.67% runtime.goexit0

0 0% 5.83% 0.68s 30.49% html/template.(*Template).Execute

Cumulative time is useful in understanding the execution tree and its cumulative CPU usage

The most useful feature is to be able to get the execution tree with its CPU usage in the form a image

web

Or if you wish to get the result in a pdf you could try the following the command

go tool pprof — pdf bench.test cpu.out > cpu0.pdf

If you desire to have detailed look into it here’s the link to the pdf

Lets zoom into the tree and see the places where CPU has been utilized substantially

Now if you trace the lines branching out from benchmark.HandleStructAdd almost one third of the cumulative time on the function is being spent on parsing the file.