I've been following the commits to the Go project for some time now. Occasionally someone will post a commit with benchmarks showing how much the commit improves performance along some axis or another. In this commit, they've increased the performance of division by 7 (a notoriously tricky number to divide by) by about 40% on machines running the ARM instruction set. I'm not 100% sure what the commit does, but it switches around the instructions that get generated when you do long division on ARM in a way that makes things faster.

Anyway, I wanted to learn how to do benchmarks, and practice making something faster. Some motivation came along when Uber released their go-zap logging library, with a set of benchmarks showing my friend Alan's logging library as the worst performer. So I thought it would be a good candidate for benchmarking.

Fortunately Alan has already included a set of benchmarks in the test suite. You can run them by cloning the project and then calling the following:

go test -run=^$ -bench=.

You need to pass -run=^$ to exclude all tests in the test suite, otherwise all of the tests will run and also all of the benchmarks. We only care about the benchmarks, so -run=^$ filters out every argument.

We get some output!

BenchmarkStreamNoCtx-4 300000 5735 ns/op BenchmarkDiscard-4 2000000 856 ns/op BenchmarkCallerFileHandler-4 1000000 1980 ns/op BenchmarkCallerFuncHandler-4 1000000 1864 ns/op BenchmarkLogfmtNoCtx-4 500000 3866 ns/op BenchmarkJsonNoCtx-4 1000000 1816 ns/op BenchmarkMultiLevelFilter-4 2000000 1015 ns/op BenchmarkDescendant1-4 2000000 857 ns/op BenchmarkDescendant2-4 2000000 872 ns/op BenchmarkDescendant4-4 2000000 1029 ns/op BenchmarkDescendant8-4 1000000 1018 ns/op BenchmarkLog15AddingFields-4 50000 29492 ns/op BenchmarkLog15WithAccumulatedContext-4 50000 33599 ns/op BenchmarkLog15WithoutFields-4 200000 9417 ns/op PASS ok github.com/inconshreveable/log15 30.083s

The name on the left is the benchmark name. The number (4) is the number of CPU's used for the benchmark. The number in the middle is the number of test runs; to get a good benchmark, you want to run a thing as many times as feasible, then divide the total runtime by the number of test runs. Otherwise you run into problems like coordinated omission and weighting the extreme outliers too much, or failing to weight them at all.

To get a "good" benchmark, you want to try to isolate the running code from anything else running on the machine. For example, say you run the tip benchmarks while a Youtube video is playing, make a change, and then run the benchmarks while nothing is playing. Video players require a lot of CPU/RAM to play videos, and all other things being equal, the benchmark is going to be worse with Youtube running. There are a lot of ways to accidentally bias the results as well, for example you might get bored with the tip benchmarks and browse around, then make a change and observe the console intensely to see the new results. You're biasing the results simply by not switching tabs or screens!

If you have access to a Linux machine with nothing else running on it, that's going to be your best bet for benchmarking. Otherwise, shut down as many other programs as are feasible on your main machine before starting any benchmark tests.

Running a benchmark multiple times can also be a good way to compensate for environmental effects. Russ Cox's benchstat program is very useful for this; it gathers many runs of a benchmark, and tells you whether the results are statistically significant. Run your benchmark with the -count flag to run it multiple times in a row:

go test -count=20 -run=^$ -bench=Log15AddingFields | tee -a master-benchmark

Do the same for your change, writing to a different file ( change-benchmark ), then run benchstat:

benchstat master-benchmark change-benchmark

You'll get really nice looking output. This is generally the output used by the Go core development team to print benchmark results.

$ benchstat benchmarks/tip benchmarks/early-time-exit name old time/op new time/op delta StreamNoCtx-4 3.60µs ± 6% 3.17µs ± 1% -12.02% (p=0.001 n=8+6) Discard-4 837ns ± 1% 804ns ± 3% -3.94% (p=0.001 n=7+6) CallerFileHandler-4 1.94µs ± 2% 1.88µs ± 0% -3.01% (p=0.003 n=7+5) CallerFuncHandler-4 1.72µs ± 3% 1.65µs ± 1% -3.98% (p=0.001 n=7+6) LogfmtNoCtx-4 2.39µs ± 2% 2.04µs ± 1% -14.69% (p=0.001 n=8+6) JsonNoCtx-4 1.68µs ± 1% 1.66µs ± 0% -1.08% (p=0.001 n=7+6) MultiLevelFilter-4 880ns ± 2% 832ns ± 0% -5.44% (p=0.001 n=7+6) Descendant1-4 855ns ± 3% 818ns ± 1% -4.28% (p=0.000 n=8+6) Descendant2-4 872ns ± 3% 830ns ± 2% -4.87% (p=0.001 n=7+6) Descendant4-4 934ns ± 1% 893ns ± 1% -4.41% (p=0.001 n=7+6) Descendant8-4 990ns ± 2% 958ns ± 2% -3.29% (p=0.002 n=7+6)

OK! So now we have a framework for measuring whether a change helps or hurts.

How can I make my code faster?

Good question! There's no one answer for this. In general, there are three strategies:

Figure out a way to do less work than you did before. Avoid doing an expensive computation where it's not necessary, exit early from functions, &c.

Do the same work, but in a faster way; use a different algorithm, or use a different function, that's faster.

Do the same work, but parallelize it across multiple CPU's, or threads.

Each technique is useful in different places, and it can be hard to predict where you'll be able to extract performance improvements. It is useful to know how expensive various operations are, so you can evaluate the costs of a given operation.

It's also easy to spend a lot of time "optimizing" something only to realize it's not the bottleneck in your program. If you optimize something that takes 5% of the code's execution time, the best overall speedup you can get is 5%, even if you get the code to run instantaneously. Go's test framework has tools for figuring out where your code is spending the majority of its time. To get the best use out of them, focus on profiling the code execution for a single benchmark. Here, I'm profiling the StreamNoCtx benchmark in the log15 library.

$ go test -cpuprofile=cpu.out -benchmem -memprofile=mem.out -run=^$ -bench=StreamNoCtx -v BenchmarkStreamNoCtx-4 500000 3502 ns/op 440 B/op 12 allocs/op

This will generate 3 files: cpu.out , mem.out , and log15.test . These are binary files. You want to use the pprof tool to evaluate them. First let's look at the CPU profile; I've started it and then run top10 to get the top 10 functions.

$ go tool pprof log15.test cpu.out Entering interactive mode (type "help" for commands) (pprof) top5 560ms of 1540ms total (36.36%) Showing top 5 nodes out of 112 (cum >= 60ms) flat flat% sum% cum cum% 180ms 11.69% 11.69% 400ms 25.97% runtime.mallocgc 160ms 10.39% 22.08% 160ms 10.39% runtime.mach_semaphore_signal 100ms 6.49% 28.57% 260ms 16.88% github.com/inconshreveable/log15.escapeString 60ms 3.90% 32.47% 70ms 4.55% bytes.(*Buffer).WriteByte 60ms 3.90% 36.36% 60ms 3.90% runtime.stringiter2

The top 5 functions are responsible for 36% of the program's runtime. flat is how much time is spent inside of a function, cum shows how much time is spent in a function, and also in any code called by a function. Of these 5, runtime.stringiter2 , runtime.mallocgc , and runtime.mach_semaphore_signal are not good candidates for optimization. They are very specialized pieces of code, and they're part of the Go runtime, so changes need to pass all tests and get approved by the Go core development team. We could potentially figure out how to call them less often though - mallocgc indicates we are creating lots of objects. Maybe we could figure out how to create fewer objects.

The likeliest candidate for improvement is the escapeString function in our own codebase. The list function is super useful for checking this.

(pprof) list escapeString ROUTINE ======================== github.com/inconshreveable/log15.escapeString in /Users/kevin/code/go/src/github.com/inconshreveable/log15/format.go 30ms 330ms (flat, cum) 23.40% of Total 10ms 10ms 225:func escapeString(s string) string { . . 226: needQuotes := false . 90ms 227: e := bytes.Buffer{} . . 228: e.WriteByte('"') 10ms 50ms 229: for _, r := range s { . . 230: if r <= ' ' || r == '=' || r == '"' { . . 231: needQuotes = true . . 232: } . . 233: . . 234: switch r { . . 235: case '\\', '"': . . 236: e.WriteByte('\\') . . 237: e.WriteByte(byte(r)) . . 238: case '

': . . 239: e.WriteByte('\\') . . 240: e.WriteByte('n') . . 241: case '\r': . . 242: e.WriteByte('\\') . . 243: e.WriteByte('r') . . 244: case '\t': . . 245: e.WriteByte('\\') . . 246: e.WriteByte('t') . . 247: default: . 100ms 248: e.WriteRune(r) . . 249: } . . 250: } . 10ms 251: e.WriteByte('"') . . 252: start, stop := 0, e.Len() . . 253: if !needQuotes { . . 254: start, stop = 1, stop-1 . . 255: } 10ms 70ms 256: return string(e.Bytes()[start:stop])

The basic idea here is to write a string, but add a backslash before double quotes, newlines, and tab characters. Some ideas for improvement:

We create a bytes.Buffer{} at the beginning of the function. We could keep a Pool of buffers, and retrieve one, so we don't need to allocate memory for a buffer each time we escape a string.

If a string doesn't contain a double quote, newline, tab, or carriage return, it doesn't need to be escaped. Maybe we can avoid creating the Buffer entirely for that case, if we can find a fast way to check whether a string has those characters in it.

If we call WriteByte twice in a row, we could probably replace it with a WriteString(), which will use a copy to move two bytes, instead of growing the array twice.

We call e.Bytes() and then cast the result to a string. Maybe we can figure out how to call e.String() directly.

You can then look at how much time is being spent in each area to get an idea of how much any given idea will help your benchmarks. For example, replacing WriteByte with WriteString probably won't save much time; you're at most saving the time it takes to write every quote and newline, and most strings are made up of letters and numbers and space characters instead. (It doesn't show up at all in the benchmark but that's because the benchmark writes the phrase "test message" over and over again, and that doesn't have any escape-able characters).

That's the CPU benchmark; how much time the CPU spends running each function in the codebase. There's also the memory profile; how much memory each function allocates. Let's look at that! We have to pass one of these flags to pprof to get it to show memory information.

Sample value selection option (for heap profiles): -inuse_space Display in-use memory size -inuse_objects Display in-use object counts -alloc_space Display allocated memory size -alloc_objects Display allocated object counts

Let's pass one. Notice in this case, the top 5 functions allocate 96% of the objects:

go tool pprof -alloc_objects log15.test mem.out (pprof) top5 6612331 of 6896359 total (95.88%) Showing top 5 nodes out of 18 (cum >= 376843) flat flat% sum% cum cum% 2146370 31.12% 31.12% 6612331 95.88% github.com/inconshreveable/log15.LogfmtFormat.func1 1631482 23.66% 54.78% 1631482 23.66% github.com/inconshreveable/log15.escapeString 1540119 22.33% 77.11% 4465961 64.76% github.com/inconshreveable/log15.logfmt 917517 13.30% 90.42% 1294360 18.77% github.com/inconshreveable/log15.formatShared 376843 5.46% 95.88% 376843 5.46% time.Time.Format

Let's look at a function:

ROUTINE ======================== github.com/inconshreveable/log15.logfmt in /Users/kevin/code/go/src/github.com/inconshreveable/log15/format.go 1540119 4465961 (flat, cum) 64.76% of Total . . 97: if i != 0 { . . 98: buf.WriteByte(' ') . . 99: } . . 100: . . 101: k, ok := ctx[i].(string) . 2925842 102: v := formatLogfmtValue(ctx[i+1]) . . 103: if !ok { . . 104: k, v = errorKey, formatLogfmtValue(k) . . 105: } . . 106: . . 107: // XXX: we should probably check that all of your key bytes aren't invalid . . 108: if color > 0 { . . 109: fmt.Fprintf(buf, "\x1b[%dm%s\x1b[0m=%s", color, k, v) . . 110: } else { 1540119 1540119 111: fmt.Fprintf(buf, "%s=%s", k, v) . . 112: } . . 113: }

In the common case on line 111 (when color = 0), we're calling fmt.Fprintf to write the key, then an equals sign, then the value. Fprintf also has to allocate memory for its own byte buffer, then parse the format string, then interpolate the two variables. It might be faster, and avoid allocations, to just call buf.WriteString(k) , then write the equals sign, then call buf.WriteString(v) . But you'd want to benchmark it first to double check!

Conclusion

Using a combination of these techniques, I was able to improve the performance of log15 by about 30% for some common code paths, and reduce memory allocations as well. I was not expecting this, but I also found a way to speed up JSON encoding in the Go standard library by about 20%!

Want someone to benchmark/improve performance in your company's application, or teach your team more about benchmarking? I am available for consulting; email me and let's set something up!

Liked what you read? I am available for hire.