Reduce allocation in Go code

I’ve implemented parameter interpolation in go-sql-driver/mysql. You can enable it by adding “interpolateParams=true” option to dsn.

Why this feature is important is described in here and here. I don’t say lot about it here.

When writing low level library like MySQL driver, you can’t assume user’s performance requirements. Especially, avoiding memory allocations is important. Go’s GC is not so fast. And everyone want memory usage of their server is stable.

So I’ve reduced allocations in the interpolation as possible. This post describes how I did it and coding tips to avoid allocations.

1. Write it correct.

Since it was new feature, I wrote it quickly like writing Python code. (Go can be used like Python or C!).

I’ve made pull request with “[RFC]” in title. We’ve discussed about feature and rough design first.

2. Write benchmark program.

Before starting tuning, I’ve prepared benchmark program to confirm performance difference.

Writing benchmark program in Go is easy: Write function like BenchmarkXxxx(b *testing.B) in xxx_test.go . To see allocations, call b.ReportAllocs() in it.

Here is benchmark function I’ve wrote:

func BenchmarkInterpolation(b *testing.B) { mc := &mysqlConn{ cfg: &config{ interpolateParams: true, loc: time.UTC, }, maxPacketAllowed: maxPacketSize, maxWriteSize: maxPacketSize - 1, } args := []driver.Value{ int64(42424242), float64(math.Pi), false, time.Unix(1423411542, 807015000), []byte("bytes containing special chars ' \" \a \x00"), "string containing special chars ' \" \a \x00", } q := "SELECT ?, ?, ?, ?, ?, ?" b.ReportAllocs() b.ResetTimer() for i := 0; i < b.N; i++ { _, err := mc.interpolateParams(q, args) if err != nil { b.Fatal(err) } } }

To run it:

$ go test -bench=BenchmarkInterpolation PASS BenchmarkInterpolation 300000 3887 ns/op 1144 B/op 15 allocs/op ok github.com/go-sql-driver/mysql 2.386s

3. Identify allocations

When setting allocfreetrace=1 to environment variable GODEBUG , you can see stacktrace where allocation occures (reference).

But using it while running go test -bench=BenchmarkInterpolation makes huge log including large noise. To reduce log, build mysql.test with go test -c and run it.

$ go test -c $ GODEBUG=allocfreetrace=1 ./mysql.test -test.run=none -test.bench=BenchmarkInter -test.benchtime=10ms 2>trace.log PASS BenchmarkInterpolation 5000 4095 ns/op 1144 B/op 15 allocs/op

-test.run=none prevents running tests before benchmark. -test.benchtime=10ms reduces execution time and log size.

Now I have trace.log . Open it with vim and search interpolateParams . Cut unnecessary stacktrace before it. I can see stacktrace like this:

tracealloc(0xc2080100e0, 0x70, string) goroutine 5 [running]: runtime.mallocgc(0x70, 0x22e4e0, 0x0, 0x2) /usr/local/Cellar/go/1.4.1/libexec/src/runtime/malloc.go:327 +0x32a fp=0xc20802ea60 sp=0xc20802e9b0 runtime.newarray(0x22e4e0, 0x7, 0x15c5e) /usr/local/Cellar/go/1.4.1/libexec/src/runtime/malloc.go:365 +0xc1 fp=0xc20802ea98 sp=0xc20802ea60 runtime.makeslice(0x2229c0, 0x7, 0x7, 0x0, 0x0, 0x0) /usr/local/Cellar/go/1.4.1/libexec/src/runtime/slice.go:32 +0x15c fp=0xc20802eae0 sp=0xc20802ea98 strings.genSplit(0x30c190, 0x17, 0x2e1f10, 0x1, 0x0, 0x7, 0x0, 0x0, 0x0) /usr/local/Cellar/go/1.4.1/libexec/src/strings/strings.go:287 +0x14d fp=0xc20802eb60 sp=0xc20802eae0 strings.Split(0x30c190, 0x17, 0x2e1f10, 0x1, 0x0, 0x0, 0x0) /usr/local/Cellar/go/1.4.1/libexec/src/strings/strings.go:325 +0x76 fp=0xc20802ebb0 sp=0xc20802eb60 github.com/go-sql-driver/mysql.(*mysqlConn).interpolateParams(0xc20802eed0, 0x30c190, 0x17, 0xc20802ee70, 0x6, 0x6, 0x0, 0x0, 0x0, 0x0) /Users/inada-n/go1.4/src/github.com/go-sql-driver/mysql/connection.go:180 +0x86 fp=0xc20802ed38 sp=0xc20802ebb0 github.com/go-sql-driver/mysql.BenchmarkInterpolation(0xc20806a400) /Users/inada-n/go1.4/src/github.com/go-sql-driver/mysql/benchmark_test.go:240 +0x437 fp=0xc20802ef58 sp=0xc20802ed38 testing.(*B).runN(0xc20806a400, 0x1) /usr/local/Cellar/go/1.4.1/libexec/src/testing/benchmark.go:124 +0x95 fp=0xc20802ef68 sp=0xc20802ef58 testing.(*B).launch(0xc20806a400) /usr/local/Cellar/go/1.4.1/libexec/src/testing/benchmark.go:199 +0x78 fp=0xc20802efd8 sp=0xc20802ef68 runtime.goexit() /usr/local/Cellar/go/1.4.1/libexec/src/runtime/asm_amd64.s:2232 +0x1 fp=0xc20802efe0 sp=0xc20802efd8 created by testing.(*B).run /usr/local/Cellar/go/1.4.1/libexec/src/testing/benchmark.go:179 +0x3e ...

This stacktrace shows strings.Split() called from interpolateParams() cause allocation. You can see also filename and line number ( connection.go:180 ).

Here is the code:

func (mc *mysqlConn) interpolateParams(query string, args []driver.Value) (string, error) { chunks := strings.Split(query, "?") if len(chunks) != len(args)+1 { return "", driver.ErrSkip }

4. Tuning up

OK. It’s time to start optimize. Let’s record current benchmark result. I use it later.

$ go test -bench=BenchmarkInterpolation | tee old.txt PASS BenchmarkInterpolation 500000 3942 ns/op 1144 B/op 15 allocs/op ok github.com/go-sql-driver/mysql 3.219s

4.1. Avoid concatenate strings, Use []byte and append

My first version is very Pythonic. When interpolating db.Exec("SELECT SLEEP(?)", 42) , my code did like this:

parts := []string{"SELECT SLEEP(", escape(42), ")"} query := strings.Join(parts, "")

Since Go’s string is immutable, making temporary string cause allocations. So I replaced stirngs.Join() with []byte and append . I can use strconv.AppendInt() and strconv.AppendFloat() to avoid temporal string when formatting int64 and float64.

Now my code looks like:

buf := make([]byte, 0, estimatedSize) argPos := 0 for _, c := range []byte(query) { if c != '?' { buf = append(buf, c) continue } arg := args[argPos] argPos++ if arg == nil { buf = append(buf, []byte("NULL")...) continue } switch v := arg.(type) { case int64: buf = strconv.AppendInt(buf, v, 10) ...

4.2. Use benchcmp

I’ve made first optimization. Let’s measure it.

benchcmp is famous tool to compare benchmark result. You can include it’s output in commit log. It makes your commit looks cool.

$ go get -u golang.org/x/tools/cmd/benchcmp $ go test -bench=BenchmarkInterpolation > new.txt $ benchcmp old.txt new.txt benchmark old ns/op new ns/op delta BenchmarkInterpolation 3942 2573 -34.73% benchmark old allocs new allocs delta BenchmarkInterpolation 15 6 -60.00% benchmark old bytes new bytes delta BenchmarkInterpolation 1144 560 -51.05%

You can see number of allocation is reduced from 15 to 6. Execution time and memory consumption are also reduced.

4.3. Avoid Time.Format()

time.Time doesn’t provides methods like strconv.AppendInt() . To avoid making temporary string, I’ve wrote it myself.

Before:

case time.Time: if v.IsZero() { buf = append(buf, []byte("'0000-00-00'")...) } else { fmt := "'2006-01-02 15:04:05.999999'" if v.Nanosecond() == 0 { fmt = "'2006-01-02 15:04:05'" } s := v.In(mc.cfg.loc).Format(fmt) buf = append(buf, []byte(s)...) }

After (including other improvements):

case time.Time: if v.IsZero() { buf = append(buf, "'0000-00-00'"...) } else { v := v.In(mc.cfg.loc) v = v.Add(time.Nanosecond * 500) // To round under microsecond year := v.Year() year100 := year / 100 year1 := year % 100 month := v.Month() day := v.Day() hour := v.Hour() minute := v.Minute() second := v.Second() micro := v.Nanosecond() / 1000 buf = append(buf, []byte{ '\'', digits10[year100], digits01[year100], digits10[year1], digits01[year1], '-', digits10[month], digits01[month], '-', digits10[day], digits01[day], ' ', digits10[hour], digits01[hour], ':', digits10[minute], digits01[minute], ':', digits10[second], digits01[second], }...) if micro != 0 { micro10000 := micro / 10000 micro100 := micro / 100 % 100 micro1 := micro % 100 buf = append(buf, []byte{ '.', digits10[micro10000], digits01[micro10000], digits10[micro100], digits01[micro100], digits10[micro1], digits01[micro1], }...) } buf = append(buf, '\'') }

It reduces two allocations:

benchmark old allocs new allocs delta BenchmarkInterpolation 6 4 -33.33%

4.4. Avoid range when iterating string

Usually, range is used for iterating slice. But for _, c := range s (where s is string ) produces rune , not byte .

My first code used for i, c := range([]byte(s)) { . But Go 1.4 make new slice and copy s to it. (Go 1.5 optimize it out).

So I’ve used C-like for loop:

@@ -210,8 +210,8 @@ func (mc *mysqlConn) interpolateParams(query string, args []driver.Value) (strin buf := make([]byte, 0, estimated) argPos := 0 - // Go 1.5 will optimize range([]byte(string)) to skip allocation. - for _, c := range []byte(query) { + for i := 0; i < len(query); i++ { + c := query[i] if c != '?' {

It reduces one allocation.

benchmark old allocs new allocs delta BenchmarkInterpolation 4 3 -25.00%

4.5. append string to []byte directly

Concatenating slices can be written like buf = append(buf, token...) . Basically, buf and token should have same type. But there is one exception: when buf is []byte , token can be string .

@@ -210,17 +210,19 @@ func (mc *mysqlConn) interpolateParams(query string, args []driver.Value) (strin argPos := 0 for i := 0; i < len(query); i++ { - c := query[i] - if c != '?' { - buf = append(buf, c) - continue + q := strings.IndexByte(query[i:], '?') + if q == -1 { + buf = append(buf, query[i:]...) + break } + buf = append(buf, query[i:i+q]...) + i += q arg := args[argPos] argPos++ if arg == nil { - buf = append(buf, []byte("NULL")...) + buf = append(buf, "NULL"...) continue } ...

I couldn’t reduce allocation from this change. But I could improve readability a lot.

4.6. Make two same function: for []byte and for string

MySQL requires same escaping for binary and string.

Sadly, since Go doesn’t have “read only slice”, converting between string and []byte cause allocation and copy. Current compiler can optimize it only for very simple cases.

So I’ve made mostly same two functions for []byte and string . Generally speaking, duplicated code is evil. But TEXT and BLOB column may store large data. I want to avoid allocating and copying large data. So I did evil.

It reduces one more allocation.

benchmark old ns/op new ns/op delta BenchmarkInterpolation 2463 2118 -14.01% benchmark old allocs new allocs delta BenchmarkInterpolation 3 2 -33.33% benchmark old bytes new bytes delta BenchmarkInterpolation 496 448 -9.68%

4.7. Use send/receive buffer for scratchpad

Last two allocation is: (1) Make scratchpad buffer for interpolate, and (2) convert it to string.

I can remove them by create MySQL packet directly. But it may breaks separation of packet layer and higher layer. So I didn’t it for now.

Instead, I borrowed send/receive buffer and use it as scratchpad buffer. It reduces one allocation.

benchmark old ns/op new ns/op delta BenchmarkInterpolation 1900 1403 -26.16% benchmark old allocs new allocs delta BenchmarkInterpolation 2 1 -50.00% benchmark old bytes new bytes delta BenchmarkInterpolation 448 160 -64.29%

Conclusion

Go’s compiler is not intelligent compared with gcc or clang. You should use some tips to reduce allocation.

But Go’s runtime is great. I can measure and find allocation quickly. I think it is more important than magical optimization.