Solve performance hotspots like solving a puzzle.




Introduction

Igneous is an unstructured data management company. We move bytes back and forth — that’s what we do. From a low-level programming perspective, we want to touch each byte as few times as possible in order to make things go fast. In a garbage-collected language like Go, that also extends to minimizing how many bytes we allocate for each byte that is moved. One technique at hand for minimizing memory allocations is memory pooling.

So, what is memory pooling, and how do we benefit from it? This article answers those questions, shows how to identify which “hot spots” are good candidates for memory pooling, and then works through two examples: one that can be fixed with the memory pool built into the Go standard library, and one that ought to be fixed with a custom-built memory pool.

What Is Memory Pooling?

A memory pool refers loosely to a group of memory blocks that are allocated and freed under programmer control. This is an old technique going back to the dawn of programming, but one which finds important uses in a garbage-collected language like Go, where using the language’s built-in memory allocator for large numbers of fixed-size allocations will typically result in a performance hotspot.

The reasons are several: when the built-in memory allocator is used, Go must not only allocate a block of memory on demand but also zero its bytes. Furthermore, pressure is put on the garbage collector to reclaim the blocks once they have fallen out of use, which keeps the CPU busy and isn’t sustainable.

In fact, the designers of Go recognized this need and supplied a pool manager in the standard library: sync.Pool. sync.Pool lets Go programmers hold on to objects they have finished with and reuse them later, instead of asking the language’s built-in allocator for fresh memory each time. Note that the pool is only a cache: the runtime is free to drop pooled objects at any time. A worked-out example is probably the best way to show it in use, and we will start with one here.
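Before the longer example, here is a minimal sketch of the Get/Put cycle. The names bufPool and greet are made up for this illustration and are not part of the benchmarks in this article; the sketch only assumes the documented sync.Pool behavior (New is called when the pool has nothing to hand out).

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out *bytes.Buffer values. New is only called when the
// pool has nothing cached to return. (Hypothetical name.)
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// greet formats a message using a pooled scratch buffer.
func greet(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a recycled buffer may still hold old contents
	defer bufPool.Put(buf) // hand the buffer back for reuse when done
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the result outlives the Put
}

func main() {
	fmt.Println(greet("gopher"))
	fmt.Println(greet("pool")) // likely reuses the buffer from the first call
}
```

The caller is responsible for resetting whatever it takes out of the pool and for not touching it again after Put.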

How Does One Use It?

The first thing one does before even adding something like a pooled allocation is to find a performance hotspot that could benefit from it. There are tradeoffs for using pooled memory, and one of those is decreased reliability: each memory block must now be allocated and freed “by hand” (under programmer control), with the attendant risk of accidentally allocating or freeing the same block twice, thereby setting up the possibility for data corruption. For the purposes of illustration, we can write a contrived benchmark to show a performance hotspot, which also serves to show the use of all of Go’s unit benchmark and profiling tools.
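To make the double-free risk mentioned above concrete, here is a small sketch using a deliberately naive free list. The freeList type and doubleFreeAliases function are hypothetical and exist only for this illustration; sync.Pool itself makes no promise about which object Get returns, so a hand-rolled list is used to keep the outcome deterministic.

```go
package main

import "fmt"

// freeList is a deliberately naive pool of 8-byte buffers (hypothetical).
type freeList struct{ bufs [][]byte }

// Get pops a cached buffer, or allocates one if the list is empty.
func (f *freeList) Get() []byte {
	if n := len(f.bufs); n > 0 {
		b := f.bufs[n-1]
		f.bufs = f.bufs[:n-1]
		return b
	}
	return make([]byte, 8)
}

// Put pushes a buffer onto the list without checking for duplicates.
func (f *freeList) Put(b []byte) { f.bufs = append(f.bufs, b) }

// doubleFreeAliases frees one buffer twice and reports whether two later
// Gets ended up sharing the same memory.
func doubleFreeAliases() bool {
	var f freeList
	b := f.Get()
	f.Put(b)
	f.Put(b) // bug: the same block is now on the free list twice

	x := f.Get() // two supposedly independent buffers...
	y := f.Get()
	copy(x, "xxxxxxxx")
	copy(y, "yyyyyyyy") // ...but this write also clobbers x
	return string(x) == "yyyyyyyy"
}

func main() {
	fmt.Println("buffers alias each other:", doubleFreeAliases())
}
```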

This is our simple benchmark: take a buffer which holds the gzip compression of 1024 copies of the string “how now brown cow”, and decompress it N times. We use the allocator that the benchmark is given to allocate a gzip “reader” (decompressor).

    // A contrived sample benchmark:
    //
    // Allocate a gzip reader and decompress the bytes in "gzcow", discarding the result.
    func gunziploop(b *testing.B, m pool) {
        for i := 0; i < b.N; i++ {
            r := m.Get().(*gzip.Reader)
            r.Reset(bytes.NewReader(gzcow.Bytes()))
            if n, err := io.Copy(ioutil.Discard, r); err != nil {
                b.Fatal(err)
            } else if int(n) != 1024*len("how now brown cow") {
                b.Fatal("bad length")
            }
            m.Put(r)
        }
    }
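The benchmark reads from a buffer named gzcow whose construction is not shown. One way it might be built, as a sketch under the assumption that gzcow is a *bytes.Buffer filled once up front (makeGzcow and gunzipLen are hypothetical helper names):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

const cow = "how now brown cow"

// makeGzcow returns a buffer holding the gzip compression of 1024
// copies of cow. (Assumed setup; the article does not show it.)
func makeGzcow() (*bytes.Buffer, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := io.WriteString(zw, strings.Repeat(cow, 1024)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes the gzip trailer
		return nil, err
	}
	return &buf, nil
}

// gunzipLen decompresses data and reports the uncompressed length.
func gunzipLen(data []byte) (int64, error) {
	r, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return 0, err
	}
	return io.Copy(io.Discard, r)
}

func main() {
	gzcow, err := makeGzcow()
	if err != nil {
		fmt.Println("compress:", err)
		return
	}
	n, err := gunzipLen(gzcow.Bytes())
	if err != nil {
		fmt.Println("decompress:", err)
		return
	}
	fmt.Printf("compressed %d bytes, uncompressed %d bytes\n", gzcow.Len(), n)
}
```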





This doesn’t run quickly. For our purposes, we’ve just identified a performance hotspot, and now we’ll examine it and try to make it run faster.

Because the benchmark goes through the Get/Put interface we defined (which also happens to be the interface that sync.Pool satisfies), we can first profile how well Go’s built-in memory allocation works here:

    // sync.Pool interface definition
    type pool interface {
        Get() interface{}
        Put(x interface{})
    }

    // A nopool "pool" simply allocates a new gzip Reader from the heap each time
    type nopool struct{}

    func (*nopool) Get() interface{}  { return new(gzip.Reader) }
    func (*nopool) Put(x interface{}) {}

    func BenchmarkGunzipNopool(b *testing.B) {
        gunziploop(b, new(nopool))
    }





Running this benchmark we see this number:

    ; go test -bench Nopool -benchmem -cpuprofile cpu.out
    BenchmarkGunzipNopool-4   100000   14286 ns/op   41343 B/op   6 allocs/op





After opening the CPU profile associated with this benchmark, we find that the top 10 hits in the CPU profile are:

    ; go tool pprof *.test cpu.out
    (pprof) top
    Showing nodes accounting for 2.11s, 96.35% of 2.19s total
    Dropped 28 nodes (cum <= 0.01s)
    Showing top 10 nodes out of 61
          flat  flat%   sum%        cum   cum%
         1.56s 71.23% 71.23%      1.56s 71.23%  runtime.pthread_cond_signal
         0.26s 11.87% 83.11%      0.26s 11.87%  runtime.pthread_cond_wait
         0.15s  6.85% 89.95%      0.15s  6.85%  runtime.pthread_cond_timedwait_relative_np
         0.03s  1.37% 91.32%      0.03s  1.37%  runtime.sweepone
         0.02s  0.91% 92.24%      0.03s  1.37%  compress/flate.(*dictDecoder).tryWriteCopy
         0.02s  0.91% 93.15%      0.02s  0.91%  compress/flate.(*huffmanDecoder).init
         0.02s  0.91% 94.06%      0.02s  0.91%  runtime.memclrNoHeapPointers
         0.02s  0.91% 94.98%      0.02s  0.91%  runtime.memmove
         0.02s  0.91% 95.89%      0.02s  0.91%  runtime.wbBufFlush1
         0.01s  0.46% 96.35%      0.13s  5.94%  rand.gunziploop
    (pprof)





It takes some investigation to see what is going on here, but having seen this a few times, someone experienced with Go will notice that 7 out of the top 10 hits have nothing to do with our benchmark! They are various runtime routines, including condition variable activity (the pthread_cond_xxx lines), which is surprising to see in a single-threaded benchmark until one reflects that the garbage collector in Go is running concurrently with the benchmark. Therefore, it would seem some seven out of ten of the top users of the CPU are related to the garbage collector.

The obvious next step is to take a heap profile, but before doing so, note that the CPU profile can also yield hints as to where the allocations may be coming from. Use peek:

    (pprof) peek mallocgc
    Showing nodes accounting for 2.13s, 100% of 2.13s total
    ----------------------------------------------------------+-------------
          flat  flat%   sum%        cum   cum%   calls calls% + context
    ----------------------------------------------------------+-------------
                                                 0.02s 66.67% |   runtime.makeslice
                                                 0.01s 33.33% |   runtime.newobject
             0     0%     0%      0.03s  1.41%                | runtime.mallocgc
                                                 0.01s 33.33% |   runtime.(*mcache).nextFree
                                                 0.01s 33.33% |   runtime.nextFreeFast
                                                 0.01s 33.33% |   runtime.profilealloc
    ----------------------------------------------------------+-------------
    (pprof) peek makeslice
    Showing nodes accounting for 2.13s, 100% of 2.13s total
    ----------------------------------------------------------+-------------
          flat  flat%   sum%        cum   cum%   calls calls% + context
    ----------------------------------------------------------+-------------
                                                 0.02s   100% |   compress/flate.(*dictDecoder).init
             0     0%     0%      0.02s  0.94%                | runtime.makeslice
                                                 0.02s   100% |   runtime.mallocgc
    ----------------------------------------------------------+-------------





This is interesting. Without even using a heap profiler, we can see that two-thirds of the samples in mallocgc are reached through makeslice, and that all of those makeslice samples come from compress/flate.(*dictDecoder).init, that is, from initializing a gzip decoder.

Just to confirm this, we can also take a heap profile:

    (pprof) top
    Showing nodes accounting for 4367.40MB, 99.66% of 4382.47MB total
    Dropped 6 nodes (cum <= 21.91MB)
          flat  flat%   sum%        cum   cum%
     3450.62MB 78.74% 78.74%  3450.62MB 78.74%  compress/flate.(*dictDecoder).init
      841.73MB 19.21% 97.94%  4292.35MB 97.94%  compress/flate.NewReader
       75.05MB  1.71% 99.66%    75.05MB  1.71%  rand.(*nopool).Get
             0     0% 99.66%  4292.35MB 97.94%  compress/gzip.(*Reader).Reset





Sure enough, most of the memory allocations come from initializing a gzip decoder: some 3.5GB.

Now we can look at the same benchmark with a pooled allocation in place.

Because of the way we set up this example, the pooled allocation is a gimme: our benchmark already follows the sync.Pool interface, so actually using sync.Pool involves no extra work:

    // The sync.Pool benchmark shows the effects of amortizing the
    // new(gzip.Reader) call across multiple calls to Get()
    func BenchmarkGunzipPooled(b *testing.B) {
        gunziploop(b, &sync.Pool{New: new(nopool).Get})
    }





The results are dramatically different:

    BenchmarkGunzipPooled-4   200000   9332 ns/op   48 B/op   1 allocs/op

    (pprof) top
    Showing nodes accounting for 1590ms, 92.98% of 1710ms total
    Showing top 10 nodes out of 45
          flat  flat%   sum%        cum   cum%
         300ms 17.54% 17.54%      310ms 18.13%  compress/flate.(*huffmanDecoder).init
         260ms 15.20% 32.75%      260ms 15.20%  runtime.memmove
         240ms 14.04% 46.78%      250ms 14.62%  compress/flate.(*decompressor).huffSym
         220ms 12.87% 59.65%      450ms 26.32%  compress/flate.(*dictDecoder).tryWriteCopy
         210ms 12.28% 71.93%      210ms 12.28%  hash/crc32.ieeeCLMUL
         130ms  7.60% 79.53%      830ms 48.54%  compress/flate.(*decompressor).huffmanBlock
          90ms  5.26% 84.80%       90ms  5.26%  runtime.memclrNoHeapPointers
          50ms  2.92% 87.72%       50ms  2.92%  bytes.(*Reader).ReadByte
          50ms  2.92% 90.64%      430ms 25.15%  compress/flate.(*decompressor).readHuffman
          40ms  2.34% 92.98%       50ms  2.92%  compress/flate.(*decompressor).Reset





The benchmark now runs in about two-thirds of the time (9332 ns/op versus 14286 ns/op), and the CPU profile shows that nearly all of the time is spent in gzip decompression, which is where we hope to be spending time in an application like this. Furthermore, the allocation count has been reduced to one per benchmark iteration (it’s the bytes.NewReader in the benchmark loop).

A parting thought: the benchmem allocation counts shown in the one-line summaries above come in very handy when using unit benchmarking, but when debugging large programs running in the field, reducing a performance problem to a small unit benchmark is not always feasible. Therefore, it’s good to be able to look at a full-program CPU or heap profile and deduce performance problems that way.
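For programs that are not run under go test, the standard runtime/pprof package can capture the same kinds of profiles from inside the program. The following is a rough sketch only: the profile helper is a made-up name, and it collects into memory for simplicity, whereas a real service would more likely write to files or expose the profiles some other way.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// profile runs work under the CPU profiler and then snapshots the heap
// profile. Both results are in the format that `go tool pprof` reads.
// (Hypothetical helper, not part of the article's benchmarks.)
func profile(work func()) (cpu, heap []byte, err error) {
	var cpuBuf, heapBuf bytes.Buffer
	if err := pprof.StartCPUProfile(&cpuBuf); err != nil {
		return nil, nil, err
	}
	work()
	pprof.StopCPUProfile() // flushes the CPU profile into cpuBuf

	runtime.GC() // bring heap statistics up to date before the snapshot
	if err := pprof.WriteHeapProfile(&heapBuf); err != nil {
		return nil, nil, err
	}
	return cpuBuf.Bytes(), heapBuf.Bytes(), nil
}

func main() {
	cpu, heap, err := profile(func() {
		var keep [][]byte
		for i := 0; i < 1000; i++ {
			keep = append(keep, make([]byte, 1<<10))
		}
		_ = keep
	})
	if err != nil {
		fmt.Println("profile:", err)
		return
	}
	fmt.Printf("cpu profile: %d bytes, heap profile: %d bytes\n", len(cpu), len(heap))
}
```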

When Is a Different Sort of Memory Pooling Appropriate?

The benchmark we just walked through happened to be well suited to sync.Pool, and indeed it’s quite probable that caching a gzip decompressor in exactly this fashion can be found “out in the wild”: if you look at the definition of a gzip Reader, you’ll find that it holds many kilobytes of state and that allocating it from scratch each time from the heap is costly.
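Outside of a benchmark, this kind of caching might be wrapped in a small helper. The sketch below is one possible shape, not code from any particular project; readerPool, gunzip, and gzipBytes are hypothetical names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"sync"
)

// readerPool caches *gzip.Reader values between calls. It has no New
// function, so Get returns nil when the pool is empty.
var readerPool sync.Pool

// gunzip decompresses data, reusing a pooled gzip.Reader when possible.
func gunzip(data []byte) ([]byte, error) {
	src := bytes.NewReader(data)
	r, ok := readerPool.Get().(*gzip.Reader)
	if ok {
		// Reuse the cached reader's internal state for the new input.
		if err := r.Reset(src); err != nil {
			return nil, err // bad header; r is simply dropped
		}
	} else {
		var err error
		if r, err = gzip.NewReader(src); err != nil {
			return nil, err
		}
	}
	defer readerPool.Put(r)
	return io.ReadAll(r)
}

// gzipBytes compresses data; it is only here to feed the example.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	gz, err := gzipBytes([]byte("how now brown cow"))
	if err != nil {
		fmt.Println("compress:", err)
		return
	}
	for i := 0; i < 2; i++ { // the second pass can reuse the pooled reader
		out, err := gunzip(gz)
		fmt.Println(string(out), err)
	}
}
```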

Sometimes allocations are not of a fixed size, and sync.Pool is not the appropriate tool to reach for. Let’s illustrate this with contrived benchmark number two:

    // contrived benchmark: read either 10 or maxalloc-1
    // bytes from the bigcow reader into an allocated buffer
    // of that size. Hold 50 allocations at once to exercise
    // the pooled allocators.
    func allocloop(b *testing.B, r io.ReadSeeker, m alloc) {
        for i := 0; i < b.N; i++ {
            var bufs [50][]byte
            for i := range bufs {
                n := 10
                if rand.Intn(10) == 0 {
                    n = maxalloc - 1
                }
                bufs[i] = m.Alloc(n)
                if len(bufs[i]) != n {
                    b.Fatal("dishonest allocator")
                }
            }
            for i := range bufs {
                r.Seek(0, 0)
                if _, err := io.ReadFull(r, bufs[i]); err != nil {
                    b.Fatal(err)
                }
            }
            for i := range bufs {
                m.Free(bufs[i])
            }
        }
    }





This time our contrived benchmark works by flipping a ten-sided die and then copying either ten bytes or 256 kilobytes from our source repository of 16384 “how now brown cow” strings, depending on whether a “0” turns up. In order to defeat a simple “cache the last value” strategy, our contrived benchmark allocates and frees 50 buffers at once. Go’s memory allocator performs as follows:

    // profile heap allocations with a simple wrapper for
    // the alloc interface; frees are implicit with Go's GC.
    type heap struct{}

    func (*heap) Alloc(n int) []byte { return make([]byte, n) }
    func (*heap) Free([]byte)        {}

    func BenchmarkHeapAlloc(b *testing.B) {
        allocloop(b, &rcow, &heap{})
    }





This gives us the following result:

BenchmarkHeapAlloc-4 5000 210195 ns/op 1306198 B/op 50 allocs/op





As we might expect, Go’s allocator has to allocate (and zero), on average, five of the large 256-kilobyte blocks per iteration, or around 1.3 megabytes, as shown in the benchmem result above.
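The 1.3-megabyte figure can be checked with a quick expected-value calculation. This assumes each of the 50 buffers is independently large (maxalloc - 1 = 262143 bytes) with probability 1/10 and small (10 bytes) otherwise, and it ignores any rounding the allocator applies to request sizes:

```latex
\mathbb{E}[\text{bytes/op}]
  = 50 \times \left( 0.9 \times 10 + 0.1 \times 262143 \right)
  = 50 \times 26223.3
  \approx 1.31 \times 10^{6}
```

That is close to the 1306198 B/op reported by the benchmark.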

How would we try to pool allocate here? Since the allocation is unpredictable and may be small or large, we will often “miss” if we use a one-size-fits-all allocator such as sync.Pool: if the pool gives us a small buffer when we need to make a large allocation, this will result in allocating a new large buffer from the heap.

An implementation for Alloc/Free using sync.Pool looks like this:

    type syncpool struct{ sync.Pool }

    func (s *syncpool) Alloc(n int) []byte {
        if b, _ := s.Pool.Get().([]byte); cap(b) >= n {
            return b[:n]
        }
        return make([]byte, n) // pool allocation mis-sized
    }

    func (s *syncpool) Free(b []byte) {
        s.Pool.Put(b)
    }





The allocation, should it come from the pool, is up-sized by re-allocating from the heap as needed. Objects of any size are then placed back in the pool. When we run the benchmark, we get the following numbers:

    func BenchmarkSyncAlloc(b *testing.B) {
        allocloop(b, &rcow, &syncpool{})
    }

    BenchmarkSyncAlloc-4   20000   93332 ns/op   4892 B/op   50 allocs/op





This is a lot better than simple heap allocation, and the smaller “bytes per op” suggests that the pool is doing a good job of caching blocks on its free list. Even so, the benchmark still reports 50 allocations per iteration, and we can do better than this.

One approach we might use to address this is to maintain a list of pages of different sizes such that the requested allocation always fits into a page that is “close enough” in size. For the sake of making the illustration simple, let’s set up a pool allocator which maintains pages in powers of two. The way the allocator works is straightforward: each allocation is “mapped” to a power-of-2-sized bucket from 1 to 256kb:

    // pool allocate power-of-two pages up to a 256kb page size
    const maxalloc = 256 * 1024

    // log base 2 of integer n < 2^32 (not shown here)
    func lg2(n uint32) uint32

    type allocator struct {
        pages []pagelist
    }

    type pagelist struct {
        cache [][]byte
    }
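The body of lg2 is not shown in this article. Judging from how it is used to pick a bucket, one plausible reading is that it returns the floor of the base-2 logarithm. Under that assumption, a sketch using the standard math/bits package might look like this (the handling of 0 is also an assumption; the allocator never passes it):

```go
package main

import (
	"fmt"
	"math/bits"
)

// lg2 returns floor(log2(n)) for n >= 1. This is an assumed
// implementation. Zero has no logarithm, so this sketch maps it
// to bucket 0 rather than wrapping around.
func lg2(n uint32) uint32 {
	if n == 0 {
		return 0
	}
	// bits.Len32 is the number of bits needed to represent n,
	// which is floor(log2(n)) + 1 for n >= 1.
	return uint32(bits.Len32(n) - 1)
}

func main() {
	for _, n := range []uint32{1, 10, 1023, 1024, 256*1024 - 1} {
		fmt.Printf("lg2(%d) = %d\n", n, lg2(n))
	}
}
```

With this definition, sizes from 2^k up to 2^(k+1)-1 share bucket k, and the largest permitted size, 256*1024 - 1, lands in bucket 17, so 18 buckets cover the whole range.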





Each power-of-two “list” is just a slice of []byte. We’ll use Go’s slice operations to append and remove entries from the list, rather than complicate the allocator further by using a linked list. Allocations that are close in size will thereby be stored on the same free list. If the allocation pattern is not pathological, this will often result in many allocations of the same size being stored together on a particular free list, without the need to determine what the allocation pattern is beforehand.

Taking a closer look at Alloc , there is one subtle aspect to an allocation: if a matching page is found within the correct bucket, but the page is actually smaller than the size we are looking for, we simply size it up before returning it; it will be placed back on the same free list later anyway.

    // Alloc returns a byte slice of length n (though the capacity
    // may be greater). Alloc does not retain a reference to the
    // slice so "leaked" memory may be garbage collected by the runtime.
    func (a *allocator) Alloc(n int) []byte {
        if uint(n) >= maxalloc {
            panic("pool alloc: size too big")
        }
        if n == 0 {
            return nil
        }
        p := &a.pages[lg2(uint32(n))]
        var x []byte
        if l := len(p.cache) - 1; l >= 0 {
            // cache hit
            x = p.cache[l]
            p.cache = p.cache[:l]
        }
        if cap(x) < n {
            // cache miss, or the x found is too small
            x = make([]byte, n)
        }
        return x[:n]
    }





The Free function now re-threads a page back onto the free list of the appropriate size, even if it was not allocated from that free list in the first place. (Since each Alloc is paired with a Free in the benchmark loop, over time we expect all the allocations to eventually come from their respective free list.)

    // Free returns a slice to the pool allocator. It need not have
    // been allocated via Alloc().
    func (a *allocator) Free(b []byte) {
        if cap(b) == 0 || cap(b) >= maxalloc {
            return // ignore out-of-range slices
        }
        p := &a.pages[lg2(uint32(cap(b)))]
        p.cache = append(p.cache, b)
    }
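To see the recycling behavior in isolation, here is a small self-contained sketch. It repeats a condensed copy of the Alloc and Free logic from this article, assumes that lg2 is floor(log2(n)) (its body is not shown in the article), and adds two hypothetical helpers, newAllocator and sameArray, purely for the demonstration.

```go
package main

import (
	"fmt"
	"math/bits"
)

const maxalloc = 256 * 1024

type pagelist struct{ cache [][]byte }
type allocator struct{ pages []pagelist }

// lg2 is assumed to be floor(log2(n)); callers never pass 0.
func lg2(n uint32) uint32 { return uint32(bits.Len32(n) - 1) }

// newAllocator sizes the bucket table the same way the benchmark does.
func newAllocator() *allocator {
	return &allocator{pages: make([]pagelist, lg2(maxalloc-1)+1)}
}

// Alloc is a condensed copy of the article's Alloc.
func (a *allocator) Alloc(n int) []byte {
	if uint(n) >= maxalloc {
		panic("pool alloc: size too big")
	}
	if n == 0 {
		return nil
	}
	p := &a.pages[lg2(uint32(n))]
	var x []byte
	if l := len(p.cache) - 1; l >= 0 { // cache hit: pop the last page
		x = p.cache[l]
		p.cache = p.cache[:l]
	}
	if cap(x) < n { // cache miss, or the page found is too small
		x = make([]byte, n)
	}
	return x[:n]
}

// Free is a condensed copy of the article's Free.
func (a *allocator) Free(b []byte) {
	if cap(b) == 0 || cap(b) >= maxalloc {
		return // ignore out-of-range slices
	}
	p := &a.pages[lg2(uint32(cap(b)))]
	p.cache = append(p.cache, b)
}

// sameArray reports whether x and y start at the same backing array.
func sameArray(x, y []byte) bool {
	return cap(x) > 0 && cap(y) > 0 && &x[:1][0] == &y[:1][0]
}

func main() {
	a := newAllocator()

	b1 := a.Alloc(100) // bucket 6 holds sizes 64..127
	a.Free(b1)

	b2 := a.Alloc(90) // same bucket, and the cached page is big enough
	fmt.Println("90-byte request reused the page:", sameArray(b1, b2))
	a.Free(b2)

	b3 := a.Alloc(120) // same bucket, but the cached page is too small
	fmt.Println("120-byte request reused the page:", sameArray(b1, b3))
}
```

Note that, unlike sync.Pool, the allocator as written does no locking, so sharing one instance between goroutines would need a mutex or one allocator per goroutine.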





This allocator can now be plugged into the benchmark:

    func BenchmarkPoolAlloc(b *testing.B) {
        allocloop(b, &rcow, &allocator{make([]pagelist, lg2(maxalloc-1)+1)})
    }





After running the benchmark, we get:

BenchmarkPoolAlloc-4 20000 57815 ns/op 196 B/op 0 allocs/op





Compare this to our earlier result:

BenchmarkSyncAlloc-4 20000 93332 ns/op 4892 B/op 50 allocs/op





As you can see, this allocator completes the benchmark in about 62% of the time it took with sync.Pool (57815 ns/op versus 93332 ns/op). In addition, we were able to achieve zero allocations per benchmark iteration (allocs/op), which means that any allocations were amortized away and that the pool allocator is a good fit for this application as we seek to minimize unnecessary allocations.

Conclusion

In order to get a data path written in Go to go fast, you have to do this kind of work all over the place. At Igneous, this relentless focus on efficiency across our data path enables our technology to work at scale, and I hope that this article allows anyone with an interest in Go to understand this important technique.




