A simple start

When you work in this industry for a while, if you're introspective, you can discover what type of programmer you are. You might be a programmer obsessed with performance, or one always hunting for clever tricks.

I am neither of those. While I do like to find new techniques or simpler ways to do something, I generally prefer to steal from others who are much smarter than I am. I dislike tracking down performance problems in systems unless it is necessary. I tend to be interested in delivering a product that functions with reasonable performance characteristics. I do not analyze assembly output in order to add custom assembly that takes advantage of some specific Intel instruction that does xyz faster (but thank God for the few people I've met who do). I rarely think that is necessary.

So, for reasons unknown to me, I have ended up going down a rabbit hole of exploring some Go concepts in order to make something faster and use less memory. Let me explain.

Recently I have had a need for an unbounded queue. Buffered channels are my go-to for a standard queue. And they are nice when you are okay with either:

blocking when the buffer is full

dropping an update when the buffer is full

Neither of those situations would work in this case. In reality, the queue's use would have a bound, but I would not know that bound ahead of time. I could not drop an update, deliver an update out of order, or block on an update. However, I could live with higher memory use while we built up to whatever the bound turned out to be.

What I wanted was something that:

was FIFO with a single reader

pushed new items fast

read items fast

grew when the buffer became full

shrank when buffer space was not required

was unbounded

provided methods that popped if possible or returned !ok

provided methods that blocked until they could pop

did not eat up the CPU while one of the above methods was waiting

used no goroutines

solved a generic use case, so []interface{}

Note on being generic: Yeah, I hate []interface{}. Empty interface{} conversion is a time suck, and I hate losing type checking at compile time. However, I'm anti-generics given the proposals I've seen: they are just ugly, and I don't want to spend 6 weeks explaining them to newbies. I'm fairly convinced that, with the brain power behind trying to come up with generics for Go, if there was a good answer it would have happened. They will either hurt compile times or make the code more difficult to understand. With that said, this code is probably ripe for Genny or some other pre-compile code generator to save you.

To test implementations, I eventually decided on an interface I wanted for testing:

type Buffer interface {
	// Push an item on to the buffer. If we have reached maximum
	// buffer size, return false.
	Push(item interface{}) (ok bool)

	// Force is like Push() except it will block until we can push onto the buffer.
	Force(item interface{})

	// Pop retrieves a value from the buffer. It returns false if the buffer
	// is empty.
	Pop() (val interface{}, ok bool)

	// Pull is like Pop() except it will block until it can retrieve a value.
	Pull() interface{}
}

This isn't the most Go-ish interface. Pull(), for example, could instead return a channel that keeps yielding values until another method like Close() was called. But I wanted to avoid spinning off goroutines. My final version might include these.

But before I go any further: other than Attempt #1, everything I am doing here is wrong. This is a near-perfect example of over-optimizing. When writing an application, you should concentrate on:

Application structure

Nice interfaces to data structures that can have the internals switched out

Good data structures

Performance optimization should come only after testing, tracing, etc. reveal bottlenecks that are unacceptable in your application. Good data structures lead to more optimized algorithms if you need them. And if you abstract your interfaces cleanly, you can always optimize behind the scenes.

For reference, you can find the various code that I show or don't show here:

Different versions of the FIFO queue:

http://github.com/johnsiilver/golib/queue/fifo/unbounded/experimental/

The final version:

http://github.com/johnsiilver/golib/queue/fifo/unbounded/

Attempt 1: Not Being Clever

My first crack at this was to avoid being clever at all and use Go in the most simple way possible.

This involved:

using a slice to hold the queue

using a slice's sliding window over the array to dequeue entries

using a mutex when enqueuing or dequeuing to the slice

using a channel to avoid spinning goroutines when doing blocking calls

Here's what the first attempt looked like:

package unbounded

import "sync"

// Buffer provides a first in first out non-blocking (but not lock-free) Buffer.
type Buffer struct {
	mu     sync.Mutex
	nextCh chan interface{}
	data   []interface{}
}

// New is the constructor for Buffer.
func New() *Buffer {
	return &Buffer{nextCh: make(chan interface{}, 1)}
}

// Push adds an item to the Buffer.
func (q *Buffer) Push(item interface{}) {
	q.mu.Lock()
	q.data = append(q.data, item)
	q.shift()
	q.mu.Unlock()
}

// Force will push the item onto the buffer and blocks until the operation completes.
func (q *Buffer) Force(item interface{}) {
	q.Push(item)
}

// shift will move the next available entry from our queue into the channel to
// be returned to the user. Must be locked by a calling function.
func (q *Buffer) shift() {
	if len(q.data) > 0 {
		select {
		case q.nextCh <- q.data[0]:
			q.data = q.data[1:]
		default:
		}
	}
}

// Pop returns the next value in the buffer. It returns ok == false if there
// is no value in the buffer.
func (q *Buffer) Pop() (val interface{}, ok bool) {
	select {
	case item := <-q.nextCh:
		q.mu.Lock()
		q.shift()
		q.mu.Unlock()
		return item, true
	default:
		return nil, false
	}
}

// Pull pulls an item off the buffer. It blocks until a value is found.
func (q *Buffer) Pull() interface{} {
	item := <-q.nextCh
	q.mu.Lock()
	q.shift()
	q.mu.Unlock()
	return item
}

This one worked well enough. But it got me thinking about other ways to do this. I have no idea why; I should have just stopped. I had a good data structure for what I needed, and I could optimize later if necessary. But for some reason, I decided I wanted to waste a day, which is very unlike me. Sorry, time to dive headfirst into the rabbit hole.

First, for whatever crazy reason, I dislike using a Mutex. In recent times I've begun using atomic.Value and other atomic operations to protect constantly changing values. I've benchmarked these and found they provide significant speedups if you hammer the variable with changes. That got me thinking about just using atomic.CompareAndSwap* for my Mutex needs.

Second, there are a lot of array copies happening in this version. Array copies are very fast¹, but I don't like doing quite so many (this irked me, though it has no bearing on the efficiency of the method). Wouldn't some form of circular buffer be faster? Of course, it would need to be a circular buffer without a set size.

¹Note: I saw a talk in the last year from someone showing that array copies were faster than heap allocations and pointer swaps by some crazy multiple. I remember his evidence being fairly good, and for this reason I concentrated on using either channels or slices for the Buffer. However, there are two old sayings from Google that apply here:

In God we trust, everyone else must bring the data

Trust but verify

I'm sure these originated somewhere else, for anyone who is really nitpicky. But Google is the first place I heard them said.

Go wants you to be clever, but not too clever

So Go isn't the language for people who want to be clever with optimizations. C is the language for clever optimizations: no runtime to contend with, crazy pointer arithmetic, and direct control.

Go really wants you to be clever with your data structures, APIs, etc. Make it easy for your users and maintainers. There is an entire philosophy in there, and after you swim in its waters for a while you do find it refreshing. At first, it just feels cold.

With that said, when you start looking at the internals of low-level tasks, you will often find that the standard library has access to code your programs don't. For example, if you want to spin as efficiently as the Mutex implementation does, you can't: Mutex has access to runtime_doSpin(), which your program doesn't.

There are other memory barrier tricks that your code can't take advantage of unless you want to use Cgo, which for the purpose of this article is cheating. And I'm sure I could hack my way into these things, but that's not portable. One of Go's strengths is that until you start using unsafe or syscall, your code really is portable (while C just dreams it is).

It is important to understand this, because if you try to write a Mutex that is more optimized for your use case, you are not going to be as efficient. You can attempt to do some of this with runtime.Gosched(), but you are rolling that rock uphill. Apparently, I like rolling rocks.

Trying to be clever

My first attempt at being clever was to use channels as a circular buffer. The circular buffer would be protected by a simple atomic.CompareAndSwapInt32(). If I couldn't insert a value into the channel, I would simply create a new channel with double the capacity, copy values from the old channel to the new channel, insert the new value, and swap the old channel for the new one. Pretty much what append() does for slices.

My reasoning was:

A lot of work has been done in channels to prevent the use of locks when the channel is not blocked

select has some very nice properties that can be used

atomic.CompareAndSwapInt32 is much faster than sync.Mutex

I knew that there was some risk:

A Mutex makes certain guarantees that a caller will not be starved; CompareAndSwapInt32() by itself does not.

I needed to write my own sleep code that acted like a spinlock at first and eventually slept up to some maximum interval. Otherwise, when idle, these locking loops would eat the CPUs alive.

The sleep code took a while to get to a point where it neither killed performance nor choked my CPU for too long.

package spin

import (
	"runtime"
	"time"
)

// Sleeper allows sleeping for an increasing time period up to 1 second
// after giving up a goroutine 2^16 times.
// This is not thread-safe and should be thrown away once the loop that calls it
// is able to perform its function.
type Sleeper struct {
	loop uint16
	at   time.Duration
}

// Sleep at minimum allows another goroutine to be scheduled and after 2^16
// calls will begin to sleep from 1 nanosecond to 1 second, with each
// call raising the sleep time by a multiple of 10.
func (s *Sleeper) Sleep() {
	const maxSleep = 1 * time.Second

	if s.loop < 65535 {
		runtime.Gosched()
		s.loop++
		return
	}
	if s.at == 0 {
		s.at = 1 * time.Nanosecond
	}
	time.Sleep(s.at)
	if s.at < maxSleep {
		s.at = s.at * 10
		if s.at > maxSleep {
			s.at = maxSleep
		}
	}
}

The first custom version came out like this:

package custom_sleep_atomic

import (
	"runtime"
	"sync/atomic"
	"time"

	"github.com/johnsiilver/golib/queue/fifo/internal/spin"
)

// Unbounded indicates that the Queue should have no memory bounds.
const Unbounded = -1

const (
	unlocked = int32(0)
	locked   = int32(1)
)

// Buffer provides a FIFO circular buffer that can grow and shrink as required.
// Buffer must not be copied after creation (which means use a pointer if
// passing between functions).
type Buffer struct {
	// Max is the maximum size the Buffer can grow to. Use Unbounded if
	// you wish to grow the buffer to any size. By default, this will grow to 1k items.
	Max int

	lockInt    int32
	lastShrink time.Time
	data       chan interface{}
}

// Push pushes an item into the circular buffer. "ok" indicates if this happens.
func (c *Buffer) Push(item interface{}) (ok bool) {
	c.lock()
	defer c.unlock()

	select {
	case c.data <- item:
		return true
	default:
	}

	// The buffer was too small, grow the buffer and then insert.
	c.grow()

	select {
	case c.data <- item:
		return true
	default:
		return false
	}
}

// Force will push the item onto the buffer and blocks until the operation completes.
func (c *Buffer) Force(item interface{}) {
	sleeper := spin.Sleeper{}
	for {
		if c.Push(item) {
			return
		}
		sleeper.Sleep()
	}
}

// Pop returns the next value off the circular buffer. If the buffer is empty
// ok will be false.
func (c *Buffer) Pop() (value interface{}, ok bool) {
	c.lock()
	defer c.unlock()

	select {
	case v := <-c.data:
		c.shrink()
		return v, true
	default:
		return nil, false
	}
}

// Pull pulls an item off the circular buffer. It blocks until a value is found.
func (c *Buffer) Pull() interface{} {
	sleeper := spin.Sleeper{}
	for {
		v, ok := c.Pop()
		if !ok {
			sleeper.Sleep()
			continue
		}
		return v
	}
}

// grow will double the size of the internal buffer until we hit the max size
// if the buffer is currently full.
// Note: grow must be protected with .lock/.unlock.
func (c *Buffer) grow() {
	if c.Max == 0 {
		c.Max = 1000
	}
	if cap(c.data) == c.Max {
		return
	}

	if len(c.data) == cap(c.data) {
		if c.Max == Unbounded {
			ch := make(chan interface{}, cap(c.data)*2)
			c.copy(ch, c.data)
			c.data = ch
			return
		}
		if cap(c.data) < c.Max {
			size := cap(c.data) * 2
			if size == 0 {
				size = 8
			}
			if size > c.Max {
				size = c.Max
			}
			ch := make(chan interface{}, size)
			c.copy(ch, c.data)
			c.data = ch
			return
		}
	}
}

// shrink shrinks the size of the internal buffer if the length of the buffer
// is < 50% of the capacity. The reduction will be by 25% but will produce
// a buffer size of no less than 8 slots.
// Note: shrink must be protected with .lock/.unlock.
func (c *Buffer) shrink() {
	if cap(c.data) == 8 {
		return
	}
	// Only allow a shrink once every 10 minutes.
	if time.Now().Sub(c.lastShrink) < 10*time.Minute {
		return
	}

	// If the current unused capacity is > 50% of the buffer, reduce it by 25%.
	if (cap(c.data) - len(c.data)) > (cap(c.data) / 2) {
		size := int(float64(cap(c.data)) * .75)
		if size < 8 {
			size = 8
		}
		ch := make(chan interface{}, size)
		c.copy(ch, c.data)
		c.data = ch
		c.lastShrink = time.Now()
	}
}

func (c *Buffer) copy(dst chan<- interface{}, src chan interface{}) {
	if (cap(dst) - len(dst)) < (cap(src) - len(src)) {
		panic("internal error: Buffer.copy() cannot be called when dst is smaller than src")
	}
	close(src)
	for v := range src {
		dst <- v
	}
}

func (c *Buffer) lock() {
	for {
		if atomic.CompareAndSwapInt32(&c.lockInt, unlocked, locked) {
			return
		}
		runtime.Gosched()
	}
}

func (c *Buffer) unlock() {
	for {
		if atomic.CompareAndSwapInt32(&c.lockInt, locked, unlocked) {
			return
		}
		runtime.Gosched()
	}
}

This one had some nice properties. It had a useful zero value, which I tried to preserve in the other implementations instead of requiring a constructor (not always possible).

It also only shrank at intervals, reducing the need for copies.

The Proof Is in the Pudding

Now I have two versions, but how does each stack up against the other? Oh, the fun of benchmarking!

bench.go

package bench

import (
	"sync"
	"testing"
)

type unbounded interface {
	Force(item interface{})
	Pull() interface{}
}

func singleRun(bench *testing.B, n func() unbounded, items, senders, receivers int) {
	for i := 0; i < bench.N; i++ {
		bench.StopTimer()

		b := n()
		sendCh := make(chan int, items)
		wg := sync.WaitGroup{}
		wg.Add(items)

		// Setup senders.
		for i := 0; i < senders; i++ {
			go func() {
				for v := range sendCh {
					b.Force(v)
				}
			}()
		}

		// Setup receivers.
		for i := 0; i < receivers; i++ {
			go func() {
				for {
					b.Pull()
					wg.Done()
				}
			}()
		}

		bench.StartTimer()

		// Send to Buffer (which the receivers will read from).
		go func() {
			for i := 0; i < items; i++ {
				sendCh <- i
			}
			close(sendCh)
		}()

		wg.Wait()
	}
}

bench_test.go

package bench

import (
	"fmt"
	"testing"

	"github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/custom_locks"
	"github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/custom_sleep_atomic"
	"github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/sliding_slice_atomic"
	"github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/sliding_slice_locks"
)

func BenchmarkUnboundedQueues(b *testing.B) {
	runs := []struct {
		items, senders, receivers int
	}{
		{100000, 1, 1},
		{100000, 10, 1},
		{100000, 1, 10},
		{100000, 10, 10},
		{100000, 100, 10},
		{100000, 10, 100},
	}

	benchmarks := []struct {
		name    string
		newFunc func() unbounded
	}{
		{
			"sliding_slice_atomic",
			func() unbounded { return sliding_slice_atomic.New() },
		},
		{
			"sliding_slice_locks",
			func() unbounded { return sliding_slice_locks.New() },
		},
		{
			"custom_sleep_atomic",
			func() unbounded { return &custom_sleep_atomic.Buffer{} },
		},
		{
			"custom_locks",
			func() unbounded { return &custom_locks.Buffer{} },
		},
	}

	for _, run := range runs {
		for _, benchmark := range benchmarks {
			b.Run(
				benchmark.name+fmt.Sprintf("-items: %d, senders: %d, receivers: %d", run.items, run.senders, run.receivers),
				func(b *testing.B) {
					singleRun(b, benchmark.newFunc, run.items, run.senders, run.receivers)
				},
			)
		}
	}
}

This benchmark was set up to test a few scenarios:

Single sender and receiver

10 senders and 1 receiver

1 sender and 10 receivers

10 senders and 10 receivers

100 senders and 10 receivers

10 senders and 100 receivers

This gives a certain static scaling of the number of senders and receivers to see how each scales for different scenarios. I want a good generic solution that handles each of these reasonably.

In this initial test, I also included versions of both that use atomic.CompareAndSwapInt32() and sync.Mutex. Here are the initial results:

Note: * marks the fastest version in each table.

Version                #Items  #Senders  #Receivers  Iterations  ns/op
sliding_slice_atomic   100000  1         1           30          50688582
sliding_slice_locks*   100000  1         1           30          41097381
custom_sleep_atomic    100000  1         1           20          60742172
custom_locks           100000  1         1           30          61267323

Version                #Items  #Senders  #Receivers  Iterations  ns/op
sliding_slice_atomic   100000  10        1           30          46452147
sliding_slice_locks*   100000  10        1           50          37041679
custom_sleep_atomic    100000  10        1           10          106059804
custom_locks           100000  10        1           20          116124583

Version                #Items  #Senders  #Receivers  Iterations  ns/op
sliding_slice_atomic   100000  100       1           30          59664733
sliding_slice_locks*   100000  100       1           30          39991409
custom_sleep_atomic    100000  100       1           1           1048544140
custom_locks           100000  100       1           2           699103394

Version                #Items  #Senders  #Receivers  Iterations  ns/op
sliding_slice_atomic   100000  10        10          30          50459215
sliding_slice_locks*   100000  10        10          30          42415787
custom_sleep_atomic    100000  10        10          20          76158348
custom_locks           100000  10        10          20          74131977

Version                #Items  #Senders  #Receivers  Iterations  ns/op
sliding_slice_atomic   100000  10        100         20          82940141
sliding_slice_locks*   100000  10        100         30          45741994
custom_sleep_atomic    100000  10        100         5           221296458
custom_locks           100000  10        100         20          97246758

Well....huh...

First takeaways:

atomics plus custom sleeping are not beating the standard Mutex

The array copies are most likely beating the custom channel copies

Using select and channels, while providing some advantages, is probably hurting us with a lot of extra locking and other magic

There was some profiling and such, but I decided to go on and test a bunch of other versions.

If At First You Don't Succeed, Fail Some More??

It would be a lie to say I'm covering all the steps I took here. It was a winding path of outright failure and tuning. There were other tests I wrote to prove that CPU usage would drop after being stuck pushing or pulling, and other things I'm not going to cover. The rabbit hole seems to keep going.

In the end, I decided to benchmark against a few more methods:

Versions of each that had no sleep. These would eat CPU, but I wanted to see if it made a difference when using atomic for locking

I wanted to run against a reference type that used a channel with a buffer of 100 and was not unbounded. Resizing to support unbounded is expensive, but how expensive?

What if I used a more CS-standard queue: records with pointers to the next entry in the queue (I called this the heap version)? According to the talk I saw, this should be much slower, but how much slower?

Well, let's have a look:

Note: * marks the fastest version, discounting the reference.

Note: ** marks the reference when it was the fastest overall.

Version                #Items  #Senders  #Receivers  Iterations  ns/op
reference**            100000  1         1           100         20660880
sliding_slice_atomic   100000  1         1           30          49620908
sliding_slice_locks    100000  1         1           30          38928647
custom_sleep_atomic    100000  1         1           30          60764994
custom_nosleep         100000  1         1           20          61262504
custom_locks           100000  1         1           20          63590469
heap_lock              100000  1         1           50          27832756
heap_atomic*           100000  1         1           100         25798467

Version                #Items  #Senders  #Receivers  Iterations  ns/op
reference              100000  10        1           50          31719731
sliding_slice_atomic   100000  10        1           20          99060012
sliding_slice_locks    100000  10        1           30          78695899
custom_sleep_atomic    100000  10        1           10          119628479
custom_nosleep         100000  10        1           10          105837735
custom_locks           100000  10        1           10          111282187
heap_lock              100000  10        1           50          28885042
heap_atomic*           100000  10        1           50          28278157

Version                #Items  #Senders  #Receivers  Iterations  ns/op
reference              100000  100       1           50          29424432
sliding_slice_atomic   100000  100       1           30          45287610
sliding_slice_locks    100000  100       1           30          49388002
custom_sleep_atomic    100000  100       1           2           1014270608
custom_nosleep         100000  100       1           1           1018922777
custom_locks           100000  100       1           3           478910481
heap_lock*             100000  100       1           100         26944540
heap_atomic            100000  100       1           50          28301942

Version                #Items  #Senders  #Receivers  Iterations  ns/op
reference              100000  10        10          50          29539230
sliding_slice_atomic   100000  10        10          30          47400717
sliding_slice_locks    100000  10        10          30          44891524
custom_sleep_atomic    100000  10        10          20          73116554
custom_nosleep         100000  10        10          20          74409189
custom_locks           100000  10        10          20          70609204
heap_lock*             100000  10        10          50          23445934
heap_atomic            100000  10        10          50          26386795

Version                #Items  #Senders  #Receivers  Iterations  ns/op
reference**            100000  10        100         50          24754023
sliding_slice_atomic   100000  10        100         20          58304944
sliding_slice_locks    100000  10        100         50          46103498
custom_sleep_atomic    100000  10        100         5           203424993
custom_nosleep         100000  10        100         5           211169360
custom_locks           100000  10        100         20          89729445
heap_lock*             100000  10        100         50          25450734
heap_atomic            100000  10        100         100         34094147

Well....huh...., again!

It is not surprising that the reference is faster in several runs. But what surprised me was the standard pointer-based queue: heap_lock and heap_atomic took this test by storm.

heap_atomic and heap_lock being so close in most benchmarks leads me to believe the custom.* versions are slower because of lock contention from the combination of channels, locks (or atomics), and select. I could go work on that, but I don't think it is worth my time.

Now, I'm not sure why the speaker's assessment didn't hold up here. I tend to think that this kind of thing is my fault. It could be:

I misheard what he was saying

I'm optimizing in some way

The compiler has changed

The benchmark is broken

He was just dead wrong

No matter which one it is, heap_lock is where I am going to concentrate my optimization effort.

But why not heap_atomic? While it did win some benchmarks, simpler is better. There isn't enough difference to make me prefer atomic over Mutex, and Mutex makes guarantees about goroutine starvation that atomic does not.

What If I Made the Heap Version Circular?

The heap version uses a simple queue. It is memory efficient and apparently fast. But I'm still haunted by allocating extra memory when I don't need to.

What if instead of throwing away entries when dequeuing I just move a pointer in a circular buffer? That entry could be reused without the need for allocating a new entry. And if I run out of room I can always insert more entries at the back.

This adds some overhead: checking whether things like the front and tail pointers are equal, to know if we need to allocate.

Here's a first attempt at it with 100k items:

Version        #Items  #Senders  #Receivers  Iterations  ns/op
heap_lock      100000  1         1           50          26110856
heap_circular  100000  1         1           50          26823725
heap_lock      100000  10        1           50          29731315
heap_circular  100000  10        1           30          34592901
heap_lock      100000  100       1           50          30892013
heap_circular  100000  100       1           30          40746096
heap_lock      100000  10        10          50          27873525
heap_circular  100000  10        10          50          35315318
heap_lock      100000  10        100         20          57328038
heap_circular  100000  10        100         30          60764377

Well, it looks like the simple queue is still winning. Not by a huge amount, but enough.

I wonder if we'd see better results with 1M items:

Version        #Items   #Senders  #Receivers  Iterations  ns/op
heap_lock      1000000  1         1           5           275110923
heap_circular  1000000  1         1           5           251509839
heap_lock      1000000  10        1           5           316068383
heap_circular  1000000  10        1           3           367007666
heap_lock      1000000  100       1           3           489702046
heap_circular  1000000  100       1           3           397814725
heap_lock      1000000  10        10          3           418825312
heap_circular  1000000  10        10          3           404838549
heap_lock      1000000  10        100         2           579100509
heap_circular  1000000  10        100         2           630555319

How about 10M items:

Version        #Items    #Senders  #Receivers  Iterations  ns/op
heap_lock      10000000  1         1           1           2719649748
heap_circular  10000000  1         1           1           2896301313
heap_lock      10000000  10        1           1           3239424103
heap_circular  10000000  10        1           1           3812422117
heap_lock      10000000  100       1           1           3314603774
heap_circular  10000000  100       1           1           4393173577
heap_lock      10000000  10        10          1           4271595984
heap_circular  10000000  10        10          1           4358516405
heap_lock      10000000  10        100         1           6051491583
heap_circular  10000000  10        100         1           5518405633

Now, my growing circular buffer is probably overly complicated, which costs some time. And it is close to heap_lock for the common cases (1:1, 10:10, and 10:100 senders to receivers).

But looking at the trace tool, the allocations are just minor time sinks; GC rarely kicks in because of them on heap_lock. That makes heap_lock the overall winner in my view. The code is easy to follow and fast.

I think heap_circular has merit, but with a slightly different interface, and only if the data being stored is a concrete type, not interface{}. If the entries held large data, such as []byte, we could use it as a combination FIFO queue and free list. You could have a method that returns an empty value: it would pull a dequeued entry, or if none was free, allocate a new one. That could then be used to enqueue without a large allocation. But that is a very specific use case, which I am not going to test. And of course, there is the very good argument for keeping a free list and this queue separate.

I am absolutely sure that many of these could be made faster with some trivial work (but I've wasted enough time on this thought experiment). And there were all kinds of things I didn't try (using []unsafe.Pointer with CompareAndSwapPointer(), or a slice of slices, or ...).

But now I am absolutely sure that:

I should have used the one in Attempt #1

I am not a computer scientist

If you read this far, I'm sorry to have dragged you down the rabbit hole with me. Misery occasionally loves company.