I used to think the performance overhead of goroutines and channels was basically negligible — especially when compared to things like IO — but a recent experience gave me a reality check.

I’ve been working on a toy relational database as the project for a course I’m taking. I started out by loading tables from CSV files, and eventually I needed to add a binary table format. Unfortunately, my initial attempt performed much worse than CSV:

```
$ ./csv_scan_benchmark -path table.csv
Done scanning 20000263 records after 15.69807191s

$ ./binary_scan_benchmark -path table.bt
Done scanning 20000263 records after 27.01220384s
```

Scanning the binary table format was almost twice as slow! I found this extremely counterintuitive, since the binary format was more compact and didn’t require any string conversion. Fortunately, I was able to investigate the issue using Go’s excellent tools for performance profiling.

The CPU profile for scanning a CSV table looked reasonable; most of the time was spent on IO related system calls:

CPU profile for scanning a CSV table

But the CPU profile for scanning a binary table looked extremely *un*reasonable; only a tiny fraction of the time was spent on IO!

CPU profile for scanning a binary table (initial attempt)

It turns out that this ridiculous CPU profile was caused by the way I used Go’s concurrency primitives. At the time, I thought decoupling production from consumption using goroutines and channels would be a clean way to structure the scanner code.

Creating a scanner launches a producer goroutine, which performs IO / decoding and writes to a result channel:

```go
func NewBinaryScan(path string) (*binaryScan, error) {
	...
	go s.producer()
	return s, nil
}

func (s *binaryScan) producer() {
	for {
		record, err := s.readAndDecodeRecord(s.bufferedReader)
		s.resultChan <- &result{record, err}
		if err != nil {
			return
		}
	}
}
```

The consumer goroutine, which retrieves results by repeatedly calling NextRecord, only has to read from the result channel:

```go
func (s *binaryScan) NextRecord() (Record, error) {
	result := <-s.resultChan
	return result.record, result.err
}
```

However, the CPU profile showed that with this approach, goroutines spent a lot of time blocked on channel operations, and the Go runtime spent a lot of resources on scheduling and on the concurrency primitives themselves.
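To get an intuition for where that time goes, here is a rough, illustrative comparison (not the author's benchmark): moving values through an unbuffered channel between two goroutines versus plain function calls. `decode` is a hypothetical stand-in for the real per-record work; exact timings will vary by machine, but every unbuffered send/receive is a synchronization point and often a goroutine park/unpark in the scheduler.

```go
package main

import (
	"fmt"
	"time"
)

// decode stands in for the real per-record decoding work; it is a
// placeholder, not the author's readAndDecodeRecord.
func decode(i int) int { return i * 2 }

// sumDirect consumes n values with plain function calls, the way the
// "consumer does everything" structure works.
func sumDirect(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += decode(i)
	}
	return sum
}

// sumViaChannel moves the same n values through an unbuffered channel,
// so every value costs a synchronized send/receive pair.
func sumViaChannel(n int) int {
	ch := make(chan int)
	go func() {
		for i := 0; i < n; i++ {
			ch <- decode(i)
		}
		close(ch)
	}()
	sum := 0
	for v := range ch {
		sum += v
	}
	return sum
}

func main() {
	const n = 1_000_000

	start := time.Now()
	a := sumDirect(n)
	direct := time.Since(start)

	start = time.Now()
	b := sumViaChannel(n)
	viaChannel := time.Since(start)

	fmt.Printf("direct: %v  channel: %v  equal results: %v\n",
		direct, viaChannel, a == b)
}
```

On my understanding of the runtime, the channel version should come out well over an order of magnitude slower for work this cheap, which matches the shape of the profile above.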

Rewriting the binary table scanner to do everything from the consumer goroutine was very straightforward:

```go
func NewBinaryScan(path string) (*binaryScan, error) {
	...
	// No more producer.
	return s, nil
}

func (s *binaryScan) NextRecord() (Record, error) {
	return s.readAndDecodeRecord(s.bufferedReader)
}
```

However, this small change had a dramatic effect on performance. Here’s the updated CPU profile, which looks much more sensible than before:

CPU profile for scanning a binary table (optimized)

And here’s the updated benchmark result:

```
$ ./binary_scan_benchmark -path ratings.bt
Done scanning 20000263 records after 8.160765247s
```

Almost 2x faster than CSV, and over 3x faster than the initial attempt. Now that’s more like it!

I still think proper use of goroutines and channels can result in clean and well-structured code, and that in most cases, they won’t turn out to be the root cause of performance issues. But this experience was a good reminder that even Go’s awesome concurrency primitives aren’t free.

[EDIT]

Thanks to Raghav for suggesting more cases to benchmark!

- Adding a select with a context.Context and a short timeout to the producer: 59.4s
- Adding a select with a context.Context and a short timeout to the consumer: 58.0s
- Using a buffered channel with size 1: 23.5s
- Using a buffered channel with size 1000: 17.4s
- Removing the select on a done channel (the done channel wasn’t included in the code snippets above): 19.9s
- Removing the select on a done channel AND using a buffered channel with size 1000: 14.3s

It looks like there’s another lesson to be learned here: channel buffer size and complexity of select statements can both have a big effect on performance.
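For illustration, here is roughly what the fastest producer/consumer variant from the list above looks like: a generously buffered result channel and a plain send with no select. This is a minimal self-contained sketch, not the original scanner; `record`, `result`, and the slice-backed producer are hypothetical stand-ins for the real file-reading code.

```go
package main

import (
	"fmt"
	"io"
)

// record and result are simplified stand-ins for the scanner's types.
type record struct{ id int }

type result struct {
	rec record
	err error
}

type scanner struct {
	resultChan chan *result
}

// newScanner launches a producer with a buffered channel, so sends
// rarely block and usually avoid parking the goroutine.
func newScanner(records []record) *scanner {
	s := &scanner{resultChan: make(chan *result, 1000)}
	go func() {
		for _, r := range records {
			// Plain send: no select, no done channel.
			s.resultChan <- &result{rec: r}
		}
		s.resultChan <- &result{err: io.EOF}
	}()
	return s
}

func (s *scanner) NextRecord() (record, error) {
	res := <-s.resultChan
	return res.rec, res.err
}

func main() {
	s := newScanner([]record{{1}, {2}, {3}})
	for {
		r, err := s.NextRecord()
		if err != nil {
			break
		}
		fmt.Println(r.id)
	}
}
```

The buffer lets the producer run ahead of the consumer, and dropping the select keeps each channel operation on its cheapest code path.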

[EDIT 2]

Thanks to Stuart Carnie for the suggestion to send batches of records over the channel instead of a single record at a time! Here are the benchmark results I got for various batch sizes:

- 1: 28.83s
- 10: 12.36s
- 100: 8.92s
- 1,000: 8.68s
- 10,000: 8.74s
- 100,000: 9.32s

It turns out that “reducing the number of channel write operations by 3 orders of magnitude”, as Stuart pointed out, also has a significant effect here.
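A minimal sketch of the batching idea, again with hypothetical stand-in types rather than the original scanner: the producer sends whole slices of records, and the consumer only touches the channel when its current batch runs out.

```go
package main

import "fmt"

// record is a simplified stand-in for the scanner's row type.
type record struct{ id int }

type batchScanner struct {
	ch  chan []record
	cur []record
	pos int
}

// newBatchScanner's producer sends batchSize records per channel write,
// cutting the number of channel operations by a factor of batchSize.
func newBatchScanner(records []record, batchSize int) *batchScanner {
	s := &batchScanner{ch: make(chan []record, 1)}
	go func() {
		for start := 0; start < len(records); start += batchSize {
			end := start + batchSize
			if end > len(records) {
				end = len(records)
			}
			s.ch <- records[start:end]
		}
		close(s.ch)
	}()
	return s
}

// NextRecord serves records from the current batch and only performs a
// channel receive when the batch is exhausted.
func (s *batchScanner) NextRecord() (record, bool) {
	for s.pos >= len(s.cur) {
		batch, ok := <-s.ch
		if !ok {
			return record{}, false
		}
		s.cur, s.pos = batch, 0
	}
	r := s.cur[s.pos]
	s.pos++
	return r, true
}

func main() {
	s := newBatchScanner([]record{{1}, {2}, {3}, {4}, {5}}, 2)
	count := 0
	for {
		if _, ok := s.NextRecord(); !ok {
			break
		}
		count++
	}
	fmt.Println("scanned", count, "records") // scanned 5 records
}
```

With a batch size of 1,000, the per-record channel cost is amortized over a thousand records, which is consistent with the sweet spot in the numbers above.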