In a previous post I investigated how long it takes to start a goroutine to do some work and get a result back from it. To set a baseline I compared this to doing the same thing in C with pthreads. Perhaps unsurprisingly, this proved that goroutines are indeed lightweight as claimed, taking 386ns to start a goroutine and return a result, around 50 times less than the pthread baseline.

Now it’s time to take things further. We’re going to make some thread and goroutine pools. I’m sure it will be wild.

I’m not trying to burst your balloon

A common pattern in the world of threads is to start a pool of worker threads when the application starts. Work is sent to them via a queue, so the application can make use of additional CPU cores without paying the overhead of starting a new thread for each piece of work.
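To make the pattern concrete, here's a minimal sketch of a goroutine pool in Go (the function name `poolSum` and the squaring "work" are just illustrative, not from the benchmarks below):

```go
package main

import (
	"fmt"
	"sync"
)

// poolSum starts nWorkers goroutines once, feeds them nJobs jobs through
// a buffered channel, and returns the sum of the squared results.
func poolSum(nWorkers, nJobs int) int {
	jobs := make(chan int, nJobs)
	results := make(chan int, nJobs)

	var wg sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker drains jobs until the channel is closed.
			for n := range jobs {
				results <- n * n // stand-in for real work
			}
		}()
	}

	for n := 1; n <= nJobs; n++ {
		jobs <- n
	}
	close(jobs)
	wg.Wait()
	close(results)

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(poolSum(8, 100)) // sum of squares 1..100 → 338350
}
```

The workers start once, up front; after that, dispatching a job is just a channel send.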

So, I’ll build some new tests that use this pattern in Go, and in C with pthreads. We’ll launch a pool of 8 threads or 8 goroutines and send 1000 work requests via a channel (or a home-made channel-like object in the C case), with a buffer of 1000. For each work request the worker performs 1000 additions, which I’ve established takes ~2µs in C and ~300ns in Go. (In the C case I’ve turned optimisations off, because otherwise the compiler spots that the work is trivial, elides it, and it takes no time at all. The Go compiler isn’t quite that clever, but does manage some optimisation.)
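For concreteness, the per-request work is just a tight add loop, something like this (the exact body is my sketch, not the original benchmark code):

```go
package main

import "fmt"

// doWork performs n additions; with n = 1000 this is roughly the
// ~300ns unit of work described above (exact body is an assumption).
func doWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i
	}
	return sum
}

func main() {
	fmt.Println(doWork(1000)) // 0 + 1 + ... + 999 = 499500
}
```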

Here’s the C benchmark. The faux-channel implementation is home grown and available here.

Quite pleased with myself for this as I’ve not written C for quite some time

"work_on_1_thread_pool_1000", 100 iterations in 656737600ns, 6.567 ms per iteration "work_on_8_thread_pool_1000", 100 iterations in 1126973740ns, 11.270 ms per iteration

Our 1000 work items take 6ms using 1 thread, or 11ms using 8 threads. We’d expect the 1000 items of work to take ~2µs × 1000 ≈ 2ms in the single-threaded case, so there’s 6 − 2 = 4ms of overhead in the single thread case, or 4µs of overhead per scheduling attempt.

The overhead is worse in the 8 thread case, so much so that this case is actually slower overall. If we split the 2ms of work evenly across the 4 cores (my CPU has 4 cores with hyper-threading), we’d expect it to take 500µs, so the overhead of scheduling the work like this is pretty much the entire 11ms, or 11µs per scheduling attempt.

(I should point out that if I increase the work done per schedule, the 8 thread case does eventually get faster than the 1 thread case, as you’d expect once the scheduling overhead becomes a less significant proportion of the workload.)

Here’s the Go benchmark.
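The benchmark code itself isn’t inlined here, but its shape is roughly the following self-contained sketch (names and structure are my assumptions; the real version uses Go’s `testing` benchmark framework, which is where the `ns/op` figures below come from):

```go
package main

import (
	"fmt"
	"time"
)

// doWork performs 1000 additions, the unit of work described above.
func doWork() int {
	sum := 0
	for i := 0; i < 1000; i++ {
		sum += i
	}
	return sum
}

// runPool sends `items` work requests over a buffered channel to a pool
// of goroutines, then waits for every request to be processed.
func runPool(goroutines, items, buf int) int {
	jobs := make(chan struct{}, buf)
	done := make(chan int, goroutines)

	for g := 0; g < goroutines; g++ {
		go func() {
			total := 0
			for range jobs {
				total += doWork()
			}
			done <- total
		}()
	}

	for i := 0; i < items; i++ {
		jobs <- struct{}{}
	}
	close(jobs)

	total := 0
	for g := 0; g < goroutines; g++ {
		total += <-done
	}
	return total
}

func main() {
	for _, g := range []int{1, 8} {
		start := time.Now()
		runPool(g, 1000, 1000)
		fmt.Printf("goroutines=%d,buf=1000: %v for 1000 items\n", g, time.Since(start))
	}
}
```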

BenchmarkGoroutineChannel/goroutines=1,buf=1000-8    3000    548624 ns/op
BenchmarkGoroutineChannel/goroutines=8,buf=1000-8    5000    272580 ns/op

1000 work items take 270µs using 8 goroutines (and GOMAXPROCS=8), and 550µs with 1 goroutine. We’d expect the 1000 items of work to take ~300ns × 1000 ≈ 300µs in the single goroutine case, so there’s about 550 − 300 = 250µs of scheduling overhead in the single-goroutine case, or about 250ns per schedule.

In the 8 goroutine case, if the work is split evenly across 4 cores, we’d expect it to take 300/4 = 75µs, so the overhead is 270–75 = 195µs, or 195ns per schedule.

So it looks like the goroutine scheduler is quite a bit faster at scheduling work across CPU cores than my attempt at using pthreads in C. I suspect this is largely due to Go’s work-stealing scheduler aggressively avoiding context switches. This means you can efficiently schedule much smaller work items in Go using the base language primitives than is reasonable with a home-grown thread pool using pthreads in C. Well done Go’s work-stealing scheduler!

[I feel I should point out that it is of course possible to find or implement a Go-like scheduler and coroutines in C, C++ or other languages, and surely possible to meet or exceed Go’s performance given enough cleverness & typing. The point here is not to bash C or threads, but to establish the performance characteristics of goroutines, and to put them in context for people familiar with threads]