Or: How adding goroutines can keep your CPU busy shuffling things around.

Our starting point is a simple serial loop that does nothing but sum up the loop index.
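The original listing is not reproduced here, so the following is a minimal sketch of such a serial loop. The function name `SerialSum` is taken from the benchmark further below; the loop bound `limit` is an assumption (the real article likely uses a much larger value, reduced here so the example runs quickly).

```go
package main

import "fmt"

// limit is an assumption; a larger bound makes the benchmark effects
// more visible, but this value keeps the example quick to run.
const limit = 1 << 25

// SerialSum adds up all loop indices in a single goroutine,
// occupying only one logical CPU core.
func SerialSum() int {
	sum := 0
	for i := 0; i < limit; i++ {
		sum += i
	}
	return sum
}

func main() {
	fmt.Println(SerialSum()) // 562949936644096
}
```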

I bet you have been in this situation (and had this feeling) a few times already—but did it make your code faster every time? Here is an example of a simple loop that seems easy to turn into concurrent code. But as we will see, the concurrent version is not just not faster; it actually takes twice as long.

Here, a large loop! Let’s chop the input into pieces and the loop into goroutines!

Obviously, this loop will only occupy a single (logical) CPU core. The natural Gopher reflex is thus to break it into goroutines. Goroutines are functions that can run independently of the rest of the code and hence can be distributed among all available CPU cores.
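Again the original listing is missing, so here is a sketch of the concurrent version based on the description that follows: one goroutine per logical core, each writing its partial result into a shared global slice, with a wait group for synchronization. The name `ConcurrentSum` comes from the benchmark below; `limit` is the same assumed bound as before.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// limit is an assumption, chosen small enough to run quickly.
const limit = 1 << 25

// ConcurrentSum splits the index range across one goroutine per
// logical CPU core. Each goroutine writes its partial sum into a
// shared slice — the memory layout that, as we will see, causes trouble.
func ConcurrentSum() int {
	n := runtime.GOMAXPROCS(0)
	sums := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			start := i * (limit / n)
			end := start + limit/n
			if i == n-1 {
				end = limit // the last goroutine picks up the remainder
			}
			for j := start; j < end; j++ {
				sums[i] += j // repeated writes to a shared slice element
			}
		}(i)
	}
	wg.Wait()
	total := 0
	for _, s := range sums {
		total += s
	}
	return total
}

func main() {
	fmt.Println(ConcurrentSum()) // 562949936644096
}
```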

The speed gain is… negative!

So how does the concurrent version compare to the serial one? A quick benchmark should show the gain we get. A *_test.go file holds two simple benchmark functions, one for each of our two loop functions.

```go
package concurrencyslower

import "testing"

func BenchmarkSerialSum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		SerialSum()
	}
}

func BenchmarkConcurrentSum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		ConcurrentSum()
	}
}
```

My CPU is a small laptop CPU (two hyperthreaded cores, which the Go runtime sees as four logical cores), but still, the concurrent version should show a noticeable speed gain.

So let’s run the benchmarks:

```
$ go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/appliedgo/concurrencyslower
BenchmarkSerialSum-4               1    6090666568 ns/op
BenchmarkConcurrentSum-4           1   15741988135 ns/op
PASS
ok      github.com/appliedgo/concurrencyslower  21.840s
```

The suffix -4 in the benchmark names confirms that the test uses all four logical cores. But wait, what’s that? The concurrent loop took more than twice as long as its serial counterpart—despite using all four logical cores. What is going on here?

Hardware acceleration backfires

To explain this counter-intuitive result, we have to take a look at what lies beneath all software—the CPU chip.

The problem starts at the point where cache memory helps speed up each CPU core.

The following is a gross oversimplification for the sake of clarity and conciseness, so dear CPU designers, please be lenient towards me. Every modern CPU has a non-trivial hierarchy of caches that sit somewhere between main memory and the bare CPU cores, but for our purpose we will only look at the caches that belong to individual cores.

The purpose of a CPU cache

A cache is, generally speaking, a very small but super fast memory block. It sits right on the CPU chip, so the CPU does not have to reach out to main RAM every time it reads or writes a value. Instead, the value is stored in the cache, and subsequent reads and writes benefit from faster RAM cells and shorter access routes.

Every core of a CPU has its own local cache that is not shared with any other core. For n CPU cores this means that there can be up to n+1 copies of the same data: one in main memory, and one in each CPU core’s cache.

Now when a CPU core changes a value in its local cache, it has to be synchronized back to main memory at some point. Likewise, if a cached value gets changed in main memory (by another CPU core), the cached value is invalid and needs to get refreshed from main memory.


The cache line

To synchronize cache and main memory in an efficient way, data is synchronized in blocks of typically 64 bytes. These blocks are called cache lines.

So when a cached value changes, the whole cache line gets synchronized back to main memory. Likewise, the caches of all other CPU cores that contain this cache line must now also sync this cache line to avoid operating on outdated data.

Neighborhood

How does this affect our code? Remember that the concurrent loop uses a global slice to store the intermediate results. The elements of a slice are stored in a contiguous space. With high probability, two adjacent slice elements will share the same cache line.

And now the drama begins.

n CPU cores with n caches repeatedly read from and write to slice elements that all reside in the same cache line. So whenever one CPU core updates “its” slice element with a new sum, that cache line becomes invalid in the caches of all other cores. The changed cache line must be written back to main memory, and all other caches must then refresh their copy of that line with the new data—even though each core accesses a different part of the slice!

This consumes precious time—more than the time that the serial loop needs for updating its single sum variable.

This is why our concurrent loop needs more time than the serial loop. All the concurrent updates to the slice cause a hectic cache line sync dance.


Spread the data!

Now that we know the reason for the surprising slowdown, the cure is obvious. We have to turn the slice into n individual variables that hopefully will be stored far enough apart from each other so that they do not share the same cache line.

So let’s change our concurrent loop so that each goroutine stores its intermediate sum in a goroutine-local variable. In order to pass the result back to the main goroutine, we also have to add a channel. This in turn allows us to remove the wait group, because channels are not only a means for communication but also an elegant synchronization mechanism.

Concurrent loop with local variables
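The original listing is not shown, so the following is a sketch of the fix as described above: each goroutine accumulates into a goroutine-local variable and sends its result over a channel, which also replaces the wait group. The function name `ChannelSum` and the bound `limit` are assumptions.

```go
package main

import (
	"fmt"
	"runtime"
)

// limit is an assumption, chosen small enough to run quickly.
const limit = 1 << 25

// ChannelSum gives each goroutine its own local sum variable, so no
// two goroutines repeatedly write to neighboring memory locations.
// The channel both collects the partial results and synchronizes the
// goroutines with the main goroutine — no wait group needed.
func ChannelSum() int {
	n := runtime.GOMAXPROCS(0)
	res := make(chan int)
	for i := 0; i < n; i++ {
		go func(i int) {
			sum := 0 // goroutine-local intermediate sum
			start := i * (limit / n)
			end := start + limit/n
			if i == n-1 {
				end = limit // the last goroutine picks up the remainder
			}
			for j := start; j < end; j++ {
				sum += j
			}
			res <- sum // send the result; this also signals completion
		}(i)
	}
	total := 0
	for i := 0; i < n; i++ {
		total += <-res
	}
	return total
}

func main() {
	fmt.Println(ChannelSum()) // 562949936644096
}
```

Receiving n values from the channel doubles as the synchronization point: the main goroutine cannot proceed until every worker has sent its partial sum.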