A straightforward translation of a C++ or Java program into Go is unlikely to produce a satisfactory result — Java programs are written in Java, not Go. On the other hand, thinking about the problem from a Go perspective could produce a successful but quite different program. — Effective Go

One of the central goals of golang’s design was a language that could take advantage of multiple threads without recompilation or per-platform customization. The goroutine is not a thread, but when threads are available the scheduler will use them to parallelize your workload as much as possible. For this to function at all, the developer must use thread safe data structures; for parallelization to actually improve performance, those data structures must be thread safe in the most efficient way possible.

The Mutex Illusion

In traditional multithreaded design, data structures are protected with locking primitives such as mutexes and semaphores. Critical sections are guarded by these “locks”, and resources are managed by preventing one thread from accessing data that another is writing until the transaction completes. Because even simple lines of code break down into multiple machine instructions, these locks are crucial for maintaining data integrity. Used correctly, mutexes avoid race conditions and invalid data, but used too aggressively they can cripple linear thread scaling. They give the “illusion” of multithreading: the code appears to run on multiple threads, but in reality those threads execute the critical section in sequence.

Consider the following code:

The sleep simulates some “work” while we read and write strings. The following will benchmark it:

As the number of threads scales, the naive assumption would be that performance should go up. Unfortunately that is not the case:

Even though the number of threads is scaling, the change in processing time is within a reasonable margin of error. If it isn’t already obvious, this is a consequence of how the code is designed. Although we have many threads, they are all waiting in line at the mutex. Without the mutex this fictitious workload wouldn’t be safe, but all we have done is take on the overhead of extra goroutines and some synchronization objects while doing no better than if the work were serial. Clearly this is not what we were aiming for.

Using a mutex around a critical section that is more than a single data operation will not increase performance linearly. This example may not have come up in your computer science textbook, because textbook examples typically put mutexes around individual data elements. The unfortunate reality is that critical sections often surround complex and slow external systems, reached through APIs, sockets, or other slow operations such as encryption or authentication. For this reason the simple model of a tight mutex around a memory address won’t work.

The Channel Based Thread Safe Object

What would be nice to have is an object with the same interface as the mutex example that somehow avoids the resource blocking the mutex created. While there are several ways of doing this (even some with mutexes), using channels can simplify the code and also increase the possible parallelism of the system. The first thing we need to do is augment the interface a bit. We can retain the public “Read” and “Write” methods, but we will need to add “Start” and “Close” methods as well so that we can be sure workloads are fully processed:

Like the benchmark code, we use a sync.WaitGroup to make sure the background goroutine has finished before shutdown completes. The background goroutine is a for/select loop that handles a handful of channels:

This forces writes to happen synchronously, as expected, while read operations can be asynchronous, returning the data at a given snapshot in time:

Here is where the “work” is done that simulates the slow resource. All that is left is to wire Write and Read into the background goroutine:

And build a constructor:

The principles are explained in the comments, but the basic idea to keep in mind is that memory is moved between goroutines through channels. I will get into how this is done in a bit, but suffice it to say for now that it is very efficient on modern processors. Once data is handed to “Write” it no longer belongs to the caller (running your tests with “-race” will catch violations of this rule), and data returned from “Read” is only a copy of what is stored within the struct (whether stored in memory or, in an abstract sense, stored through an API).

With this framework setup here are the benchmark results:

Even with one read and one write consumer the performance is better than the mutex design, and as more workers are added it quickly approaches the theoretical max (because we designed writes to be synchronous, the max would be on the order of 10000000000 ns/op). If we instead had a system where “Write” could be processed out of order, the scaling is ridiculous:

With the last two speeds needing to be calculated by (iterations * speed) at 120000000 and 20000000 ns/op.

No Free Lunch

Like everything in software engineering, the increased speed of this sort of design comes at a cost. If reads and writes must be processed in the exact order received, or if the work is done on a single resource (such as a single database connection), this design is clearly overkill. If your system cannot be modeled with parallel workloads, then threading is never going to give you better performance. If, however, you can tolerate asynchronous processing, “stale” but internally consistent reads, or workloads dominated by overhead (such as encryption), arranging your threading in this sort of system will give you the scaling that is possible. This becomes even more powerful as you piece the data pipeline together further, but that is a topic for another day.

Why Channels?

You may be asking yourself: why should I use channels when I can imagine building these sorts of workload distribution systems the way I always have? The secret is in how channels work. Channels use two critical techniques to make these memory operations as fast as possible. The first is the write barrier; the second is the use of instructions such as AMD64’s CMPXCHG, a compare-and-swap or “CAS” operation.

Write Barriers

When a channel sends data to a destination it must perform a very fast memory move. In golang this is one of the few portions of the runtime written directly in assembler. To guarantee that this low level code is always correct and never interrupted by garbage collection, a write barrier is created. This sets up a bitmap that forces writes between the source and destination to happen in sequence; once the barrier is in place, the highly optimized memory-move assembler can be safely executed without any of the memory “under” the barrier being changed by the garbage collector.

CAS Operations

A CAS operation does the following:

If (accumulator == destination) then (destination <- source)

else (accumulator <- destination)

This single opcode is essentially a processor level solution to the “sleeping barber” race condition. If the accumulator holds the value of a “lock” variable found at the destination, and the source is the “lock state” you wish to set (a non-zero value, for instance), you can use a CAS operation to verify that you don’t set the lock state on top of another process that set it while you weren’t looking. This operation drastically decreases the cost of the write barriers in the channel’s memmove and of the low level atomic variables in the channel implementation, to a level that cannot be achieved with mutexes and high level code alone. A side note: this is also how the atomic package is implemented in golang, so if channels just aren’t appropriate for your synchronization and you need to get down and dirty with your code, I encourage you to look at the CompareAndSwap* functions you will find there.

Idiomatic Code

While every language has its purists who strive for idiomatic code for less than noble reasons, golang’s principle of “Share memory by communicating; don’t communicate by sharing memory.” is more than empty words. Because golang manages both memory and threads (through the scheduler), following idiomatic patterns such as eschewing shared memory will improve your code’s performance. Organizing your pipeline to pass data through channels not only allows clearer encapsulation of concerns but also lets the golang runtime better manage memory and thread resources. Recent updates to the core language have refined the internal operation of the garbage collector to better associate memory with goroutines, and these sorts of optimizations are built around this idiomatic pattern.

Source code for the examples can be found at https://github.com/weberr13/Kata/tree/master/channels and as always, MIT license