Glyph Lefkowitz wrote an enlightening post recently in which he expounds in some detail about the challenges of writing highly concurrent software. If you write software and you haven’t read it yet I recommend you do. It’s a very well written piece, full of wisdom that no modern software engineer should be without.

There are many tidbits to extract, but if I may be so bold as to provide a summary of its main point, it would be this: The combination of preemptive multitasking and shared state generally leads to unmanageable complexity, and should be avoided by those who would prefer to maintain some amount of their own sanity. Preemptive scheduling is fine for truly parallel activities, but explicit cooperative multitasking is much preferred when mutable state is being shared across multiple concurrent threads.

With cooperative multitasking, your code will still likely be complex, it just has a chance at staying manageably complex. When control transfer is explicit a reader of the code at least has some visible indication of where things might go off the rails. Without explicit markers every new statement is a potential landmine of “What if this operation isn’t atomic with the last?” The spaces between each command become endless voids of darkness from which frightening Heisenbugs arise.

For the last year-plus while working on Heka (a high performance data, log, and metrics processing engine) I’ve been mostly programming in Go. One of Go’s selling points is that there are some very useful concurrency primitives baked right into the language. But how does Go’s approach to concurrency fare when viewed through the lens of encouraging code that supports local reasoning?

Not very well, I’m afraid. Goroutines all have access to the same shared memory space, state is mutable by default, and Go’s scheduler makes no guarantees about when, exactly, context switching will occur. In a single core setting I’d say Go’s runtime falls into the “implicit coroutines” category, option 4 in Glyph’s list of oft-presented async programming patterns. When goroutines can be run in parallel across multiple cores, all bets are off.

Go may not protect you, but that doesn’t mean you can’t take steps to protect yourself. By using some of the primitives that Go provides it’s possible to write code that minimizes unexpected behavior related to preemptive scheduling. Consider the following Go port of Glyph’s example account transfer code (ignoring that floats aren’t actually a great choice for storing fixed point decimal values):

func Transfer(amount float64, payer, payee *Account, server SomeServerType) error { if payer.Balance() < amount { return errors.New("Insufficient funds") } log.Printf("%s has sufficient funds", payer) payee.Deposit(amount) log.Printf("%s received payment", payee) payer.Withdraw(amount) log.Printf("%s made payment", payer) server.UpdateBalances(payer, payee) // Assume this is magic and always works. return nil }

This is clearly unsafe to be called from multiple goroutines, because they might concurrently get the same result from the Balance calls, and then collectively ask for more than the balance available with the Withdraw calls. It’d be better if we made it so the dangerous part of this code can’t be executed from multiple goroutines. Here’s one way to accomplish that:

type transfer struct { payer *Account payee *Account amount float64 } var xferChan = make(chan *transfer) var errChan = make(chan error) func init() { go transferLoop() } func transferLoop() { for xfer := range xferChan { if xfer.payer.Balance < xfer.amount { errChan <- errors.New("Insufficient funds") continue } log.Printf("%s has sufficient funds", xfer.payer) xfer.payee.Deposit(xfer.amount) log.Printf("%s received payment", xfer.payee) xfer.payer.Withdraw(xfer.amount) log.Printf("%s made payment", xfer.payer) errChan <- nil } } func Transfer(amount float64, payer, payee *Account, server SomeServerType) error { xfer := &transfer{ payer: payer, payee: payee, amount: amount, } xferChan <- xfer err := <-errChan if err == nil { server.UpdateBalances(payer, payee) // Still magic. } return err }

There’s quite a bit more code here, but we’ve eliminated the concurrency problem by implementing a trivial event loop. When the code is first executed, it spins up a goroutine running the loop. Transfer requests are passed into the loop over a channel created for that purpose. Results are returned back to outside of the loop via an error channel. Because the channels are unbuffered, they block, and no matter how many concurrent transfer requests come in via the Transfer function, they will be served serially by the single running event loop.

The code above is a little bit awkward, perhaps. A mutex would be a better choice for such a simple case, but I’m trying to demonstrate the technique of isolating state manipulation to a single goroutine. Even with the awkwardness, it performs more than well enough for most requirements, and it works, even with the simplest of Account struct implementations:

type Account struct { balance float64 } func (a *Account) Balance() float64 { return a.balance } func (a *Account) Deposit(amount float64) { log.Printf("depositing: %f", amount) a.balance += amount } func (a *Account) Withdraw(amount float64) { log.Printf("withdrawing: %f", amount) a.balance -= amount }

It seems silly that the Account implementation would be so naive, however. It might make more sense to have the Account struct itself provide some protection, by not allowing any withdrawals that are greater than the current balance. What if we changed the Withdraw function to the following?:

func (a *Account) Withdraw(amount float64) { if amount > a.balance { log.Println("Insufficient funds") return } log.Printf("withdrawing: %f", amount) a.balance -= amount }

Unfortunately, this code suffers from the same issue as our original Transfer implementation. Parallel execution or unlucky context switching means we might end up with a negative balance. Luckily, the internal event loop idea applies equally well here, perhaps even more neatly because an event loop goroutine can be nicely coupled with each individual Account struct instance. Here’s an example of what that might look like:

type Account struct { balance float64 deltaChan chan float64 balanceChan chan float64 errChan chan error }

func NewAccount(balance float64) (a *Account) { a = &Account{ balance: balance, deltaChan: make(chan float64), balanceChan: make(chan float64), errChan: make(chan error), } go a.run() return } func (a *Account) Balance() float64 { return <-a.balanceChan } func (a *Account) Deposit(amount float64) error { a.deltaChan <- amount return <-a.errChan } func (a *Account) Withdraw(amount float64) error { a.deltaChan <- -amount return <-a.errChan } func (a *Account) applyDelta(amount float64) error { newBalance := a.balance + amount if newBalance < 0 { return errors.New("Insufficient funds") } a.balance = newBalance return nil } func (a *Account) run() { var delta float64 for { select { case delta = <-a.deltaChan: a.errChan <- a.applyDelta(delta) case a.balanceChan <- a.balance: // Do nothing, we've accomplished our goal w/ the channel put. } } }

The API is slightly different, with both the Deposit and Withdraw methods now returning errors. Rather than directly handling their requests, though, they put the account balance adjustment amount onto a deltaChan, which feeds into the event loop running in the run method. Similarly, the Balance method requests data from the event loop by blocking until it receives a value over the balanceChan.

The important point to note in the code above is that all direct access to and mutation of the struct’s internal data values is done *within* code that is triggered by the event loop. If the public API calls play nicely and only use the provided channels to interact with the data, then no matter how many concurrent calls are being made to any of the public methods, we know that only one of them is being handled at any given time. Our event loop code is much easier to reason about.

This pattern is central to Heka’s design. When Heka starts, it reads the configuration file and launches each plugin in its own goroutine. Data is fed into the plugins via channel, as are time ticks, shutdown notices, and other control signals. This encourages plugin authors to implement their functionality with an event loop–type structure like the example above.

Again, Go doesn’t protect you from yourself. It’s entirely possible to write a Heka plugin (or any struct) that is loose with its internal data management and subject to race conditions. But with a bit of care, and liberal application of Go’s race detector, you can write code that behaves predictably even in the face of preemptive scheduling.

Rob Miller