This is the first in a series of posts sharing lessons learned implementing a performance-oriented cooperative multitasking library on top of the low-level Kotlin coroutines API.

The Kotlin standard library provides an excellent high-level coroutines API. It also exposes the raw building blocks for those looking for more control or performance. Coroutines can not only simplify otherwise complicated asynchronous software, but also deliver substantial performance gains when used properly.

For this first post, I'll discuss the general structure of a cooperative multitasking system and why you might want to structure an application in this manner.

What is Cooperative Multitasking?

A cooperative multitasking system is one in which tasks voluntarily yield in order to allow other tasks to run. This stands in contrast to a preemptive multitasking system, in which control can be taken from a task involuntarily at any time.

Patterns for Multithreaded Software

Before getting into what a cooperative multitasking system might look like in the context of the JVM, let's first consider today's standard.

Executors

The most common pattern for utilizing modern multi-core processors on the JVM today is the thread pool, typically via some manner of Executor, which takes Runnable tasks and abstracts away how and where they're run. For most use cases this is all well and good, and the Kotlin standard library's own high-level coroutine API plays nicely with this paradigm.
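As a quick refresher, submitting work to a pool looks something like this minimal sketch using the JDK's Executors factory:

```kotlin
import java.util.concurrent.Executors

fun main() {
    // A fixed pool sized to the machine's logical processor count.
    val pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors())

    // Submit a task; how and where it runs is entirely the pool's concern.
    val future = pool.submit<Int> { 21 * 2 }
    println(future.get())  // blocks until the task completes, then prints 42
    pool.shutdown()
}
```

Nothing about this code tells us which thread the task lands on, or when — that's precisely the decoupling the documentation describes.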

The documentation for Executor states that the very purpose of this system is "decoupling task submission from the mechanics of how each task will be run, including details of thread use, scheduling, etc." This all seems quite simple and easy, but let's consider an alternative.

Taking Control

In fact, let's do the exact opposite. We'll have our application take direct responsibility for the scheduling of tasks and where they run. For those coming from the world of Executors this might seem like madness, but it's not as complicated as it sounds and can bring great benefits. However, it will require rethinking how we structure our application.

Since we want as much control as possible without the operating system making scheduling decisions on our behalf, we'll use just one thread per logical processor. Astute observers might notice that a consequence of having a single thread per logical processor is that if we block one of these threads we now have one of our cores sitting idle. Indeed this is correct. This is where the cooperative part of cooperative multitasking comes into play. This system only works if our tasks graciously yield the floor and no one filibusters during their turn.
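Spinning up exactly one worker thread per logical processor is straightforward. This sketch only starts and names the threads; the actual scheduler loop each one would run is a topic for later posts:

```kotlin
fun main() {
    val cores = Runtime.getRuntime().availableProcessors()

    // One dedicated thread per logical processor; each would run a scheduler loop.
    val workers = List(cores) { id ->
        Thread({ println("scheduler-$id started") }, "scheduler-$id")
    }
    workers.forEach { it.start() }
    workers.forEach { it.join() }
}
```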

Tasks that our scheduler gives control to are responsible for yielding back to the scheduler either when out of work or when some reasonable time threshold has been exceeded.
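In code, a service's turn might look like the sketch below. The names runSlice, workQueue, process, and yieldToScheduler are illustrative stand-ins, not a real API:

```kotlin
// A ~1 ms slice, chosen arbitrarily for illustration.
const val TIME_BUDGET_NANOS = 1_000_000L

// Run until the queue is drained or the time budget is exceeded, then yield.
fun runSlice(workQueue: ArrayDeque<Int>, process: (Int) -> Unit, yieldToScheduler: () -> Unit) {
    val start = System.nanoTime()
    while (true) {
        // Out of work: give the floor back to the scheduler.
        val item = workQueue.removeFirstOrNull() ?: break
        process(item)
        // Time's up: don't filibuster, even if work remains.
        if (System.nanoTime() - start > TIME_BUDGET_NANOS) break
    }
    yieldToScheduler()
}
```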

With Executors we're accustomed to spawning new tasks on the slightest whim because the Executor will handle execution. However, since we're now responsible for scheduling these tasks, it might not be a good idea to create them willy-nilly.

Vertical vs Horizontal Processing

Most systems, including many coroutine based systems, approach processing in a vertical manner. To explain what I mean by vertical, consider a simple HTTP server. Processing a request might involve the following steps:

1. Accept Connection
2. Read from Socket
3. Parse Request
4. Handle Request
5. Write to Socket

Vertical Processing

In a vertical processing system each connection is considered individually. Each stage schedules the next in the Executor, and these individual tasks are executed in the thread pool.
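This stage chaining can be sketched as each stage submitting the next back to the shared pool. The stage bodies here are trivial placeholders for the real parsing and handling work:

```kotlin
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors

// Each stage of one request's processing schedules the next stage on the pool;
// the pool decides which thread actually runs it.
fun processVertically(pool: ExecutorService, rawRequest: String, writeToSocket: (String) -> Unit) {
    pool.execute {
        val parsed = rawRequest.trim()               // "Parse Request" stage
        pool.execute {
            val response = "handled: $parsed"        // "Handle Request" stage
            pool.execute { writeToSocket(response) } // "Write to Socket" stage
        }
    }
}
```

Each pool.execute call may land on a different thread, which is exactly the cache-unfriendly bouncing discussed below.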



The big advantage of this system is that keeping many threads busy is both easy and automatic. Much of what this diagram portrays is transparent to the user.

However, there are some downsides. First, since we don't have any control over when and where a task is run, any shared state it accesses must be carefully protected to ensure thread safety. Second, as illustrated at the bottom of the diagram, a single thread can bounce randomly between processing different tasks and objects. This makes it unlikely that data and code will be in cache, which is particularly important for getting the most out of modern hardware.

Horizontal Processing

In a horizontal processing system we break the application down into a number of services responsible for handling a single stage of processing.



Rather than bring the task to our executor, we bring data to the tasks. Having broken our application down into just five services, deciding where to run them from our schedulers becomes very tractable. It's a huge advantage to be able to choose exactly how and where each service will run. For instance, if our NIC has only two RX queues, we might choose to have two instances of our Read from Socket service. The work of the Handle Request service might be more involved, so we might want one of those running on each core to help distribute the load. We may want just one Accept Connection service if we have only one listening socket, so we don't have to worry about protecting a shared resource. We might also want the same scheduler to be responsible for both the Read from Socket and Write to Socket sides of any single connection, so we don't have to worry about the thread safety of the underlying socket.
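To make the start-up layout decision concrete, here's one way it might be modeled. Service, Scheduler, and layOutServices are invented names for this sketch, not part of any real library:

```kotlin
data class Service(val name: String)
class Scheduler(val core: Int) { val services = mutableListOf<Service>() }

// Decide at start-up which scheduler (i.e. which core) hosts which service instances.
fun layOutServices(cores: Int, rxQueues: Int): List<Scheduler> {
    val schedulers = List(cores) { Scheduler(it) }
    // One listening socket, so a single Accept Connection service: no shared resource.
    schedulers[0].services += Service("AcceptConnection")
    // One Read from Socket instance per NIC RX queue.
    repeat(rxQueues) { i -> schedulers[i % cores].services += Service("ReadFromSocket") }
    // Handle Request is the heavy stage, so one instance per core.
    schedulers.forEach { it.services += Service("HandleRequest") }
    return schedulers
}
```

The important point isn't this particular layout, but that the layout is an explicit, inspectable decision rather than something hidden inside a pool.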

With this type of control over our services we get some huge benefits. First, since we're controlling execution ourselves we can avoid a lot of the complexity of dealing with thread safety issues. Second, with a service working on the same type of data over and over we get natural batching effects and have a much higher likelihood of having data and code already in cache. Third, as can be seen in the RX queue example above, we can tailor how we run our application to the hardware it's running on without changing anything besides how we lay out our services.

Coroutines Enable Horizontal Processing

Now that we've fully embraced our never-block lifestyle and decided to take control of scheduling, we need an actual mechanism to yield execution to our scheduler. Thankfully, this is precisely what coroutines give us. Coroutines allow us to suspend execution at predefined locations and resume later from that exact point. Not only that, but doing so can be as simple as a function call.
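Here's a tiny demonstration using nothing but the standard library's kotlin.coroutines package: the coroutine suspends at a predefined point and the caller later resumes it from exactly there. The pending and waitForResume names are just for this sketch:

```kotlin
import kotlin.coroutines.*

var pending: Continuation<Unit>? = null

// Suspends the current coroutine and parks its continuation for later.
suspend fun waitForResume(): Unit = suspendCoroutine { cont -> pending = cont }

fun main() {
    val task: suspend () -> Unit = {
        println("before suspend")
        waitForResume()           // execution pauses here...
        println("after resume")   // ...and picks up here when resumed
    }
    // Start the coroutine; the completion continuation ignores the result.
    task.startCoroutine(Continuation(EmptyCoroutineContext) { })
    pending?.resume(Unit)         // "the scheduler" resuming the task
}
```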

For example, a service whose work comes from a queue that supports this paradigm might simply request the next work item with val work = queue.take(). If the queue is empty, the task will transparently suspend and yield execution back to the scheduler until more work is available. Little is required of the developer beyond ensuring that a service explicitly yields to the scheduler if so much work is available that it might starve other services.
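A hypothetical single-consumer queue in the spirit of queue.take() might look like the sketch below. Because only the owning scheduler thread ever touches it, no locks are needed; WorkQueue and its members are invented names for illustration:

```kotlin
import kotlin.coroutines.*

class WorkQueue<T : Any> {
    private val items = ArrayDeque<T>()
    private var waiter: Continuation<T>? = null

    // Returns the next item, or transparently suspends the caller if empty.
    suspend fun take(): T =
        items.removeFirstOrNull() ?: suspendCoroutine { cont -> waiter = cont }

    // Delivers an item, resuming a suspended taker if one is parked.
    fun offer(item: T) {
        val w = waiter
        if (w != null) { waiter = null; w.resume(item) } else items.addLast(item)
    }
}
```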

Conclusions

This style of application isn't for everyone, but for those looking for maximum performance with minimal complexity, a cooperative multitasking system based on coroutines is tough to beat. While the high-level Kotlin coroutine library acknowledges and ultimately supports cooperative multitasking, its underlying data structures and algorithms are thread-safe and built for maximum flexibility, which hinders performance compared to a more specialized system.

Keys to Performance

It's worth recapping exactly why this type of architecture performs so well compared to the alternatives.

First, it allows for the use of efficient non-thread-safe algorithms and data structures.

A subtle but extremely important benefit of having control over our own scheduling is that we no longer have to worry about our work being preempted in a critical section. If we structure our application to not share mutable state across scheduler boundaries we can completely remove the necessity of synchronization mechanisms and use much more efficient algorithms and data structures. It also has the side benefit of simplifying code that might otherwise require complicated thread safety mechanisms.
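For example, state owned by a single service can live in plain unsynchronized structures. ConnectionTable here is a made-up illustration:

```kotlin
// Only the owning scheduler thread ever touches this table, so a plain HashMap
// suffices: no locks, no ConcurrentHashMap, no memory barriers to reason about.
class ConnectionTable {
    private val peersById = HashMap<Int, String>()

    fun register(id: Int, peer: String) { peersById[id] = peer }
    fun lookup(id: Int): String? = peersById[id]
    fun close(id: Int) { peersById.remove(id) }
}
```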

Second, it's particularly well suited to modern hardware.

The natural batching that occurs from processing in a horizontal manner is very efficient on modern hardware. The next work items can be predictably fetched, data structures and code may already be in cache, and bulk operations can be used where appropriate. Traditional Executor-based approaches result in much more random access patterns, which don't play well with CPU caches.

Third, we can tailor our application to the hardware at runtime.

The decision of how to distribute our services can be made at application start-up, and taking advantage of the particulars of the hardware we're deploying to can yield significant advantages. This can be as simple as choosing the number of services based on core count, or we can go a step further and bind schedulers to specific cores to place our services even more strategically: perhaps placing services that share data on the same physical socket, or placing services that access network resources on the same CPU that handles interrupts for the device.

Hopefully this first post covers why you might care about this architecture. There's a lot that goes into a well-functioning scheduler and its services, and you can expect later posts in this series to dig much deeper into the technical details of those concepts.

Header photo credit US Fish and Wildlife Service, source, formatting alterations made