Previously, I wrote about using coroutines in RethinkDB. Coroutines are a nice alternative to callbacks because they are easier to program in, and they are a nice alternative to threads because of their greater performance. But how fast are they?

Just using an off-the-shelf library like libcoroutine isn’t as fast as you might think. The graph below shows the huge performance degradation of a naive implementation of coroutines (the short red bar) compared to the old callback-based code (the blue bar on the left). But with the right set of optimizations, we can recover a level of performance about equal to the non-coroutine version. With these optimizations, throughput is recorded as the pink bar on the right, which is within the margin of error of the original version.

The improvement comes from two optimizations: reuse of stacks and a lightweight swapcontext implementation. The two optimizations are completely internal to the coroutines library and require no changes to code which uses them. As you can see from the graph, both optimizations are essential for acceptable performance.

Reusing coroutines

Spawning a new coroutine with libcoroutine on Unix performs the following actions:

A stack is allocated

A ucontext object is allocated and initialized using makecontext and getcontext

A libcoroutine Coro object is created around the stack and context

The context is swapped to the new coroutine, jumping to a trampoline, and from there, executing user code with supplied arguments

When the user code terminates, all of these structures have to be deallocated. Unless our allocator is perfect (which it isn’t) and the ucontext routines are really fast (which they aren’t), this constitutes a lot of waste, since each select request spawns multiple coroutines. Why not reuse the same coroutine object for multiple actions?

The logic is simple. Each OS thread has a list of free coroutines. To do something in a new coroutine, the user pops an item off the free list and sends it a message containing the action. Each coroutine has a ‘run loop’ which does the following:

Wait for a message containing an action

Execute that action

Push self back on to the free list

If a coroutine is requested and the free list is empty, a new coroutine executing that run loop is spawned and pushed onto the list.

A lightweight swapcontext implementation

Once old coroutines are reused, the only frequent call into libcoroutine is the one that switches contexts. Ultimately, this calls the ucontext routine swapcontext. What does that do, exactly? Well, here’s the glibc 2.12.2 implementation for x86-64:

```asm
swapcontext:
	/* Save the preserved registers, the registers used for
	   passing args, and the return address. */
	movq	%rbx, oRBX(%rdi)
	movq	%rbp, oRBP(%rdi)
	movq	%r12, oR12(%rdi)
	movq	%r13, oR13(%rdi)
	movq	%r14, oR14(%rdi)
	movq	%r15, oR15(%rdi)

	movq	%rdi, oRDI(%rdi)
	movq	%rsi, oRSI(%rdi)
	movq	%rdx, oRDX(%rdi)
	movq	%rcx, oRCX(%rdi)
	movq	%r8, oR8(%rdi)
	movq	%r9, oR9(%rdi)

	movq	(%rsp), %rcx
	movq	%rcx, oRIP(%rdi)
	leaq	8(%rsp), %rcx		/* Exclude the return address. */
	movq	%rcx, oRSP(%rdi)

	/* We have separate floating-point register content memory on the
	   stack. We use the __fpregs_mem block in the context. Set the
	   links up correctly. */
	leaq	oFPREGSMEM(%rdi), %rcx
	movq	%rcx, oFPREGS(%rdi)
	/* Save the floating-point environment. */
	fnstenv	(%rcx)
	stmxcsr	oMXCSR(%rdi)

	/* The syscall destroys some registers, save them. */
	movq	%rsi, %r12

	/* Save the current signal mask and install the new one with
	   rt_sigprocmask (SIG_BLOCK, newset, oldset, _NSIG/8). */
	leaq	oSIGMASK(%rdi), %rdx
	leaq	oSIGMASK(%rsi), %rsi
	movl	$SIG_SETMASK, %edi
	movl	$_NSIG8, %r10d
	movl	$__NR_rt_sigprocmask, %eax
	syscall
	cmpq	$-4095, %rax		/* Check %rax for error. */
	jae	SYSCALL_ERROR_LABEL	/* Jump to error handler if error. */

	/* Restore destroyed registers. */
	movq	%r12, %rsi

	/* Restore the floating-point context. Not the registers, only the
	   rest. */
	movq	oFPREGS(%rsi), %rcx
	fldenv	(%rcx)
	ldmxcsr	oMXCSR(%rsi)

	/* Load the new stack pointer and the preserved registers. */
	movq	oRSP(%rsi), %rsp
	movq	oRBX(%rsi), %rbx
	movq	oRBP(%rsi), %rbp
	movq	oR12(%rsi), %r12
	movq	oR13(%rsi), %r13
	movq	oR14(%rsi), %r14
	movq	oR15(%rsi), %r15

	/* The following ret should return to the address set with
	   getcontext. Therefore push the address on the stack. */
	movq	oRIP(%rsi), %rcx
	pushq	%rcx

	/* Setup registers used for passing args. */
	movq	oRDI(%rsi), %rdi
	movq	oRDX(%rsi), %rdx
	movq	oRCX(%rsi), %rcx
	movq	oR8(%rsi), %r8
	movq	oR9(%rsi), %r9

	/* Setup finally %rsi. */
	movq	oRSI(%rsi), %rsi

	/* Clear rax to indicate success. */
	xorl	%eax, %eax
	ret
```

(The o* symbols are #define’d in another file to be offsets into the ucontext_t struct.)

You might spot a few unnecessary things here:

Argument registers (RDI, RSI, RDX, RCX, R8, and R9) are saved and restored, even though the calling convention doesn’t guarantee that they’re preserved across a function call

x87 and SSE state is saved and restored, even though we’re not doing anything with floats or SIMD that requires register contents to be preserved across function calls

The signal mask is saved and restored with the sigprocmask system call, even though the signal handlers in RethinkDB are the same globally

The final sin, making an unnecessary syscall, is the greatest, and eliminating it provides a big speed boost. On top of this, it helps to eliminate the memory traffic from saving and restoring the unnecessary registers.

Things that weren’t important

I made one more minor optimization alongside these: stacks were malloc’ed rather than calloc’ed, since there’s no reason they have to be zeroed beforehand. It turns out that this isn’t crucial to performance once pooling is applied; it just reduces the warmup time. Reducing the size of the call stacks is similar: with pooling, it just reduces warmup time. Without pooling, both of these optimizations yield significant improvements in throughput. It still might be a good idea to keep call stacks small so that they don’t take up too much memory, but this is not a problem right now. In practice, warmup time will be dominated by disk I/O in filling up the in-memory cache, and the database will be restarted only rarely.

Conclusion

Coroutines can be made to work fast, though it’s not as immediate as one might hope. The POSIX primitives aren’t optimal for all applications. I’d like to thank Andrew Hunter for giving me advice on how to proceed in this optimization process.