How Rust optimizes async/await, Part 1 (Aug 18, 2019)

The issue to stabilize an initial version of async/await in Rust has left its final comment period. The feature looks slated to stabilize in an upcoming release, most likely 1.39.

This represents the culmination of an enormous amount of work by people all over the Rust community. But it’s also only the beginning of async/await support in Rust. The feature set being stabilized is a “minimum viable product” for shipping async/await, and we plan to continue to expand the feature set after initial stabilization.

I think this is a great success for the release train model of Rust. Users will gain access to stable features sooner, without being stuck with any “rushed” decisions made to get the features out the door.

One of the blockers mentioned in the RFC is the size of the state machines emitted by async fn. I've spent the last few months tackling this problem, and wanted to give people a window into the process of writing these optimizations, with all the intricacies involved.

Background: Nesting Futures

If you’ve been following the development of async/await in Rust, you know that an async fn in Rust returns a type that implements the Future trait. A future represents a value that isn’t ready yet: it could be a database request that’s pending a response, or an asynchronous read from the filesystem, for example.

My introduction to futures in Rust came by way of Aaron Turon’s blog post, Zero-cost futures in Rust. Reading it was one of those defining moments that made me realize just how important Rust was going to be. Here’s a snippet:

```rust
let future = id_rpc(&my_server).and_then(|id| {
    get_row(id)
}).map(|row| {
    json::encode(row)
}).and_then(|encoded| {
    write_string(my_socket, encoded)
});
```

This is non-blocking code that moves through several states: first we do an RPC call to acquire an ID; then we look up the corresponding row; then we encode it to json; then we write it to a socket. Under the hood, this code will compile down to an actual state machine which progresses via callbacks (with no overhead), but we get to write it in a style that's not far from simple blocking code.

The entire state of a request is called a task. Aaron continues in a later post:

Tasks do not require their own stack. In fact, all of the data needed by a task is contained within its future. That means we can neatly sidestep problems of dynamic stack growth and stack swapping, giving us truly lightweight tasks without any runtime system implications.

Seeing code written like this that compiled down to one state machine, with full code and data inlining, and no extra allocations, was captivating. You may as well have dropped out of the sky on a flying motorcycle and told me that magic exists, and I was a wizard.

What do I mean by “data inlining”? The above example can be approximated as a type that resembles the following:

```rust
enum RequestFuture {
    Initialized,
    FetchingId(IdRpcFuture),
    FetchingRow(GetRowFuture),
    WritingString(WriteStringFuture),
    Complete,
}
```

Note that we nest other futures within our future type. And like any other enum, the compiler only needs to reserve space for one variant at a time: it knows that when our future is in the FetchingRow state, it no longer needs to keep around space for the old IdRpcFuture. We can simply reuse those bytes for the GetRowFuture. The result of all this is that the size of RequestFuture in memory is the maximum of the sizes of the sub-futures it contains, not their sum.
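To see this overlap concretely, here's a small runnable sketch. The stand-in future types and their payload sizes are entirely made up for illustration; the point is that std::mem::size_of reports the enum as being sized for its largest variant, not the sum of all of them:

```rust
use std::mem::size_of;

// Hypothetical stand-ins for the nested futures; the payload
// sizes here are invented purely for illustration.
#[allow(dead_code)]
struct IdRpcFuture([u8; 16]);
#[allow(dead_code)]
struct GetRowFuture([u8; 64]);
#[allow(dead_code)]
struct WriteStringFuture([u8; 32]);

#[allow(dead_code)]
enum RequestFuture {
    Initialized,
    FetchingId(IdRpcFuture),
    FetchingRow(GetRowFuture),
    WritingString(WriteStringFuture),
    Complete,
}

fn main() {
    // The enum needs room only for its largest variant (plus a
    // discriminant), not for every variant at once.
    let sum_of_variants = size_of::<IdRpcFuture>()
        + size_of::<GetRowFuture>()
        + size_of::<WriteStringFuture>();
    assert!(size_of::<RequestFuture>() >= size_of::<GetRowFuture>());
    assert!(size_of::<RequestFuture>() < sum_of_variants);
    println!("RequestFuture is {} bytes", size_of::<RequestFuture>());
}
```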

Those futures, in turn, may contain other nested futures. The compiler lays those out the same way. We’ve created a tree of state machines. Even though the implementation details of these nested futures are opaque to me as a programmer, they are known to the compiler, which lays them out optimally inside a single type.

Enter async/await

Fast forward a year or two, and more and more people are realizing that while combinator-based Futures are extremely powerful, they aren’t always that fun to write. They often result in deeply-nested callbacks, which isn’t the ordinary way people write code in Rust. It also isn’t possible to hold a reference across callbacks, which leads to very awkward workarounds.

async/await, which had been considered from the beginning, was now a high priority.

Our example from earlier could be written like this using async/await:

```rust
let id = id_rpc(&my_server).await;
let row = get_row(id).await;
let encoded = json::encode(row);
write_string(my_socket, encoded).await;
```

Even in this simple example, the code is easier to read! Many people who don’t mind using unstable features on the nightly compiler decided to start using async/await directly, because of all the benefits that came with it.

Zero-cost, eventually

One problem with this: up until a few weeks ago, the size of the Future emitted by an async fn could easily be huge. In particular, we didn't do any of the "data inlining" tricks I described above. In short, this meant that the size of an async fn-generated future grew exponentially with each new level of future it awaited. Instead of overlapping parts of our state machine at each level of the tree, we were sprawling the whole thing out in memory.

This sometimes led to stack overflows in code that had many levels of futures being awaited. Even when these futures didn't overflow the stack, they could still grow quite large. In Fuchsia, some tests had async fns which returned a single state machine over 400 kB in size!

This is not a design flaw of async/await; it was an intentional gap left in the early implementation. It was also quite fixable, but required substantial work writing new optimizations in the compiler, something that was completely new to me when I started working on this problem.

Throughout the rest of this series, I’ll cover the process of implementing these optimizations. I hope you’ll come away with a deeper understanding of what you’re getting when you use async/await in Rust, and more insight into what’s going on inside the compiler. We’ll also take a detour through another exciting, but unstable, feature in Rust today: generators.

Every async fn is a generator

Inside the compiler, async fn in Rust is implemented using generators. You may have seen generators in Python, Ruby, or C#.

Generators, also known as coroutines, are a way of doing “lazy evaluation” in an imperative programming language. Unlike async/await, they are not slated for stabilization anytime soon. That said, they allow you to write code like the following in Rust:

```rust
let mut gen = || {
    let xs = vec![1, 2, 3];
    let mut sum = 0;
    for x in xs {
        sum += x;
        yield sum;
    }
};
```

Here, gen is a generator. As you can see, it's declared using the normal syntax for closures. However, generators are not called like normal closures. Instead, they can be resumed multiple times, via a method called resume().

Each time resume() is called, the code inside the closure runs until it hits a yield statement. The value being yielded (in this case, the current value of sum) is returned by resume(). In this example, the first time we resume we'd get 1, then 3, then 6, before the generator returns.

resume() actually returns an enum called GeneratorState. Here's its definition:

```rust
enum GeneratorState<Y, R> {
    Yielded(Y),
    Complete(R),
}
```

Our generator above has a yield type of i32 and doesn't return anything, so its return type is (). The first three calls to resume() would return GeneratorState::Yielded(x) with some value x, followed by GeneratorState::Complete(()).

So, what does all this have to do with async fn? The key feature of an async fn is that it allows you to await another future, which suspends execution of your function until that future is complete. That last part, suspending execution of a function, is exactly what yield does!

And indeed, .await can be, and is, implemented in terms of yield. So when we optimize generators, we're also optimizing .await.

Generators as data structures

Let’s look at a slightly more complicated example:

```rust
let xs = vec![1, 2, 3];
let mut gen = || {
    let mut sum = 0;
    for x in xs.iter() {       // iter0
        sum += x;
        yield sum;             // Suspend0
    }
    for x in xs.iter().rev() { // iter1
        sum -= x;
        yield sum;             // Suspend1
    }
};
```

This example yields the sequence 1, 3, 6, 3, 1, 0, before returning.

Generators work by saving the internal state of a function inside a state machine object. In our example, xs and sum would be saved inside our state machine, along with the iterators tracking the state of each for loop. We'll call these iter0 and iter1, respectively. Finally, we store an enum recording the next place the generator should resume from: either the beginning, or one of the yield points.

If you were to take this generator and “compile it by hand,” you might write the type definitions out something like this:

```rust
enum SumGeneratorState {
    Unresumed,
    Suspend0,
    Suspend1,
    Finished,
}

struct SumGenerator {
    resume_from: SumGeneratorState,
    xs: Option<Vec<i32>>,
    iter0: Option<Iter<'self, i32>>,
    iter1: Option<Iter<'self, i32>>,
    sum: Option<i32>,
}
```

Now when our generator is resumed, we can match on the value of resume_from and behave accordingly. Note the use of Option, because local variables go in and out of scope as our generator executes. We can't have uninitialized values in Rust, so we store None while the variable is out of scope.

There's one fishy thing going on here. Our iterator type references our Vec xs, another field in the same type. We use the special lifetime 'self to represent this. Rust doesn't normally allow writing data structures which reference themselves (at least, not without unsafe code). But the compiler knows how to do this safely, and since we're pretending to be the compiler here, we can do it too.
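To make the "compile it by hand" idea concrete, here's a sketch of the same generator that actually runs on stable Rust. This is not what the compiler emits: to sidestep the self-borrow, it stores an index into xs instead of real Iter<'self, i32> fields, and the names (ResumePoint, SumGenerator::resume) are invented for illustration:

```rust
enum GeneratorState<Y, R> {
    Yielded(Y),
    Complete(R),
}

enum ResumePoint {
    Unresumed,
    Suspend0,
    Suspend1,
    Finished,
}

struct SumGenerator {
    resume_from: ResumePoint,
    xs: Vec<i32>,
    idx: usize, // stands in for iter0/iter1 to avoid the self-borrow
    sum: i32,
}

impl SumGenerator {
    fn new(xs: Vec<i32>) -> Self {
        SumGenerator { resume_from: ResumePoint::Unresumed, xs, idx: 0, sum: 0 }
    }

    fn resume(&mut self) -> GeneratorState<i32, ()> {
        loop {
            match self.resume_from {
                // The first loop: walk forward, adding each element.
                ResumePoint::Unresumed | ResumePoint::Suspend0 => {
                    if self.idx < self.xs.len() {
                        self.sum += self.xs[self.idx];
                        self.idx += 1;
                        self.resume_from = ResumePoint::Suspend0;
                        return GeneratorState::Yielded(self.sum);
                    }
                    // First loop exhausted; start the reversed loop.
                    self.idx = self.xs.len();
                    self.resume_from = ResumePoint::Suspend1;
                }
                // The second loop: walk backward, subtracting.
                ResumePoint::Suspend1 => {
                    if self.idx > 0 {
                        self.idx -= 1;
                        self.sum -= self.xs[self.idx];
                        return GeneratorState::Yielded(self.sum);
                    }
                    self.resume_from = ResumePoint::Finished;
                    return GeneratorState::Complete(());
                }
                ResumePoint::Finished => panic!("generator resumed after completion"),
            }
        }
    }
}

fn main() {
    let mut gen = SumGenerator::new(vec![1, 2, 3]);
    let mut yielded = Vec::new();
    while let GeneratorState::Yielded(v) = gen.resume() {
        yielded.push(v);
    }
    assert_eq!(yielded, vec![1, 3, 6, 3, 1, 0]);
}
```

Driving it to completion yields the same sequence as the closure version: 1, 3, 6, 3, 1, 0.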

It’s worth noting here that the ability to write code that keeps references across suspend points — whether for yield or its close cousin await — is a major development. It’s a big improvement over the world of Futures combinators, which didn’t allow anything like this.

Layout variants

Storing all local variables inside a struct is a fine way of doing it. In fact, it's pretty much exactly what the Rust compiler did until a few months ago!

But remember, we want to optimize the size of our state machines. The problem is, we’re allocating space for two iterators instead of one. This may not seem like a big deal, but when code gets more complicated than this (and it does), these wasted bytes can really add up.

What if, instead, we stored the state inside our enum variants?

```rust
enum SumGenerator {
    Unresumed { xs: Vec<i32> },
    Suspend0 { xs: Vec<i32>, iter0: Iter<'self, i32>, sum: i32 },
    Suspend1 { xs: Vec<i32>, iter1: Iter<'self, i32>, sum: i32 },
    Returned,
}
```

When we switch between variants, we can simply move xs and sum between them. This is much better: now the size of our enum is the maximum of the sizes we need at any given time. As an added benefit, we've dropped the Options, since we know exactly which locals are in use at every point in the function.
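A quick runnable comparison illustrates the size win. A plain usize stands in for the self-referential iterators here, so the exact numbers are only illustrative, but the variant-based enum comes out smaller than the struct that reserves an Option slot for every local:

```rust
use std::mem::size_of;

// A struct-based layout: every local gets its own slot, wrapped
// in Option for the times it's out of scope. (A usize index
// stands in for the self-referential iterators.)
#[allow(dead_code)]
struct StructLayout {
    resume_from: u8,
    xs: Option<Vec<i32>>,
    iter0: Option<usize>,
    iter1: Option<usize>,
    sum: Option<i32>,
}

// The variant-based layout: each state carries only the locals
// that are live at that point, and variants overlap in memory.
#[allow(dead_code)]
enum VariantLayout {
    Unresumed { xs: Vec<i32> },
    Suspend0 { xs: Vec<i32>, iter0: usize, sum: i32 },
    Suspend1 { xs: Vec<i32>, iter1: usize, sum: i32 },
    Returned,
}

fn main() {
    // The enum needs space only for its largest variant, so it is
    // smaller than the struct that reserves a slot for everything.
    assert!(size_of::<VariantLayout>() < size_of::<StructLayout>());
    println!(
        "struct layout: {} bytes, enum layout: {} bytes",
        size_of::<StructLayout>(),
        size_of::<VariantLayout>()
    );
}
```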

In reality, the compiler doesn’t move variables between variants at all. Efficiency is one reason for this, but another is the fact that generators can hold references to themselves. What if some variable holds a reference to the variable we’re moving? Then we would be left with a dangling pointer, and our program would misbehave.

Instead, the compiler reserves space for every variable once, and allows the same variable to be included in multiple variants. What we’re left with is a many-to-many relationship, like the following:

|       | Unresumed | Suspend0 | Suspend1 | Finished |
|-------|-----------|----------|----------|----------|
| xs    | ✓         | ✓        | ✓        |          |
| iter0 |           | ✓        |          |          |
| iter1 |           |          | ✓        |          |
| sum   |           | ✓        | ✓        |          |

Note that we can still reuse the same bytes for iter0 and iter1 here, because they’re never used at the same time. When the generator is in the Suspend0 variant, those bytes will be interpreted as belonging to iter0 . When it’s in the Suspend1 variant, they’ll belong to iter1 . This can happen even for variables of completely different types and sizes.

The astute reader will note that this enum-like approach is exactly what we wanted earlier for our future type. But besides saving memory, there are other reasons we want to do it this way.

By tying the local variables which are in use to the current state of the enum, we’re preserving a lot of really useful info for developer tools. Debuggers can read the debuginfo we emit, and actually understand our generator objects enough to know which local variables are in use. Additionally, it should allow miri to validate the unsafe code inside our generators, checking for uses of unsafe that result in undefined behavior. Neat!

Conclusion

In this post, we learned about generators in Rust and how async/await is implemented using them. We saw that generators are laid out in an enum-like data structure, which has a many-to-many mapping between variables and layout variants. And we saw an example of self-borrowing generators, a large part of what makes async/await so easy to use in Rust.

So far, we’ve left out some important considerations:

- What does the implementation of .await actually look like?
- In the general case, how do compilers decide which variables go in which variants?
- How do we actually allocate the bytes optimally in memory?

Over the rest of this series, we’re going to answer these questions. As we’ll see, there’s quite a bit more subtlety to them than one might expect.

See comments on r/rust and HN.

Thanks to Josh Dover for reviewing a draft of this post. Thanks to Taylor Cramer, Petr Hosek, and Paul Kirth for reviewing a precursor to this post.

Appendix: Implementation

After I initially published this post, some people requested links to the implementation. If you want to jump into the code, the Rustc Guide is a great resource for understanding many details about the compiler.

The implementation of the topics in this post mainly had to do with refactoring the Rust compiler so that we could express multi-variant layouts for generators.