[llvm-dev] RFC: LLVM Coroutine Representation, Round 2

Hi all: Thank you very much for the feedback during the first round! Special thanks to Eli, Sanjoy, Hal and Chandler for their detailed public and private comments. The model is simpler now and the intrinsics have nice symmetry to them: coro.begin + coro.end, coro.alloc + coro.free. Compared to the previous RFC, coro.fork is gone, coro.init and coro.start are collapsed into one coro.begin, coro.suspend is three-way as opposed to two-way. coro.elide and coro.delete renamed coro.alloc and coro.free. All the changes implemented and tested. Full document, (nicely formatted by github is at: https://github.com/GorNishanov/llvm/blob/coro-rfc/docs/Coroutines.rst Below is a summary of a proposal to add coroutine support to LLVM. The proposal is motivated primarily by the desire to support C++ Coroutines [1], however the llvm representation is language neutral and can be used to support coroutines in other languages as well. Clang + llvm coroutines allows you to take this code: generator<int> range(int from, int to) { for(int i = from; i < to; ++n) co_yield i; } int main() { int sum = 0; for (auto v: range(1,100)) sum += v; return sum; } And translate it down to this: define i32 @main() #5 { entry: ret i32 4950 } You can also use coroutines in plain C, by calling the builtins mapping to the intrinsics described by this proposal, so that your coroutine can look like: #include "Inputs/coro.h" void print(int v); void* f(int n) { CORO_BEGIN(malloc); while (n-- > 0) { print(n+1); CORO_SUSPEND(); } CORO_END(free); } // CHECK-LABEL: @main int main() { void* coro = f(4); CORO_RESUME(coro); CORO_RESUME(coro); CORO_DESTROY(coro); } https://github.com/GorNishanov/clang/blob/coro-rfc/test/Coroutines/coro.c https://github.com/GorNishanov/clang/blob/coro-rfc/test/Coroutines/Inputs/coro.h I prototyped llvm changes to support this proposal and extended clang coroutine implementation [2] to emit proposed intrinsics to smoke test the proposed design. See "More Attention Required" in the doc/Coroutines.rst [4]for details on what additional work is required beyond cleanup and bug fixes. I would like to continue the discussion on the overall design, representation and a roadmap to start upstreaming the changes with the intention to continue the development in tree incrementally. Roadmap: -------- 1) Get agreement on coroutine representation and overall direction (this mail). .. repeat 1) until happy <=== ARE ARE HERE 2) Get agreement on how coroutine transformation passes integrate into the optimizer pipeline. .. repeat 2) until happy 3) update IR/Intrinsics.td + doc/Coroutines.rst + doc/LangRef.rst 4) trickle in coroutine transformation passes + tests in digestible chunks. 5) get clang changes in (sometime after step 3). 6) fix bugs / remove limitations / add functionality to coroutine passes .. repeat 6) until happy Background ========== LLVM coroutines are functions that have one or more `suspend points`. When a suspend point is reached, the execution of a coroutine is suspended and control is returned back to its caller. A suspended coroutine can be resumed to continue execution from the last suspend point or it can be destroyed. In the following example, function `f` (which may or may not be a coroutine itself) returns a handle to a suspended coroutine (**coroutine handle**) that is used by `main` to resume the coroutine twice and then destroy it: .. code-block:: llvm define i32 @main() { entry: %hdl = call i8* @f(i32 4) call void @llvm.coro.resume(i8* %hdl) call void @llvm.coro.resume(i8* %hdl) call void @llvm.coro.destroy(i8* %hdl) ret i32 0 } In addition to the stack frame which exists when a coroutine is executing, there is an additional region of storage that contains values that keep the coroutine state when a coroutine is suspended. This region of storage is called **coroutine frame**. It is created when a coroutine is called and destroyed when a coroutine runs to completion or destroyed by a call to the `coro.destroy`_ intrinsic. Unlike stackful coroutines [3] which maintains a stack per coroutine, an llvm coroutine frame only contains the values that need to persist across suspend points (there is a path from a use to the definition that crosses a suspend point). Overall idea is that a coroutine is presented to an llvm as an ordinary function with suspension points marked up with intrinsics. We let the optimizer party on the coroutine for awhile. Shortly before the coroutine is eligible to be inlined into its callers, we outline parts of the coroutine that correspond to the code that needs to get executed after the coroutine is resumed or destroyed. The coroutine resumption intrinsics get replaced with direct calls to those outlined functions where possible. Then inliner can inline much leaner and nicer coroutine or any of the outlined parts as desired. If we discover that the lifetime of a coroutine is fully enclosed in the lifetime of its caller, we remove dynamic allocation for coroutine frame and replace it with a static `alloca` on the caller's frame. Please see doc/Coroutines.rst for more details: https://github.com/GorNishanov/llvm/blob/coro-rfc/docs/Coroutines.rst Concerns: ========= (This section assumes that you looked at doc/Coroutins.rst) coro.begin: 4 arguments or 5? ----------------------------- Background: For standard allocation functions recognized by LLVM, coroutine frame allocation code looks like: entry: %size = call i32 @llvm.coro.size.i32(i8* null) %alloc = call i8* @malloc(i32 %size) %hdl = call noalias i8* @llvm.coro.begin(i8* %alloc, i32 0, i8* null, i8* null) To enable coroutine heap elision with custom allocation functions, `coro.alloc` intrinsic is used to exclude the allocation code when not needed. entry: %elide = call i8* @llvm.coro.alloc() %0 = icmp ne i8* %elide, null br i1 %0, label %coro.begin, label %coro.alloc coro.alloc: %frame.size = call i32 @llvm.coro.size() %alloc = call i8* @MyAlloc(i32 %frame.size) br label %coro.begin coro.begin: %phi = phi i8* [ %elide, %entry ], [ %alloc, %coro.alloc ] %frame = call i8* @llvm.coro.begin(i8* %phi, i32 0, i8* null, i8* null) When using a static alloca for coroutine frame, coro.begin and coro.alloc intrinsics are replaced with that very alloca (bitcasted to i8* to match the type). To find a coro.alloc matching a particular coro.begin, I hunt through the %phi until I find definition of coro.alloc. Probably more robust implementation need to use alias analysis to check whether the %phi and %elide may alias. A simpler approach would be just to add another parameter to directly point to `coro.alloc`. With this change, coro.begin block above would look like: coro.begin: %phi = phi i8* [ %elide, %entry ], [ %alloc, %coro.alloc ] %frame = call i8* @llvm.coro.begin(i8* %phi, i8 %elide, i32 0, i8* null, ^^^^^^^^^ i8* null) All feedback and comments are greatly appreciated!!! Thank you, Gor References: =========== [1] http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0057r4.pdf [2] http://llvmweekly.org/issue/95 (Coroutine support added to clang) [3] http://www.boost.org/doc/libs/1_61_0/libs/coroutine/doc/html/index.html [4] https://github.com/GorNishanov/llvm/blob/coro-rfc/docs/Coroutines.rst