[rust-dev] The future of M:N threading

Before getting into the gritty details of why I think we should consider a path away from M:N scheduling, I'll go over the concurrency model we currently use. Rust uses a user-mode scheduler to cooperatively schedule many tasks onto OS threads. Due to the lack of preemption, tasks need to manually yield control back to the scheduler. Performing I/O with the standard library will block the *task*, but yields control back to the scheduler until the I/O is completed.

The scheduler manages a thread pool where the unit of work is a task, rather than a queue of closures to be executed or data to be passed to a function. A task consists of a stack, a register context and task-local storage, much like an OS thread.

In the world of high-performance computing, this is a proven model for maximizing throughput for CPU-bound tasks. By abandoning preemption, there's zero overhead from context switches. For socket servers with only negligible server-side computation, the avoidance of context switching is a boon for scalability and predictable performance.

# Lightweight?

Rust's tasks are often called *lightweight*, but at least on Linux the only optimization is the lack of preemption. Since segmented stacks have been dropped, the resident/virtual memory usage is otherwise identical.

# Spawning performance

An OS thread can actually spawn nearly as fast as a Rust task on a system with one CPU. On a multi-core system, there's a high chance of the new thread being spawned on a different CPU, resulting in a performance loss.
Sample C program, if you need to see it to believe it (note that the pthread functions return an error number rather than setting errno, so the checks are against non-zero):

```
#include <stddef.h>
#include <pthread.h>

static const size_t n_thread = 100000;

static void *foo(void *arg) {
    return arg;
}

int main(void) {
    for (size_t i = 0; i < n_thread; i++) {
        pthread_attr_t attr;
        if (pthread_attr_init(&attr) != 0) {
            return 1;
        }
        if (pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED) != 0) {
            return 1;
        }
        pthread_t thread;
        if (pthread_create(&thread, &attr, foo, NULL) != 0) {
            return 1;
        }
    }
    pthread_exit(NULL);
}
```

Sample Rust program:

```
fn main() {
    for _ in range(0, 100000) {
        do spawn {
        }
    }
}
```

For both programs, I get around 0.9s consistently when pinned to a core. The Rust version drops to 1.1s when not pinned, and the OS thread version to about 2s. The Rust version drops further when asked to allocate 8MiB stacks as C is doing, and will drop more when it has to make the `mmap` and `mprotect` calls the pthread API does.

# Asynchronous I/O

Rust's requirements for asynchronous I/O would be filled well by direct usage of IOCP on Windows. However, Linux only has solid support for non-blocking sockets, because file operations usually just retrieve a result from cache and do not truly have to block. This results in libuv being significantly slower than blocking I/O for most common cases, all for the sake of scalable socket servers.

On modern systems with flash memory, including mobile, there is a *consistent* and relatively small worst-case latency for accessing data on the disk, so blocking is essentially a non-issue. Memory-mapped I/O is also an incredibly important feature for I/O performance, and there's almost no reason to use traditional I/O on 64-bit. However, it's a no-go with M:N scheduling, because the page faults block the thread.
# Overview

Advantages:

* lack of preemptive/fair scheduling, leading to higher throughput
* very fast context switches to other tasks on the same scheduler thread

Disadvantages:

* lack of preemptive/fair scheduling (lower-level model)
* poor profiler/debugger support
* async I/O stack is much slower for the common case; for example, `stat` is 35x slower when run in a loop for an mlocate-like utility
* true blocking code will still block a scheduler thread
* most existing libraries use blocking I/O and OS threads
* cannot directly use the fast and easy-to-use linker-supported thread-local data
* many existing libraries rely on thread-local storage, so there's a need to be wary of hidden yields in Rust function calls, and it's very difficult to expose a safe interface to these libraries
* every level of a CPU architecture adding registers needs explicit support from Rust, and it must be selected at runtime when not targeting a specific CPU (this is currently not done correctly)

# User-mode scheduling

Windows 7 introduced user-mode scheduling[1] to replace fibers on 64-bit. Google implemented the same thing for Linux (perhaps even before Windows 7 was released), and plans on pushing for it upstream.[2] The linked video does a better job of covering this than I can.

User-mode scheduling provides a 1:1 threading model, including full support for normal thread-local data and existing debuggers/profilers. It can yield to the scheduler on system calls and page faults. The operating system is responsible for details like context switching, so a large maintenance/portability burden is dealt with. It narrows down the above disadvantage list to just the point about not having preemptive/fair scheduling, and doesn't introduce any new ones.

I hope this is where concurrency is headed, and I hope Rust doesn't miss the boat by concentrating too much on libuv.
I think it would allow us to simply drop support for pseudo-blocking I/O in the Go style, and ignore asynchronous I/O and non-blocking sockets in the standard library. It may be useful to have the scheduler use them, but it wouldn't be essential.

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/dd627187(v=vs.85).aspx
[2] http://www.youtube.com/watch?v=KXuZi9aeGTw