I’d like to talk about some nitty-gritty details of GHC’s thread scheduling, discovered over the course of working on stride scheduling for GHC. Most of these choices are merely implementation details and are not part of any specification. While these choices shouldn’t be relied upon, they are worth knowing, since many of these details were accreted over the course of many performance bugs, benchmark tests and other battles. In this post, I’ll attempt to give some historical insight into why many choices were made. These insights should generalize to any system that would like to implement green threads, lightweight threads that use less memory than traditional operating system threads. For space reasons, I’m not going to talk about STM or sparks (though they are also quite interesting).

Update: A large portion of this material has been incorporated into the scheduler page of the GHC commentary.

Anatomy of a thread

I’d first like to give some brief background about the runtime system and point out some perhaps nonintuitive design choices. GHC represents a thread with a TSO (thread-state object), i.e. the StgTSO struct in includes/rts/storage/TSO.h. [1] In Haskell, TSOs can be passed around as ThreadId objects. The Stg in front of the struct name indicates that TSOs are garbage collected, like other closures in Haskell. The TSO, along with the stack allocated with it (STACK), constitutes the primary memory overhead of a thread. The initial stack size, in particular, is controlled by the GC flag -ki and defaults to 1k. [2] Threads are run by Capabilities, which can be thought of as virtual cores managed by GHC. Capabilities are, in turn, mapped to true operating system threads, or Tasks, though we won’t talk about them much.

Being garbage collected has two major implications for TSOs. First, TSOs are not GC roots, so they will get GC'd if nothing is holding on to them (e.g. in the case of deadlock), and their space is not automatically reclaimed when they finish executing. [3] Usually, a TSO will be retained by a Capability’s run queue (a GC root), or by the list of threads waiting on some concurrency variable, e.g. an MVar. Second, a TSO must be considered a mutable object, and is thus subject to the conventional GC write barriers necessary for any mutable object in a generational garbage collector. [4] The dirty bit tracks whether or not a TSO has been modified; it is set whenever a thread is run and whenever any of the pointer fields of a TSO are modified. Two of those fields, set by setTSOLink and setTSOPrev, are of particular interest to the scheduler.
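To make the "TSOs are not GC roots" point concrete, here is a small illustrative sketch (my own example, not from the RTS): a thread blocks on an MVar that only it can reach, so the garbage collector finds the TSO unreachable and the runtime delivers BlockedIndefinitelyOnMVar to it:

```haskell
import Control.Concurrent
import Control.Exception

main :: IO ()
main = do
  result <- newEmptyMVar
  _ <- forkIO $ do
    -- This inner MVar is reachable only from this thread, so once we
    -- block on it, nothing retains the TSO and deadlock detection
    -- resurrects the thread with an asynchronous exception.
    e <- try (newEmptyMVar >>= takeMVar)
    putMVar result (e :: Either BlockedIndefinitelyOnMVar ())
  -- result is still reachable from the blocked thread's stack, so the
  -- main thread itself is not considered deadlocked.
  takeMVar result >>= print
```

This should print `Left thread blocked indefinitely in an MVar operation` once the deadlock-detecting GC kicks in; note that the thread is retained just long enough to receive the exception, exactly because TSOs are ordinary heap objects rather than roots.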

Run queue

The run queue is at the heart of the scheduler: any runnable thread will hit the run queue before the scheduler actually pops it off the queue and runs it. There is one per Capability (rts/Capability.h); in the bad old days there was a global run queue, but it performed badly for multithreaded processes. The queue is implemented as a doubly-linked list between run_queue_hd and run_queue_tl. [6] The head and tail pointers mean that the queue is actually a deque: this is important because the scheduler often has to handle threads that were interrupted in some way, and should let them get back on quickly. The links themselves live on the TSOs and are modified with setTSOLink and setTSOPrev, so modifying the queue dirties the TSOs involved. [7] Otherwise, the run queue is exclusively owned by the scheduler. If there are idle Capabilities and we have more than one thread left in our run queue, threads will be pushed to other queues with schedulePushWork.

Threads are put in front (pushOnRunQueue) if:

A stack overflow occurs;

A heap overflow occurs; [8]

A task attempts to run a thread, but it is bound and the current task is the wrong one;

A thread is associated with a black hole (a thunk that is being evaluated), and another thread, possibly on another capability, has blocked on its evaluation (see ticket #3838);

In the threaded runtime, if a thread was interrupted because another Capability needed to do a stop-the-world GC (see commit 6d18141d8 );

In the non-threaded runtime, when a thread waiting on IO unblocks.

Threads are put in back (appendToRunQueue) in the case of pre-emption, or if the thread is new; in particular, if:

A thread was pre-empted via the context switch flag (e.g. an incoming message from another thread, the timer fired, or the thread cooperatively yielded; see also [8] on how this interacts with heap overflows);

It is a new thread (so large amounts of thread creation do not starve old threads, see conc004 and commit 05881ecab );

A thread becomes unblocked;

A thread is migrated to another capability (though, in this case, the queue was empty anyway);

A thread finishes, but for some reason we need to keep it around (this is related to in-calls, though I’m not 100% sure what is going on here; if you know, please tell me!)
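The two insertion disciplines above can be modelled with an ordinary functional deque. This is only an illustrative sketch using Data.Sequence, with Haskell names mirroring the C functions in the RTS; the real queue is a C doubly-linked list threaded through the TSOs themselves:

```haskell
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (<|), (|>), ViewL(..))

-- A toy model of a Capability's run queue.
type RunQueue t = Seq t

-- Interrupted threads (stack/heap overflow, blackhole blocking, wrong
-- task, etc.) go back on the front, so they resume as soon as possible.
pushOnRunQueue :: t -> RunQueue t -> RunQueue t
pushOnRunQueue t q = t <| q

-- Pre-empted, freshly created, and newly unblocked threads go on the back.
appendToRunQueue :: RunQueue t -> t -> RunQueue t
appendToRunQueue q t = q |> t

-- The scheduler always pops the next thread to run from the front.
popRunQueue :: RunQueue t -> Maybe (t, RunQueue t)
popRunQueue q = case Seq.viewl q of
  EmptyL    -> Nothing
  t :< rest -> Just (t, rest)
```

For example, if an interrupted thread is pushed on the front while a new thread is appended to the back, the interrupted thread is the one that runs next; this is precisely the fairness-versus-latency trade-off the two entry points encode.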

Benchmarks

Benchmarks like nofib are very important, even if they are synthetic, as they are often construed as primary evidence of whether a change to the scheduler speeds things up or slows them down. One reason is that it is much easier to tell why a short program that torture-tests threads has slowed down than it is to tell why a large, complicated multithreaded program no longer seems very snappy. But really, the main motivation is convenience: nofib programs are easy to measure and easy to compare. Fortunately, the tests often measure something quite specific, so I’d like to describe the tests that compose the smp nofib suite here:

callback001 (also known as ffi014) performs a large number of incalls to Haskell from C from a large number of threads. This is a rather specific test, related to how we keep threads on the run queue even after they have finished, if they finished in an in-call.

callback002 measures how quickly we can perform incalls to Haskell from C.

chan measures how scheduling order affects memory usage: if threads are allowed to run for a while without being context switched, they build up data in channels. This is related to when we reset the context switch flag (see [8]).

sieve implements the Sieve of Eratosthenes, spawning many threads to evaluate thunks of a lazy list in parallel. It performs a bit of allocation, and is sensitive to what happens to threads after a HeapOverflow.

threads001 tests how quickly we can create a thread and then context switch to it.

threads003 tests how quickly many threads can communicate by reading and writing MVars. It is a bit sensitive to what happens to threads after they wake up from sleeping.

threads006 tests how quickly threads can be created and destroyed, as well as throwTo blocking performance. It is very sensitive to the number of major GCs that occur (which can be influenced if TSO size changes).

threads007 generates a lot of threads waiting on MVars, and then sees how shutdown behavior is affected. It suffered from bad behavior in the MVar queue, which was fixed in f4692220c7.
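To give a flavor of these workloads, here is a hypothetical miniature in the style of the chan benchmark (my own sketch, not the actual nofib source): a producer fills a channel while a consumer waits to drain it, so the longer the producer runs between context switches, the more data piles up in the channel:

```haskell
import Control.Concurrent
import Control.Concurrent.Chan

main :: IO ()
main = do
  ch   <- newChan
  done <- newEmptyMVar
  -- Producer: if it is not context-switched away, it can run far ahead
  -- of the consumer, inflating the channel (and memory usage).
  _ <- forkIO $ do
    mapM_ (writeChan ch) [1 .. 1000 :: Int]
    putMVar done ()
  takeMVar done
  -- Consumer: drain everything the producer wrote.
  xs <- mapM (const (readChan ch)) [1 .. 1000 :: Int]
  print (sum xs)  -- 500500
```

Benchmarks in this family are scored on residency as much as on wall-clock time, which is why scheduler changes that merely reorder context switches can show up as memory regressions here.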