Introduction

Back in February last year, I wrote about my attempts at porting the Causal Profiling algorithm to the Go runtime.

Since then, development has stalled, but I did find some time recently to update it to the newest Go version. I realized during the rebase that the patchset could be simplified greatly and give far more accurate results. In this blog post, I'm going to go into some of the details and counterintuitive findings from this simplification.

For the rest of this post, I assume some familiarity with Causal Profiling and the Go scheduler, which I also wrote about some time back.

What went wrong?

Causal Profiling works by performing experiments. A line is chosen, and we start a profiler that samples when that line is being executed. When the sampling signal fires on that line, the algorithm delays execution of all other threads in the system. The net effect is that we virtually speed up the chosen line. During the experiment, we also measure some code that we've instrumented, to find out whether it got faster. This would usually be something like an important HTTP request that we want to optimize. Once the experiment is finished, we take the delays added by the profiler and subtract them from the instrumented measurements, and that's our result.

This leaves us in the position of choosing what constitutes a thread. The Go runtime has three things that are sufficiently thread-like: Gs, Ms and Ps. Gs are goroutines, each with a stack and an instruction pointer. They're the thing representing the Go code that you've written. Ms are operating system threads. They're tasked with executing the Gs. For various reasons, an M might not be executing Go code, the most common one being that it's in a syscall. That leaves us with Ps. Ps are the scheduling contexts. As Ms move into syscalls, Ps are handed off between them to make sure that there are always GOMAXPROCS Ms executing Go code at any given time.

Choosing the M as the Causal Profiling thread is difficult. Ms tend not to be executing at all times, and having to deal with the syscall machinery makes them an inaccurate fit.

Choosing the P means that threads are always busy, removing a large part of the bookkeeping needed for Causal Profiling. It would seem that it's the perfect fit, but it does have some disadvantages. During the time that the other Ps are executing their delays, the sped-up P would be free to steal work off them. The execution would then not be delayed, which in turn would mean that there'd be a speed-up, even though the G causing the delay and the Gs executed subsequently wouldn't be causally connected.

It breaks down even further when you consider the case where there's only one P. A speed-up being applied then means that no delays are being inserted, turning the algorithm into a regular profiler.

To avoid these cases of implicit causal connections I ended up with the G as the thread. Since Gs are only ever unblocked by other Gs, it ended up simplifying the causal connections implementation significantly.

The part where it didn't work so well was the implementation of delays. Sleeping goroutines are implemented with a single timer process that sits on a thread by itself and enqueues goroutines to the scheduler when their sleeps have expired. This allows the runtime to have only one sleeping OS thread but thousands of sleeping goroutines, significantly reducing the amount of resources needed for handling timers.

For Causal Profiling, every time a given goroutine was scheduled, I'd check if it needed to be delayed and put it into the queue of the timer process. This led to lots of lock contention on the internal state of the timer process. Additionally, since it involved two round trips into the scheduler for every goroutine execution, the scheduling overhead increased. This led to inaccurate delays and made the results noisy and susceptible to misinterpretation.

A realization

The noisy measurements were a large part of why I mostly abandoned the project. I've been rebasing patches so that they apply cleanly on new Go releases, but otherwise, I've found little time to do any development.

During the latest rebase I had a realization. If you're using Ps as the thread for the purposes of Causal Profiling, the implicit causal connections are actually a feature. They end up modeling the effects that would happen if the currently executing thread actually did get sped up.

Consider the 1 P scenario I described above. There is only one thread ever doing any work on executing goroutines and it can only execute one goroutine at a time. In such a scenario, speeding up any given execution correlates with a 1-to-1 speed-up of the final program. The causal connection being inherited by goroutines running on that P is exactly the end result we want.

This extends to the multiple P scenario. Say you have one P that has just had a speed-up applied to it and another one executing its delay. If the sped-up P finishes its current task and then work steals from the other P, that reduces the queuing delay that the stolen task would experience. That lines up perfectly with what would happen if an actual speed-up happened.

The implementation

Moving the thread concept onto Ps raises an issue. Causal Profiling requires that threads that are unblocked by other threads are credited with the delays executed by the unblocking thread. However, Ps never directly interact. Gs interact with other Gs, and Ps are just sitting in a loop, finding work to do and executing it.

The solution to this issue is to treat Gs like tokens. Every time a G is executed on a P, we tag it with the delay count of that P. If it then gets scheduled onto another P, that P inherits the delay count the G carries.
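The token idea can be sketched outside the runtime like this. The P and G types below are toy stand-ins, not the runtime's real structs, and I'm assuming that "inherit" means taking the larger of the two counts, so that a P is never asked to repeat delays that were already executed elsewhere on behalf of the G it picks up.

```go
package main

import "fmt"

// P is a toy stand-in for a scheduling context (not the runtime's p struct).
type P struct {
	delays int64 // delays this P has executed or been credited with
}

// G is a toy stand-in for a goroutine (not the runtime's g struct).
type G struct {
	delays int64 // delay count of the P that last ran this G
}

// run models one scheduling round of g on p. executed is the number of
// delays p performs while g is running on it.
func (p *P) run(g *G, executed int64) {
	// Inherit: if g carries a higher count than p, credit p with it.
	// Taking the maximum is an assumption of this sketch.
	if g.delays > p.delays {
		p.delays = g.delays
	}
	// p executes some delays of its own while g is running.
	p.delays += executed
	// Tag g with p's count so the next P to run it can inherit.
	g.delays = p.delays
}

func main() {
	p0, p1 := &P{}, &P{}
	g := &G{}

	p0.run(g, 3) // p0 executes 3 delays while running g
	p1.run(g, 0) // g moves to p1, which inherits those 3 delays

	fmt.Println("p0:", p0.delays, "p1:", p1.delays, "g:", g.delays)
}
```

After g moves over, p1 has been credited with the three delays executed on p0, even though the two Ps never interacted directly.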

Delays no longer involve the Go scheduler. Just like in the original Causal Profiling paper, we now execute the delays inside the signal handler. The OS scheduler is now handling these delays, which it does much more precisely. Since there are only going to be GOMAXPROCS Ps executing delays at any given time, the volume of sleeping threads is also kept to a minimum.

Evaluation

Based on my tiny experiments so far, the results are way less noisy than before and can actually be used to guide optimization.

The other thing that I've noticed is that Ps are way more causally connected than Gs. Because of this, delays tend to be handed off more between threads, meaning less delay overall. The end result is that speed-ups have a bigger effect. This made me suspicious at first, because I didn't want to fall into the trap of evaluating a profiling algorithm more favorably because it gives a bigger effect. However, thinking through the queuing delays and the execution model, I've convinced myself that the bigger effect is just it being more accurate.

If you want to play with Causal Profiling and aren't afraid of applying patches to the core Go runtime you're running, you can find it on my GitHub. If you need some help getting started, feel free to reach out to me at the email address on the sidebar.

Aside: If you'd like to work with someone who does these kinds of experiments, I'm currently available for hire. Have a look at my CV, and if you find it interesting, you can reach me at the email address on the sidebar.