How 3.6 nearly broke PostgreSQL

Benefits for LWN subscribers The primary benefit from subscribing to LWN is helping to keep us publishing, but, beyond that, subscribers get immediate access to all site content and access to a number of extra site features. Please sign up today!

pgbench

In mid-September, the 3.6 kernel appeared to be stabilizing nicely. Most of the known regressions had been fixed, the patch volume was dropping, and Linus was relatively happy. Then Nikolay Ulyanitsky showed up with a problem : thePostgreSQL benchmark ran 20% slower than under 3.5. The resulting discussion shows just how hard scalability can be on contemporary hardware and how hard scheduling can be in general.

Borislav Petkov was able to reproduce the problem; a dozen or so bisection iterations later he narrowed down the problem to this patch, which was duly reverted. There is just one little problem left: the offending patch was, itself, meant to improve scheduler performance. Reverting it fixed the PostgreSQL regression, but at the cost of losing an optimization that improves things for many (arguably most) other workloads. Naturally, that led to a search to figure out what the real problem was so that the optimization could be restored without harmful effects on PostgreSQL.

What went wrong

The kernel's scheduling domains mechanism exists to optimize scheduling decisions by modeling the costs of moving processes between CPUs. Migrating a process from one CPU to a hyperthreaded sibling is nearly free; cache is shared at all levels, so the moved process will not have to spend time repopulating cache with its working set. Moving to another CPU within the same physical package will cost more, but mid-level caches are still shared, so such a move is still much less expensive than moving to another package entirely. The current scheduling code thus tries to keep processes within the same package whenever possible, but it also tries to spread runnable processes across the package's CPUs to maximize throughput.

The problem that the offending patch (by Mike Galbraith) was trying to solve comes from the fact that the number of CPUs built into a single package has been growing over time. Not too long ago, examining every processor within a package in search of an idle CPU for a runnable process was a relatively quick affair. As the number of CPUs in a package increases, the cost of that search increases as well, to the point that it starts to look expensive. The current scheduler's behavior, Mike said at the time, could also result in processes bouncing around the package excessively. The result was less-than-optimal performance.

Mike's solution was to organize CPUs into pairs; each CPU gets one "buddy" CPU. When one CPU wakes a process and needs to find a processor for that process to run on, it examines only the buddy CPU. The process will be placed on either the original CPU or the buddy; the search will go no further than that even if there might be a more lightly loaded CPU elsewhere in the package. The cost of iterating over the entire package is eliminated, process bouncing is reduced, and things run faster. Meanwhile, the scheduler's load balancing code can still be relied upon to distribute the load across the available CPUs in the longer term. Mike reported significant improvements in tbench benchmark results with the patch, and it was quickly accepted for the 3.6 development cycle.

So what is different about PostgreSQL that caused it to slow down in response to this change? It seems to come down to the design of the PostgreSQL server and the fact that it does a certain amount of its own scheduling with user-space spinlocks. Carrying its own spinlock implementation does evidently yield performance benefits for the PostgreSQL project, but it also makes the system more susceptible to problems resulting from scheduler changes in the underlying system. In this case, restricting the set of CPUs on which a newly-woken process can run increases the chance that it will end up preempting another PostgreSQL process. If the new process needs a lock held by the preempted process, it will end up waiting until the preempted processes manages to run again, slowing things down. Possibly even worse is that preempting the PostgreSQL dispatcher process — also more likely with Mike's patch — can slow the flow of tasks to all PostgreSQL worker processes; that, too, will hurt performance.

Making things better

What is needed is a way to gain the benefits of Mike's patch without making things worse for PostgreSQL-style loads. One possibility, suggested by Linus, is to try to reduce the cost of searching for an idle CPU instead of eliminating the search outright. It appears that there is some low-hanging fruit in this area, but it is not at all clear that optimizing the search, by itself, will solve the entire problem. Mike's patch eliminates that search cost, but it also reduces movement of processes around the package; a fix that only addresses the first part risks falling short in the end.

Another possibility is to simply increase the scheduling granularity, essentially giving longer time slices to running processes. That will reduce the number of preemptions, making it less likely that PostgreSQL processes will step on each other's toes. Increasing the granularity does, indeed, make things better for the pgbench load. There may be some benefit to be had from messing with the granularity, but it is not without its risks. In particular, increasing the granularity could have an adverse effect on desktop interactivity; there is no shortage of Linux users who would consider that to be a bad trade.

Yet another possibility is to somehow teach the scheduler to recognize processes — like the PostgreSQL dispatcher — that should not be preempted by related processes if it can be avoided. Ingo Molnar suggested investigating this idea:

Add a kernel solution to somehow identify 'central' processes and bias them. Xorg is a similar kind of process, so it would help other workloads as well. That way lie dragons, but might be worth an attempt or two.

The problem, of course, is the dragons. The O(1) scheduler, used by Linux until the Completely Fair Scheduler (CFS) was merged for 2.6.23, had, over time, accumulated no end of heuristics and hacks designed to provide the "right" kind of scheduling for various types of workloads. All these tweaks complicated the scheduler code considerably, making it fragile and difficult to work with — and they didn't even work much of the time. This complexity inspired Con Kolivas's "staircase deadline scheduler" as a much simpler solution to the problem; that work led to the writing (and merging) of CFS.

Naturally, CFS has lost a fair amount of its simplicity since it was merged; contact with the real world tends to do that to scheduling algorithms. But it is still relatively free of workload-specific heuristics. Opening the door to more of them now risks driving the scheduler in a less maintainable, more brittle direction where nothing can be done without a significant chance of creating problems in unpredictable places. It seems unlikely that the development community wants to go there.

A potentially simpler alternative is to let the application itself tell the scheduler that one of its processes is special. PostgreSQL could request that its dispatcher be allowed to run at the expense of one of its own workers, even if the normal scheduling algorithm would dictate otherwise. That approach reduces complexity, but it does so by pushing some of the cost into applications. Getting application developers to accept that cost can be a challenge, especially if they are interested in supporting operating systems other than Linux. As a general rule, it is far better if things just work without the need for manual intervention of this type.

So, in other words, nobody really knows how this problem will be solved at this time. There are several interesting ideas to pursue, but none that seem like an obvious solution. Further research is clearly called for.

One good point in all of this is that the problem was found before the final 3.6 kernel shipped. Performance regressions have a way of hiding, sometimes for years, before they emerge to bite some important workload. Eventually, tools like Linsched may help to find more of these problems early, but we will always be dependent on users who will perform this kind of testing with workloads that matter to them. Without Nikolay's 3.6-rc testing, PostgreSQL users might have had an unpleasant surprise when this kernel was released.

