The realtime preemption mini-summit

Did you know...? LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

Prior to the Eleventh Real Time Linux Workshop in Dresden, Germany, a small group met to discuss the further development of the realtime preemption work for the Linux kernel. This "mini-summit" covered a wide range of topics, but was driven by a straightforward set of goals: the continuing improvement of realtime capabilities in Linux and the merging of the realtime preemption patches into the mainline.

The participants were: Stefan Assmann, Jan Blunck, Jonathan Corbet, Sven-Thorsten Dietrich, Thomas Gleixner, Darren Hart, John Kacur, Paul McKenney, Ingo Molnar, Oleg Nesterov, Steven Rostedt, Frederic Weisbecker, Clark Williams, and Peter Zijlstra. Together they represented several companies working in the area of realtime Linux; they brought a lot of experience with customer needs to the table. The discussion was somewhat unstructured - no formal agenda existed - but a lot of useful topics were covered.

Threaded interrupt handlers came out early in the discussion. This feature was merged into the mainline for the 2.6.30 kernel; it is useful in realtime situations because it allows interrupt handlers to be prioritized and scheduled like any other process. There is one part of the threaded interrupt code which remains outside of the mainline: the piece which forces all drivers to use threaded handlers. There are no plans to move that code into the mainline; instead, it's going to be a matter of persuasion to get driver writers to switch to the newer way of doing things.

Uptake in the mainline is small so far; few drivers are actually using this feature. That is beginning to change, though; the SCSI layer is one example. SCSI has always featured relatively heavyweight interrupt-handling code and work done in single-threaded workqueues. This code could move fairly naturally to process context; the SCSI developers are said to be evaluating a possible move toward threaded interrupt handlers in the near future. There have also been suggestions that the network stack might eventually move in that direction.

System management interrupts (SMIs) are a very different sort of problem. These interrupts happen at a very low level in the hardware and are handled by the BIOS code. They often perform hardware monitoring tasks, from simple thermal monitoring to far more complex operations not normally associated with BIOS-level software. SMIs are almost entirely invisible to the operating system and are generally not subject to control at that level, but they are visible in some important ways: they monopolize anything between one CPU and all CPUs in the system for a measurable period of time, and they can change important parameters like the system clock rate. SMIs on some types of hardware can run for surprisingly long periods; one vendor sells systems where an SMI for managing ECC memory runs for 200µs every three minutes. That is long enough to play havoc with any latency deadlines that the operating system is trying to meet.

Dealing with the SMI problem is a challenge. Some hardware allows SMIs to be disabled, but it's never clear what the consequences of doing so might be; if the CPU melts into a puddle of silicon, the resulting latencies will be even worse than before. Sharing information about SMI problems can be hard because many of the people working in this area are working under non-disclosure agreements with the hardware vendors; this is unfortunate, because some vendors have done a far better job of avoiding SMI-related latencies than others. There is a tool now (hwlat_detector) which can measure SMI latency, so we should start seeing more publicly-posted information on this issue. And, with luck, vendors will start to deal with the problem.

Not all hardware latency is caused by SMIs; hypervisors, too, can be a significant source of latency problems.

A related issue is hardware changes imposed by SMI handlers. If the BIOS determines that the system is overheating, it may respond by slowing the clock rate or lowering the processor voltage. On a throughput-oriented system, that may well be the right thing to do. When latencies are important, though, slowing the processor could be a mistake - it could cause applications to miss their deadlines. A better response might be to simply shut down some processors while keeping others at full speed. What is really needed here is a way to get this information to user space so that policy decisions can be made there.

Testing is always an issue in this kind of software development; how do the developers know that they are really making things better? There are various test suites out there (RTMB, for example), but there is no complete and integrated test suite. There was some talk of trying to move more of the realtime testing code into the Linux Test Project, but LTP is a huge body of code. So the realtime tests might remain on their own, but it would be nice, at least, to standardize test options and output formats to help with the automation of testing. XML output from test programs is favored by some, but it is fair to say that XML is not universally loved in this crowd.

The big kernel lock is a perennial outstanding issue for realtime development for a couple of reasons. One is that, despite having been pushed out of much of the core code, the BKL can still create long latencies. The other is that elimination of the BKL would appear to be part of the price for an eventual merge of sleeping spinlocks into the mainline kernel. The ability to preempt code running under the BKL was removed in 2.6.26; this change was directly motivated by a performance regression caused by the semaphore rewrite, but it was also seen as a way to help inspire BKL-removal efforts by those who care about latencies.

Much of the hard work in getting rid of the BKL has been done; one big outstanding piece is the conversion of reiserfs being done by Frederic Weisbecker. After that, what's left is a lot of grunt work: figuring out what (if anything) is protected by a lock_kernel() call and putting in proper locking. The "tip" tree has a branch (rt/kill-the-bkl) where this work can be coordinated and collected.

Signal delivery is still not an entirely solved problem. Actually, signals are always a problem, for implementers and users alike. In the realtime context, signal delivery has some specific latency issues. Signal delivery to thread groups involves an O(n) algorithm to determine which specific thread to target; getting through this code can create excessive latencies. There are also some locks in the delivery path which interfere with the delivery of signals in realtime interrupt context.

Everybody agrees that the proper solution is to avoid signals in applications whenever possible. For example, timerfd() can be used for timer events. But everybody also agrees that applications will continue to use signals, so they have to be made to work somehow. The probable solution is to remove much of the work from the immediate signal delivery path. Signal delivery would just enqueue the information and set a bit in the task structure; the real work would then be done in the context of the receiving process. That work might still be expensive, but it would at least fall to the process which is actually using signals instead of imposing latencies on random parts of the system.

A side discussion on best practices for efficient realtime application development yielded a few basic recommendations. The best API to use, it turns out, is the basic pthread interface; it has been well optimized over time. SYSV IPC is best avoided. Cpusets work better than the affinity mechanism for CPU isolation. In general, developers should realize that getting the best performance out of a realtime system will require a certain amount of manual tuning effort. Realtime Linux allows the prioritization of things like interrupt handlers, but the hard work of figuring out what those priorities should be can only be done by developers or administrators. It was acknowledged that the interfaces provided to administrators currently are not entirely easy to use; it can be hard to identify interrupt threads, for example. Red Hat's tuna tool can help in this regard, but more needs to be done.

Scalability was a common theme at the meeting. As a general rule, realtime development has not been focused specifically on scalability issues. But there is interest in running realtime applications on larger systems, and that is bringing out problems. The realtime kernel tends to run into scalability problems before the mainline kernel does; it was described as an early warning system which highlights issues that the mainline will be dealing with five years from now. So realtime will tend to scale more poorly than mainline, but fixing realtime's problems will eventually benefit mainline users as well.

Darren Hart presented a couple of charts containing the results of some work by John Stultz showing the impact of running the realtime kernel on a 24-processor system. When running in anything other than uniprocessor mode, the realtime kernel imposes a roughly 50% throughput penalty on a suitably pathological workload - a severe price. Interestingly, if the locking changes from the realtime kernel are removed while leaving all of the other changes, most of the performance loss goes away. This has led Darren to wonder if there should be a hybrid option available for situations where hard latency requirements are not present.

In other situations, the realtime kernel generally shows performance degradation starting with eight CPUS, with sixteen showing unacceptable overhead.

As it happens, nobody really understands where the performance cost of realtime locking comes from. It could be in the sleeping spinlocks, but there is also a lot of suspicion directed at reader-writer locks. In the mainline kernel, rwlocks allow multiple readers to run in parallel; in the realtime kernel, instead, only one reader runs at a time. That change is necessary to make priority inheritance work; priority inheritance in the presence of multiple readers is a difficult problem. One obvious conclusion that comes from this observation is that, perhaps, rwlocks should not implement priority inheritance. There is resistance to that idea, though; priority inheritance is important in situations where the highest-priority process should always run as quickly as possible.

The alternative to changing rwlocks is to simply stop using them whenever possible. The usual way to remove an rwlock is to replace it with a read-copy-update scheme. Switching to RCU will improve scalability, arguably at the cost of increasing complexity. But before embarking on any such effort, it is important to get a handle on how much of the problem really comes down to rwlocks. Some research will be done in the near future to better understand the source of the scalability problems.

Another problem is per-CPU variables, which work by disabling preemption while a specific variable is being used. Disabling preemption is anathema to the realtime developers, so per-CPU variables in the realtime tree are protected by sleeping locks instead. That increases overhead. The problem is especially acute in slab-level memory allocators, which make extensive use of per-CPU variables.

Solutions take a number of forms. There will eventually be a more realtime-friendly slab allocator, probably a variant of SLQB. Minimizing the use of per-CPU variables in general makes sense for realtime. There are also schemes involving the creation of multiple virtual "CPUs" so that even processes running on the same processor can have their own "per-CPU" variables. That decreases contention for those variables considerably at the cost of a slightly higher cache footprint.

Plain old locks can also be a problem; a run of dbench on a 16-processor system during the workshop showed a 90% reduction in throughput, with the processors sitting idle half the time. The problem in this case turns out to be dcache_lock , one of the last global spinlocks remaining in the kernel. The realtime tree feels the effects of this lock more strongly for a couple of reasons. One is that threads holding the lock can be preempted; that leads to longer lock hold times and more context switches. The other is that sleeping spinlocks are simply more complicated, especially in the contended slow path of the code. So the locking primitives themselves require more CPU time.

The solution to this particular problem can only be the elimination of the global dcache_lock . Nick Piggin has a patch set which does exactly that, but it has not yet been tested with the realtime tree.

Realtime makes life harder for the scheduler. On a normal system, the scheduler can optimize for overall system throughput. The constraints imposed by realtime, though, require the scheduler to respond much more aggressively to events. So context switches are higher and processes are much more likely to migrate between CPUs - better for bounded response times, but worse for throughput. By the time the system scales up to something relatively large - 128 CPUs, say - there does not seem to be any practical way to get consistently good decisions from the scheduler.

There is some interest in deadline-oriented schedulers. Adding an "earliest deadline first" or related scheduler could be useful for application developers, but nobody seems to feel that a deadline scheduler would scale better than the current code.

What all this means is that realtime applications running on that kind of system must be partitioned. When specific CPUs are set aside for specific processes, the scheduling problem gets simpler. Partitioning requires real work on the part of the administrator, but it seems unavoidable for larger systems.

It doesn't help that complete CPU isolation is still hard to accomplish on a Linux system. Certain sorts of operations, such as workqueue flushes, can spill into a processor which has been set aside for specific processes. In general, anything involving interrupts - both device interrupts and inter-processor interrupts - is a problem when one is trying to dedicate a CPU to a task. Steering device interrupts to a given processor is not that hard, though the management tools could use improvement. Inter-processor interrupts are currently harder to avoid; code generating IPIs needs to be reviewed and, when possible, modified to avoid interrupting processors which do not actually have work to do.

Integrating interrupt management into the current cpuset and control group code would be useful for system administrators. That seems to be a harder task; Paul Jackson, the original cpuset developer, was strongly opposed to trying to include interrupt management there. There's a lack of good abstractions for this kind of administration, though the generic IRQ layer helps. The opinion at the meeting seemed to be that this was a solvable problem; if it can be solved for the x86 architecture, the other architectures will eventually follow.

Going to a fully tickless kernel is also an important step for full CPU isolation. Some work has recently been done in that direction, but much remains to be done.

Stable kernel ABI concerns made a surprising appearance. The "enterprise" Linux offerings from distributors generally include a promise that the internal kernel interface will not change. The realtime enterprise distributions have been an exception to this rule, though; the realtime code is simply in too much flux to make such a promise practical. This exemption has made life easier for developers working on that code, naturally; it also has made it possible for customers to get the newest code much more quickly. There are some concerns that, once the remaining realtime code is merged into the mainline, the same kernel ABI constraints may be imposed on realtime distributions. It is not clear that this needs to happen, though; realtime customers seem to be more interested in keeping up with newer technology and more willing to put up with large changes.

Future work was discussed briefly. Some of the things remaining to be done include:

More SMP work, especially on NUMA systems.

A realtime idle loop. There is the usual tension there between preserving the best response time and minimizing power consumption.

Supporting hardware-assisted operations - things like onboard cryptographic acceleration hardware.

Elimination of the timer tick.

Synchronization of clock events across CPUs. Clock synchronization is always a challenging task. In this case, it's complicated by the fact that a certain amount of clock skew can actually be advantageous on an SMP system. If clock events are strictly synchronized, processors will be trying to do things at the same time and lock contention will increase.

A near-future issue is spinlock naming. Merging the sleeping spinlock code requires a way to distinguish between traditional, spinning locks and the newer type of lock which might sleep on a realtime system. The best solution, in theory, is to rename sleeping locks to something like lock_t , but that would be a huge change affecting many thousands of files. So the realtime developers have been contemplating a new name for non-sleeping locks instead. There are far fewer of these locks, so renaming them to something like atomic_spinlock would be much less disruptive.

There was some talk of the best names for "atomic spinlocks"; they could be "core locks," "little kernel locks," or "dread locks." What really came out of the discussion, though, is that there was a fair amount of confusion regarding the two types of locks even in this group, which understands them better than anybody else. That suggests that some extra care should go into the naming, with the goal of making the locking semantics clear and discouraging the use of non-sleeping locks. If the semantics of spinlock_t change, there is a good argument that the name should also change. That supports the idea of the massive lock renaming, regardless of how disruptive it might be.

Whether such a change would be accepted is an open question, though. For now, both the small renaming and the massive renaming will be prepared for review. The issue may then be taken to the kernel summit in October for a final decision.

Tools for realtime developers came up a couple of times. There are a number of tools for application optimization now, but they are scattered and not always easy to use. And, it is said, there needs to be a tool with a graphical interface or a lot of users simply will not take it seriously. The "perf" tool, part of the kernels "performance events" subsystem, seems poised to grow into this role. It can handle many of the desired tasks - latency tracing, for example - now, and new features are being added. The "tuna" tool may be extended to provide a nicer interface to perf.

User-space tracepoints seem to be high on the list of desirable features for application developers. Best would be to integrate these tracepoints with ftrace somehow. Alternatively, user-space trace data could be collected separately and integrated with kernel trace data at postprocessing time. That leads to clock synchronization issues, though, which are never easy to solve.

The final part of the meeting became a series of informal discussions and hacking efforts. The participants universally saw it as a worthwhile gathering, with much learned by all. There are some obvious action items, including more testing to better understand scalability problems, increasing adoption of threaded interrupt handlers, solving the spinlock naming problem, improving tools, and more. Plenty of work for all to do. But your editor has been assured that the work will be done and merged in the next year - for real this time.

