Interview: the return of the realtime preemption tree

Did you know...? LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

The realtime preemption project is a longstanding effort to provide deterministic response times in a general-purpose kernel. Much code resulting from this work has been merged into the mainline kernel over the last few years, and a number of vendors are shipping commercial products based upon it. But, for the last year or so, progress toward getting the rest of the realtime work into the mainline has slowed.

On February 11, realtime developers Thomas Gleixner and Ingo Molnar resurfaced with the announcement of a new realtime preemption tree and a newly reinvigorated development effort. Your editor asked them if they would be willing to answer a few questions about this work; their response went well beyond the call of duty. Read on for a detailed look at where the realtime preemption tree stands and what's likely to happen in the near future.

LWN: The 2.6.29-rc4-rt1 announcement notes that you're coming off a 1.5-year sabbatical. Why did you step away from the RT patches so long; have you been hanging out on the beach in the mean time? :)

Thomas: We spent a marvelous time at the x86 lagoon, a place with an extreme contrast of antiquities and modern art. :) : We spent a marvelous time at the x86 lagoon, a place with an extreme contrast of antiquities and modern art. :) Seriously, we underestimated the amount of work which was necessary to bring the unified x86 architecture into shape. Nothing to complain about; it definitely was and still is a worthwhile effort and I would not hesitate longer than a fraction of a second to do it again. Ingo: Yeah, hanging out on the beach for almost two years was well-deserved for both of us. We met Linus there and it was all fun and laughter, with free beach cocktails, pretty sunsets and camp fires. [ All paid for by the nice folks from Microsoft btw., - those guys sure know how to please a Linux kernel hacker! ;-) ]

So what has brought you back to the realtime work at this time?

Thomas: Boredom and nostalgia :) In fact I never lost track of the real time work since we took over x86 maintenance, but my effort was restricted to decode hard to solve problems and make sure that the patches were kept in a usable shape. Right now I have the feeling that we need to put more development effort into preempt-rt again to keep its upstream visibility and make progress on merging the remaining parts. : Boredom and nostalgia :) In fact I never lost track of the real time work since we took over x86 maintenance, but my effort was restricted to decode hard to solve problems and make sure that the patches were kept in a usable shape. Right now I have the feeling that we need to put more development effort into preempt-rt again to keep its upstream visibility and make progress on merging the remaining parts. The most important reason for returning was of course our editors challenge in The Grumpy Editor's guide to 2009: "The realtime patch set will be mostly merged by the end of the year..." Ingo: When we left for the x86 land more than 1.5 years ago, the -rt patch-queue was a huge pile of patches that changed hundreds of critical kernel files and introduced/touched ten thousand new lines of code. Fast-forward 1.5 years and the -rt patchqueue is a humungous pile of patches that changes nearly a thousand critical kernel files and introduces/touches twenty-thirty thousand lines of code. So we thought that while the project is growing nicely, it is useful and obviously people love it - the direction of growth was a bit off and that this particular area needs some help. Initially it started as a thought experiment of ours: how much time and effort would it take to port the most stable -rt patch (.26-rt15) to the .29-tip tree and could we get it to boot? Turns out we are very poor at thought experiments (just like we are pretty bad at keeping patch queues small), so we had to go and settle the argument via some hands-on hacking. Porting the queue was serious fun, it even booted after a few dozen fixes, and the result was the .29-rt1 release. Maintaining the x86 tree for such a long time and doing many difficult conceptual modernizations in that area was also very helpful when porting the -rt patch-queue to latest mainline. Most of the code it touched and most of the conflicts that came up looked strangely familiar to us, as if those upstream changes went through our trees =;-) (It's certainly nothing compared to the beach experience though, so we are still looking at returning for a few months to a Hawaii cruise.)

How well does the realtime code work at this point? What do you think are the largest remaining issues to be tackled?

Thomas: The realtime code has reached quite a stable state. The 2.6.24/26 based versions can definitely be considered production ready. I spent a lot of time to sort out a huge amount of details in those versions to make them production stable. Still we need to refactor a lot of the patches and look for mainline acceptable solutions for some of the real time related changes. : The realtime code has reached quite a stable state. The 2.6.24/26 based versions can definitely be considered production ready. I spent a lot of time to sort out a huge amount of details in those versions to make them production stable. Still we need to refactor a lot of the patches and look for mainline acceptable solutions for some of the real time related changes. Ingo: To me what settled quite a bit of "do we need -rt in mainline" questions were the spin-mutex enhancements it got. Prior that there were a handful of pretty pathologic workload scenarios where -rt performance tanked over mainline. With that it's all pretty comparable. The patch splitup and patch quality has improved too, and the queue we ported actually builds and boots at just about every bisection point, so it's pretty usable. A fair deal of patches fell out of the .26 queue because they went upstream meanwhile: tracing patches, scheduler patches, dyntick/hrtimer patches, etc. It all looks a lot less scary now than it looked 1.5 years ago - albeit the total size is still considerable, so there's definitely still a ton of work with it.

What are your current thoughts with regard to merging this work into the mainline?

Thomas: First of all we want to integrate the -rt patches into our -tip git repository which makes it easier to keep -rt in sync with the ongoing mainline development. The next steps are to gradually refactor the patches either by rewriting or preferably by pulling in the work which was done in Stevens git-rt tree, split out parts which are ready and merge them upstream step by step. : First of all we want to integrate the -rt patches into our -tip git repository which makes it easier to keep -rt in sync with the ongoing mainline development. The next steps are to gradually refactor the patches either by rewriting or preferably by pulling in the work which was done in Stevens git-rt tree, split out parts which are ready and merge them upstream step by step. Ingo: IMO the key thought here is to move the -rt tree 'ahead of the upstream development curve' again, and to make it the frontier of Linux R&D. With a 2.6.26 basis that was arguably hard to do. With a partly-2.6.30 basis (which the -tip tree really is) it's a lot more ahead of the curve, and there are a lot more opportunities to merge -rt bits into upstream bits wherever there's accidental upstream activity that we could hang -rt related cleanups and changes onto. We jumped almost 4 full kernel releases, that moves -rt across 1 year worth of upstream development - and keeps it at that leading edge. Another factor is that most of the top -rt contributors are also -tip contributors so there's strong synergy. The -tip tree also undergoes serious automated stabilization and productization efforts, so it's a good basis for development _and_ for practical daily use. For example there were no build failures reported against .29-rt1, and most of the other failures that were reported were non-fatal as well and were quickly fixed. One of the main things we learned in the past 1.5 years was how to keep a tree stable against a wild, dangerous looking flux of modifications. YMMV ;-)

Thomas once told me about a scheme to patch rtmutex locks into/out of the kernel at boot time, allowing distributors to ship a single kernel which can run in either realtime or "normal" mode. Is that still something that you're working on?

Thomas: We are not working on that right now, but it is still on the list of things which need to be investigated. : We are not working on that right now, but it is still on the list of things which need to be investigated. Ingo: That still sounds like an interesting feature, but it's pretty hard to pull it off. We used to have something rather close to that, a few years ago: a runtime switch that turned the rtmutex code back into spinning code. It was fragile and hard to maintain and eventually we dropped it. Ideally it should be done not at boot time but runtime - via the stop-machine-run mechanism or so. [extended perhaps with hibernation bits that force each task into hitting user-mode, so that all locks in the system are released] It's really hard to implement it, and it is definitely not for the faint hearted.

The RT-preempt code would appear to be one of the biggest exceptions to the "upstream first" rule, which urges code to be merged into the mainline before being shipped to customers. How has that worked out in this case? Are there times when it is good to keep shipping code out of the mainline for such a long time?

Thomas: It is an exception which was only acceptable because preempt-rt does not introduce new user space APIs. It just changes the run time behaviour of the kernel to a deterministic mode. : It is an exception which was only acceptable because preempt-rt does not introduce new user space APIs. It just changes the run time behaviour of the kernel to a deterministic mode. All changes which are user space API related (e.g. PI futexes) were merged into mainline before they got shipped to customers via preempt-rt and all bug fixes and improvements of mainline code were sent upstream immediately. Preempt-rt was never a detached project which did not care about mainline. When we started preempt-rt there was huge demand on the customer side - both enterprise and embedded - for an in kernel realtime solution. The dual kernel approaches of RTAI, RT-Linux and Xenomai had no chance to get ever accepted into the mainline and the handling of the dual kernel environment has never been an easy task. With preempt-rt you just switch the kernel under a stock mainline user space environment and voila your application behaves as you would expect - most of the time :) Dual kernel environments require different libraries, different APIs and you can not run the same binary on a non -rt enabled kernel. Debugging preempt-rt based real time applications is exactly the same as debugging non real time applications. [PULL QUOTE: While we never had doubts that it would be possible to turn Linux into a real time OS, it was clear from the very beginning that it would be a long way until the last bits and pieces got merged. END QUOTE] While we never had doubts that it would be possible to turn Linux into a real time OS, it was clear from the very beginning that it would be a long way until the last bits and pieces got merged. The first question Ingo asked me when I contacted him in the early days of preempt-rt was: "Are you sure that you want to touch every part of the kernel while working on preempt-rt?". This question was absolutely legitimate; in the first days of preempt-rt we really touched every part of the kernel due to problems which were mostly locking and preemption related. The fixes have been merged upstream and especially in the locking area we got a huge improvement in mainline due to lock debugging, conversion to mutexes, etc. and a general better awareness of locking and preemption semantics. preempt-rt was always a great breeding ground for fundamental changes in the kernel and so far quite a large part of the preempt-rt development has been integrated into the mainline: PI-futexes, high-resolution timers ... I hope we can keep that up and provide soon more interesting technological changes which emerged originally from the preempt-rt efforts. Ingo: Preempt-rt turns the kernel's scheduling, lock handling and interrupt handling code upside down, so there was no realistic way to merge it all upstream without having had some actual field feedback. It is also unique in that you need _all_ those changes to have the new kernel behavior - there's no real gradual approach to the -rt concept itself. That adds up to a bit of a catch-22: you don't get it upstream without field use, and you don't get field use without it being upstream. Deterministic execution is a major niche, one of which was not effectively covered by the mainstream kernel before. It's perhaps the last major technological niches in existence that the stock upstream kernel does not handle yet, and it's no wonder that the last one out is in that precise position for conceptually hard reasons. In short: all the easy technologies are upstream already ;-) Nevertheless we strictly got all user-ABI changes upstream first: PI-futexes in particular. The rest of -rt is "just" a new kernel option that magically turns kernel execution into deterministic mode.

Where would be the best starting point for a developer who wishes to contribute to this effort?

Thomas: Nothing special with the realtime patches. Just kernel development as usual: get the patches, apply them, run them on your machine and test. If problems arise, provide bug reports or try to fix them yourself and send patches. Read through the code and start providing improvements, cleanups ... : Nothing special with the realtime patches. Just kernel development as usual: get the patches, apply them, run them on your machine and test. If problems arise, provide bug reports or try to fix them yourself and send patches. Read through the code and start providing improvements, cleanups ... Ingo: Beyond the "try it yourself, follow the discussions, and go wherever your heart tells you to go" suggestion, there's a few random areas that might need more attention: Big Kernel Lock removal. It's critical for -rt. We still have the tip:core/kill-the-BKL branch, and if someone is interested it would be nice to drive that effort forward. A lot of nice help-zap-the-BKL patches went upstream recently (such as the device-open patches), so we are in a pretty good position to try the kill-the-BKL final hammer approach too. [I have just done a (raw!) refresh and conflict resolution merge of that tree to v2.6.29-rc5. Interested people can find it at: git pull \ git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git \ core/kill-the-BKL Warning: it might not even build. ]

Warning: it might not even build. ] Look at Steve's git-rt tree and split out and gradually merge bits. A fair deal of stuff has been cleaned up there and it would be nice to preserve that work.

Latency measurements and tooling. Go try the latency tracer, the function graph tracer and ftrace in general. Try to find delays in apps caused by the kernel (or caused by the app itself), and think about whether the kernel's built-in tools could be improved.

Try Thomas's cyclictest utility and try to trace and improve those worst-case latencies. A nice target would be to push the worst-case latencies on a contemporary PC below 10 microseconds. We were down to about 13 microseconds with a hack that threaded the timer IRQ with .29-rt1, so it's possible to go below 10 microseconds i think.

And of course: just try to improve the mainline kernel - that will improve the -rt kernel too, by definition :-) But as usual, follow your own path. Independent, critical thinking is a lot more valuable than follow-the-crowd behavior. [As long as it ends up producing patches (not flamewars) that is ;-)] And by all means, start small and seek feedback on lkml early and often. Being a good and useful kernel developer is not an attribute but a process, and good processes always need time, many gradual steps and a feedback loop to thrive.

Many thanks to Thomas and Ingo for taking the time to answer (in detail!) this long list of questions.

