Understanding the kernel release cycle [Dec. 2nd, 2015|09:14 pm] jwboyer

A few days ago, a Fedora community member was asking if there was a 4.3 kernel for F23 yet (there isn't). When pressed for why, it turns out they were asking for someone that wanted newer support but thought a 4.4-rc3 kernel was too new. This was surprising to me. The assumption was being made that 4.4-rc3 was unstable or dangerous, but that 4.3 was just fine even though Fedora hasn't updated the release branches with it yet. This led me to ponder the upstream kernel release cycle a bit, and I thought I would offer some insights as to why the version number might not represent what most people think it does.



First, I will start by saying that the upstream kernel development process is amazing. The rate of change for the 4.3 kernel was around 8 patches per hour, by 1600 developers, for a total of 12,131 changes over 63 days total[1]. And that is considered a fairly calm release by kernel standards. The fact that the community continues to churn out kernels of such quality with that rate of change is very very impressive. There is actually quite a bit of background coordination that goes on between the subsystem maintainers, but I'm going to focus on how Linus' releases are formed for the sake of simplicity for now.



A kernel release is broken into a set of discrete, roughly time-based chunks. The first chunk is the 2 week merge window. This is the timeframe where the subsystem maintainers send the majority of the new changes for that release to Linus. He takes them in via git pull requests, grumbles about a fair number of them, refuses a few others. Most of the pull requests are dealt with in the first week, but there are always a few late ones so Linus waits the two weeks and then "closes" the window. This culminates in the first RC release being cut.



From that point on, the focus for the release is fixes. New code being taken at this point is fairly rare, but does happen in the early -rc releases. These are cut roughly every Sunday evening, making for a one week timeframe per -rc. Each -rc release tends to be smaller in new changesets than the previous, as the community becomes much more picky on what is acceptable the longer the release has gone on. Typically it gets to -rc7, but occasionally it will go to -rc8. One week after -rc7 is released, the "final" release is cut, which maps nicely with the 63 day timeframe quoted above.



Now, here is where people start getting confused. They see a "final" release and immediately assume it's stable. It's not. There are bugs. Lots and lots of bugs. So why would Linus release a kernel with a lot of bugs? Because finding them all is an economy of scale. Let's step back a second into the development and see why.



During the development cycle, people are constantly testing things. However, not everyone is testing the same thing. Each subsystem maintainer is often testing their own git tree for their specific subsystem. At the same time, they've opened their subsystem trees for changes for the next version of the kernel, not the one still in -rcX state. So they have new code coming in before the current code is even released. This is how they sustain that massive rate of change.



Aside from subsystem trees, there is the linux-next tree. This is daily merge of all the subsystem maintainer trees that have already opened up to new code on top of whatever Linus has in his tree. A number of people are continually testing linux-next, mostly through automated bots but also in VMs and running fuzzers and such. In theory and in practice, this catches bugs before they get to Linus the next round. But it is complicated because the rate of change means that if an issue is hit, it's hard to see if it's in the new new code only found in linux-next, or if it's actually in Linus' tree. Determining that usually winds up being a manual process via git-bisect, but sometimes the testing bots can determine the offending commit in an automated fashion.



If a bug is found, the subsystem maintainer or patch author or whomever must track which tree the bug is, whether it's a small enough fix to go into whatever -rcX state Linus' tree is in, and how to get it there. This is very much a manual process, and often involves multiple humans. Given that humans are terrible at multitasking in general, and grow ever more cautious the later the -rcX state is, sometimes fixes are missed or simply queued for the next merge window. That's not to say important bugs are not fixed, because clearly there are several weeks of -rcX releases where fixing is the primary concern. However, with all the moving parts, you're never going to find all the bugs in time.



In addition to the rate of change/forest of trees issue, there's also the diversity and size of the tester pool. Most of the bots test via VMs. VMs are wonderful tools, but they don't test the majority of the drivers in the kernel. The kernel developers themselves tend to have higher end laptops. Traditionally this was of the Thinkpad variety and a fair number of those are still seen, but there is some variance here now which is good. But it isn't good enough to cover all possible firmware, device, memory, and workload combinations. There are other testers to be sure, but they only cover a tiny fraction of the end user machines.



It isn't hard to see how bugs slip through, particularly in drivers or on previous generation hardware. I wouldn't even call it a problem really. No software project is going to cut a release with 0 bugs in it. It simply doesn't happen. The kernel is actually fairly high quality at release time in spite of this. However, as I said earlier, people tend to make assumptions and think it's good enough for daily use on whatever hardware they have. Then they're surprised when it might not be.



To combat this problem, we have the upstream stable trees. These trees backport fixes from the current development kernels that also apply to the already released kernels. Hence, 4.3.1 is Linus' 4.3 release, plus a number of fixes that were found "too late". This, in my opinion, is where the bulk of the work on making a kernel usable happens. It is also somewhat surprising when you look at it.



The first stable release of a kernel, a .1 release, is actually very large. It is often comprised of 100-200 individual changes that are being backported from the current development kernel. That means there are 100-200 bugs immediately being fixed there. Whew, that's a lot but OK maybe expected with everything above taken into account. Except the .2 release is often also 100-200 patches. And .3. And often .4. It isn't until you start getting into .5, .6, .7, etc that the patch count starts getting smaller. By the .9 release, it's usually time to retire the whole tree (unless it's a long-term stable) and start the fun all over again.



In dealing with the Fedora kernel, the maintainers take all of this into account. This is why it is very rare to see us push a 4.x.0 kernel to a stable release, and often it isn't until .2 that you see a build. For those thinking that this article is somehow deriding the upstream kernel development process, I hope you now realize the opposite is true. We rely heavily on upstream following through and tagging and fixing the issues it finds, either while under development or via the stable process. We help the stable process as well by reporting back fixes if they aren't already found.



So hopefully next time you're itching for a new kernel just because it's been released upstream, you'll pause and think about this. And if you really want to help, you'll grab a rawhide kernel and pitch in by reporting any issues when you find them. The only way to get the stable kernel releases smaller, and reduce the number of bugs still found in freshly released kernels, is to broaden the tester pool and let the upstream developers know as soon as possible. In this way, we're all part of the upstream kernel community and we can all keep making it awesome and impressive.



(4.3.y will likely be coming to F23 the first week of January. Greg-KH seems to have gone on some kind of walkabout the past few weeks so 4.3.1 hasn't been released yet. To be honest, it's a break well deserved. Or maybe he found 4.3 to be even more buggy as usual. Who knows!)



[1] https://lwn.net/Articles/661978/ (Of course I was going to link to lwn.net. If you aren't already subscribed to it, you really should be. They have amazing articles and technical content. They make my stuff look like junk even more than it already is. I'm kinda jealous at the energy and expertise they show in their writing.)