Testing for kernel performance regressions


It is not uncommon for software projects — free or otherwise — to include a set of tests intended to detect regressions before they create problems for users. The kernel lacks such a set of tests. There are some good reasons for this; most kernel problems tend to be associated with a specific device or controller and nobody has anything close to a complete set of relevant hardware. So the kernel depends heavily on early testers to find problems. The development process is also, in the form of the stable trees, designed to collect fixes for problems found after a release and to get them to users quickly.

Still, there are places where more formalized regression testing could be helpful. Your editor has, over the years, heard a large number of presentations given by large "enterprise" users of Linux. Many of them expressed the same complaint: they upgrade to a new kernel (often skipping several intermediate versions) and find that the performance of their workloads drops considerably. Somewhere over the course of a year or so of kernel development, something got slower and nobody noticed. Finding performance regressions can be hard; they often only show up in workloads that do not exist except behind several layers of obsessive corporate firewalls. But the fact that relatively little testing for such regressions is being done cannot be helping matters.

Recently, Mel Gorman ran an extensive set of benchmarks on a number of machines and posted the results. He found some interesting things that tell us about the types of performance problems that future kernel users may encounter.

His results include a set of scheduler tests, consisting of the "starve," "hackbench," "pipetest," and "lmbench" benchmarks. On an Intel Core i7-based system, the results were generally quite good; he noted a regression in 3.0 that was subsequently fixed, and a regression in 3.4 that still exists, but, for the most part, the kernel has held up well (and even improved) for this particular set of benchmarks. At least, until one looks at the results for other processors. On a Pentium 4 system, various regressions came in late in the 2.6.x days, and things got a bit worse again through 3.3. On an AMD Phenom II system, numerous regressions have shown up in various 3.x kernels, with the result that performance as a whole is worse than it was back in 2.6.32.
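
For those who have not run them, these benchmarks all stress the scheduler's context-switch paths in one way or another; pipetest, for example, times the bouncing of data between two processes over pipes. A minimal sketch of that sort of test (it is not the actual pipetest source) might look like this:

    /* Sketch of a pipetest-style scheduler benchmark: two processes
     * bounce a byte across a pair of pipes, so every round trip
     * forces two context switches.  Illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <time.h>
    #include <sys/wait.h>

    #define ROUNDS 100000

    int main(void)
    {
        int ping[2], pong[2];
        char buf = 'x';
        struct timespec start, end;

        if (pipe(ping) || pipe(pong)) {
            perror("pipe");
            return 1;
        }

        if (fork() == 0) {
            /* Child: close unused ends, then echo each byte back. */
            close(ping[1]);
            close(pong[0]);
            while (read(ping[0], &buf, 1) == 1)
                write(pong[1], &buf, 1);
            exit(0);
        }

        /* Parent: close unused ends, then time the round trips. */
        close(ping[0]);
        close(pong[1]);
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ROUNDS; i++) {
            write(ping[1], &buf, 1);
            read(pong[0], &buf, 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &end);
        close(ping[1]);
        wait(NULL);

        double ns = (end.tv_sec - start.tv_sec) * 1e9
                  + (end.tv_nsec - start.tv_nsec);
        printf("%.0f ns per round trip\n", ns / ROUNDS);
        return 0;
    }

Each iteration is dominated by scheduler overhead, so a regression in the scheduler's fast paths shows up directly in the reported per-round-trip time.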

Mel has a hypothesis for why things may be happening this way: core kernel developers tend to have access to the newest, fanciest processors and are using those systems for their testing. So the code naturally ends up being optimized for those processors, at the expense of the older systems. Arguably that is exactly what should be happening; kernel developers are working on code to run on tomorrow's systems, so that's where their focus should be. But users may not get flashy new hardware quite so quickly; they would undoubtedly appreciate it if their existing systems did not get slower with newer kernels.

Mel also ran the sysbench tool on three different filesystems: ext3, ext4, and xfs. All of them showed some regressions over time, with the 3.1 and 3.2 kernels showing especially bad swapping performance. Thereafter, things started to improve, with the developers' focus on fixing writeback problems almost certainly being a part of that solution. But ext3 is still showing a lot of regressions, while ext4 and xfs have gotten a lot better. The ext3 filesystem is supposed to be in maintenance mode, so it's not surprising that it isn't advancing much. But there are a lot of deployed ext3 systems out there; until their owners feel confident in switching to ext4, it would be good if ext3 performance did not get worse over time.
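
For illustration, the kind of pattern where writeback regressions bite is a stream of small writes that the filesystem must push out to disk. The trivial sketch below (a stand-in for illustration, not sysbench itself, whose file I/O mode is far more elaborate) times a series of 4KB writes, each forced out with fsync():

    /* Trivial write-and-fsync microbenchmark sketching the sort of
     * I/O pattern where filesystem writeback regressions appear.
     * Illustrative only; run it on the filesystem under test. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    #define BLOCK  4096
    #define BLOCKS 1024

    int main(void)
    {
        char buf[BLOCK];
        struct timespec start, end;
        int fd = open("testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(buf, 0, sizeof(buf));

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < BLOCKS; i++) {
            if (write(fd, buf, BLOCK) != BLOCK) {
                perror("write");
                return 1;
            }
            fsync(fd);      /* force writeback for every block */
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double secs = (end.tv_sec - start.tv_sec)
                    + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%d fsynced writes in %.2f seconds\n", BLOCKS, secs);
        close(fd);
        unlink("testfile");
        return 0;
    }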

Another test is designed to determine how well the kernel does at satisfying high-order allocation requests (that is, requests for multiple, physically contiguous pages). The result here is that the kernel did OK and was steadily getting better—until the 3.4 release. Mel says:

This correlates with the removal of lumpy reclaim which compaction indirectly depended upon. This strongly indicates that enough memory is not being reclaimed for compaction to make forward progress or compaction is being disabled routinely due to failed attempts at compaction.

On the other hand, the test does well on idle systems, so the anti-fragmentation logic seems to be working as intended.
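
For the curious, a high-order allocation is requested by passing a nonzero "order" (the base-2 logarithm of the number of pages) to the page allocator; whether it succeeds depends on the reclaim and compaction machinery discussed above. This minimal kernel-module sketch (not Mel's harness, which drives such allocations while the system is under load) shows what those requests look like:

    /* Minimal kernel-module sketch of high-order allocation
     * requests; Mel's actual test exercises these while the
     * system is busy, which is where fragmentation hurts. */
    #include <linux/module.h>
    #include <linux/gfp.h>
    #include <linux/mm.h>

    static int __init highorder_init(void)
    {
        unsigned int order;

        /* Try successively larger physically-contiguous chunks:
         * order 0 = 1 page, order 3 = 8 pages, and so on. */
        for (order = 0; order <= 5; order++) {
            struct page *page = alloc_pages(GFP_KERNEL | __GFP_NOWARN,
                                            order);

            pr_info("order-%u allocation %s\n", order,
                    page ? "succeeded" : "failed");
            if (page)
                __free_pages(page, order);
        }
        return 0;
    }

    static void __exit highorder_exit(void)
    {
    }

    module_init(highorder_init);
    module_exit(highorder_exit);
    MODULE_LICENSE("GPL");

On a freshly booted system these allocations almost always succeed; the interesting failures happen after memory has been fragmented by a mixed workload, which is what the test measures.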

Quite a few other test results have been posted as well; many of them show regressions creeping into the kernel in the last two years or so of development. In a sense, that is a discouraging result; nobody wants to see the performance of the system getting worse over time. On the other hand, identifying a problem is the first step toward fixing it; with specific metrics showing the regressions and when they first showed up, developers should be able to jump in and start fixing things. Then, perhaps, by the time those large users move to newer kernels, these particular problems will have been dealt with.

That optimistic view, though, is somewhat belied by the minimal response to most of Mel's results on the mailing lists. One gets the sense that most developers are not paying a lot of attention to these results, but perhaps that impression is wrong. Possibly developers are far too busy tracking down the causes of the regressions to be chattering on the mailing lists. If so, the results should become apparent in future kernels.

Developers can also run these tests themselves; Mel has released the whole set under the name MMTests. If this test suite continues to advance, and if developers actually use it, the kernel should, with any luck at all, see fewer core performance regressions in the future. That should make users of all systems, large or small, happier.

