Replacement of deprecated kernel APIs

Did you know...? LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

The kernel community tries to never change the user-space API in ways that will break applications, but it explicitly allows any internal API to be changed at any time if a solid technical reason to do so exists. But that doesn't mean that such changes are easy to do. At the 2018 Kernel Maintainers Summit, Kees Cook led a discussion on the challenges he has encountered when trying to effect large-scale API changes and what might be done to make such changes go more smoothly.

There are, Cook said, two common ways of doing a large API transition: too fast and too slow. As an example of the former, he mentioned the timer initialization change, which took three development cycles to prepare and he said, threatened to give him repetitive strain injuries. When changes are done quickly then, as far as the rest of the community is concerned, thousands of patches tend show up at once. Those patches tend to not see the light of day until they are thought to be ready, and they can result in a lot of merge conflicts once they surface; as a result, these patches often do not get enough testing before going into the mainline.

Doing things too quickly is almost always a bad idea, he said, but it's not always clear what the best way to proceed is. There was some talk of adding Coccinelle scripts (which are often used to make such changes in the first place) to linux-next for testing. The 0day testing system already runs the existing scripts, so it would not be too hard to add another for use in linux-next only.

On the other extreme, some changes can take years to complete; this process is too slow and can be painful as well. If an old API hangs around for years, new users will be continually introduced, making overall forward progress hard. He has written a "deprecated APIs" document that is set to enter the mainline in this merge window; he hopes it will help developers to avoid using APIs that are on the verge of removal.

In the past, deprecated APIs have been marked with the __deprecated attribute, which generated warnings when callers of that API were compiled. The warnings were voluminous and tended to overwhelm any useful information, so they were turned off a little while back. Cook does not want to turn them back on, but he does think that the __deprecated marker serves as useful code documentation, so he would like to see developers continue to add them, which has stopped happening.

There was some talk about putting checks into the checkpatch.pl script to warn about adding new calls to deprecated APIs. That would be useful in some settings, but it cannot catch everything; the use of variable-length arrays (VLAs) is an example where checkpatch.pl is not able to help. Herbert Xu said that the 0day tester only sends notifications for new warnings added by a patch, so the __deprecated warnings could be usefully turned on there, even if they are off for everybody else.

Linus Torvalds asked about which API changes are pending now. Cook responded that the VLA removal work is just about done, but Arnd Bergmann said that he has just discovered a few more that need to be dealt with. There is the marking of implicit fall-through code in switch statements; that is an example of a slow change that has a while to go yet. It may be at the point where warnings for unannotated fall-throughs could be turned on in testing settings, at least.

The conversation shifted to the elimination of BUG() and BUG_ON() calls. These functions will kill the entire machine, making any sort of recovery impossible and debugging difficult; adding new calls risks provoking Torvalds to violate his recent conduct-related promises. In almost all cases, one of the variants of WARN_ON() should be used instead, he said; the most paranoid users can set the "panic on warn" command-line option if they really do not want the system to continue after a warning is issued. Perhaps, he suggested, current BUG() calls could all be turned into warnings now.

Cook responded that callers to BUG() do not expect that call to return, so surprising things could happen if its behavior is changed. Torvalds said that such code is buggy, but this also shouldn't be a problem since that code is not being hit in current kernels anyway (or users would be complaining). Ted Ts'o suggested explicitly deprecating BUG() and replacing it with a call whose behavior is configurable; Cook said that he already has such a thing in the form of a function called check_data_corruption() , which callers have to be prepared to see a return from.

Christoph Hellwig said that the current BUG() behavior can be useful in a number of settings; it produces useful information in crash dumps or on a serial console, for example. He suggested that there is space for three types of behavior: emit a warning, kill the machine outright, or warn by default but be configurable to crash the system instead. The third behavior is the one that is missing now. Torvalds was adamant, though, that he does not want kernel developers to have the option of killing the machine; it's not something that most users have any way of coping with. Even he doesn't use a serial console anymore, so when the machine stops, he gets upset.

Bergmann, having done a quick search, noted that there are around 10,000 BUG() and BUG_ON() calls in the kernel now. Downgrading those to warnings would be likely to add a lot of new security holes. Torvalds countered that those calls are already a security hole, but others pointed out that they are currently a denial-of-service hole rather than something worse.

Greg Kroah-Hartman raised the problem of core kernel API code that will use WARN_ON_ONCE() to complain about bad usage; that will not generate the desired result if WARN_ON_ONCE() is configured to crash the machine. He was told that the code should just call pr_warn() instead, and that the called function should return an error in such situations. It was generally agreed that any WARN_ON() or WARN_ON_ONCE() calls that can be triggered from user space need to be fixed.

At that point, the group ran low on things to talk about, as evidenced by the fact that the conversation turned to the 80-character line limit. Some developers would like to see it increased; others disagree. Everybody seemed to agree, though, that fixing line-length problems to keep checkpatch.pl happy leads to worse code most of the time.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the Maintainers Summit.]

