This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Two separate talks, at two different venues, give us a look into the kinds of testing that the Intel graphics team is doing. Daniel Vetter had a short presentation as part of the Testing and Fuzzing microconference at the Linux Plumbers Conference (LPC). His colleague, Martin Peres, gave a somewhat longer talk, complete with demos, at the X.Org Developers Conference (XDC). The picture they paint is a pleasing one: there is lots of testing going on there. But there are problems as well; that amount of testing runs afoul of bugs elsewhere in the kernel, which makes the job harder.

Developing for upstream requires good testing, Peres said. If the development team is not doing that, features that land in the upstream kernel will be broken, which is not desirable. Using continuous integration (CI) along with pre-merge testing allows the person making a change to make sure they did not break anything else in the process of landing their feature. That scales better as the number of developers grows, and it allows developers to concentrate on feature development rather than bug fixing when someone else finds the problem. It also promotes a better understanding of the code base; developers learn more "by breaking stuff", which lets them see the connections and dependencies between different parts of the code.

CI also helps keep the integration tree working at all times. That means developers can rebase frequently so that their patches do not slowly fall behind as the feature is developed. They don't have to fight with breakage in the integration tree because it is tested and working. The rebases are generally small, since they are frequent, which allows catching and fixing integration and other problems early on.

Intel's CI testing has a number of objectives. It is meant to provide an accurate view of the current state of the hardware and software. In order to do that, there are some requirements on the test results. They must include all of the information needed to diagnose and reproduce any problems found, including the software and hardware configurations and log files. They also need to be produced quickly so that developers can get fast feedback on a proposed change. To that end, Intel has two levels of testing: one that gives results in 30 minutes or less and another that is more complete but takes up to half a day.

The results need to be visible. Publishing the test results on a web site would work, he said, but it is better to make the results hard to miss. If the patch being tested is something that was posted to a mailing list, the results should be posted as a reply. "Put the data where the developer is." In addition, false positive test failures make developers less apt to believe in the test results, so they must be avoided. Any noise in the results needs to be aggressively suppressed.

Vetter went into some detail on the actual tests that are being run. The fast test suite uses IGT, which consists mostly of Intel-specific tests, on actual hardware. There are 250 test cases that run on 30 machines. That suite takes about ten minutes to run, but may not complete that quickly depending on the load on the testing systems.

The next step is the full test suite, which takes six hours of machine time. It tests against multiple trees, including the mainline, linux-next, and various direct rendering manager (DRM) subsystem trees (drm-tip, fixes, next-fixes). Those tests run with lockdep and the kernel address sanitizer (KASAN) enabled. There is a much bigger list of a few thousand test cases that gets run as well, he said. The tests, results, and more are all available from the Intel GFX CI web page.

Pre-merge testing is one of the more interesting and important parts of the test system. It picks up patches from the mailing lists using a modified version of Patchwork and runs the test suites on kernels with those patches applied. If the fast tests pass, that kicks off the full test; meanwhile the patches can start to be reviewed. Metrics such as test-system usage, test latency compared to the 30-minute and six-hour objectives, and bugs found are also tracked and reported. The test matrix (example) was "much smaller and full of red" a year ago, Vetter said. "CI is definitely awesome".
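
The gating logic is simple to state; as a hedged sketch (the result values and action strings here are placeholders, not Intel's actual tooling), it amounts to:

```shell
# Two-stage gate: the long full suite is only queued when the fast
# (~30-minute) suite passes; a fast failure goes straight back to
# the patch submitter on the mailing list.
premerge_gate() {
    fast_result=$1   # "pass" or "fail" from the fast suite
    if [ "$fast_result" = pass ]; then
        echo "queue-full-suite"   # kick off the six-hour run
    else
        echo "report-failure"     # reply to the patch on the list
    fi
}
```

Review can proceed in parallel with the full run, since the fast gate has already caught the most obvious breakage.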

Peres filled in some more of the details. He said there were 40 different systems (up from Vetter's 30) and 19 different graphics platforms. Those range from 2004's Gen 3 to the announced, but not yet available, Gemini Lake. There are a number of "sharded" machines, which are six or eight duplicate systems that can be used to parallelize the testing. The number of tests run has increased from around 22,000 per day in August 2016 to 400,000 per day in August 2017.

There are some things that have been learned along the way, Vetter said. Noise, in the form of false positives, will kill any test system. "Speed matters"; if it takes three days to get initial results, people will ignore them. In addition, Vetter said that pre-merge testing is the only time to catch regressions. Once a feature has been merged, it is impossible to get it reverted because people get attached to their work, but they don't always fix any regressions in a timely manner.

Linux-next is difficult to test with because it requires lots of fixup patches and reverts to get to something that will function. Part of the Intel testing does suspend-resume cycles on the machines, which finds a lot of regressions in other subsystems, Vetter said. Greg Kroah-Hartman suggested posting those regressions as a reply to the linux-next announcement, but Vetter said there would be way too much noise with that approach.

Beyond that, trees like linux-next make bisecting problems way too hard. It takes "a lot of hand holding" to make bisect work so that these regressions can be found. It takes even more work to get the subsystem maintainers to fix them, so the Intel team ends up reverting things or changing the configuration to avoid those problems. They do try to report them to the maintainers, but the root of the problem is that some subsystem maintainers put unready code into linux-next in the hopes that others will test it for them; that makes it less than entirely useful as an integration tree.
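
The usual shape of that hand-holding is a wrapper script for git bisect run, which treats exit status 125 as "this commit cannot be tested, skip it". A minimal sketch of the verdict logic (the build/boot/test inputs are stand-ins for real commands such as the kernel build and an IGT run):

```shell
# Decide what to tell "git bisect run" about the current commit.
# Exit 0 = good, 1-124 = bad, 125 = untestable, skip it.
bisect_verdict() {
    build_ok=$1   # did the kernel build?
    boot_ok=$2    # did it boot on the test machine?
    test_ok=$3    # did the graphics test pass?

    if [ "$build_ok" != yes ] || [ "$boot_ok" != yes ]; then
        return 125          # unbuildable or unbootable: skip, don't blame
    fi
    [ "$test_ok" = yes ] && return 0 || return 1
}
```

A real script would derive those values from make, a boot harness, and the test suite; git bisect start followed by git bisect run ./script.sh then drives the search automatically, skipping over the commits that a broken linux-next leaves untestable.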

The problems are generally outside of the graphics subsystem and are often exposed by the thousands of suspend-resume cycles that are part of the Intel graphics testing. The 0-Day test service does not do suspend and resume testing, though Kroah-Hartman suggested that it be added if there is an automatic way to test it. Regressions are hard to get fixed even in graphics, Vetter said, and reverts are difficult to sell. That's why pre-merge testing to find problems before the code gets merged is so important.

Peres also had a list of lessons learned, some of which, unsurprisingly, overlapped Vetter's. For one thing, kernel CI testing is unlike CI testing for any other project because it requires booting the machine, sometimes with broken kernels. In the year since the project started, they have realized that anything that is not being continuously tested is likely broken. Once again, this is something that is part and parcel of kernel testing because there are so many different configuration options, many of which cannot even be tested without the right hardware.

New tools were needed, as Bugzilla is not a good fit for tracking test failures. Peres has been working on a CI bug log tool to fill that gap. He hopes to release the code for it by the end of the year once the database format has stabilized. It is also important that the developers own the CI system and that the CI team works for them. It should not be a separate team that reports to a different manager outside of the development team. As the developers start to see the value of the CI system, they will suggest improvements to the system and the tests that will help make the testing better.

Other graphics teams that have an interest in replicating the Intel work should have an easier time of it because much of the groundwork has been laid, Peres said. There is still a need for infrastructure and hardware, of course, along with software for building kernels, deploying them, scheduling test jobs, and the like. Several components are particularly important, including ways to power-cycle systems and to resume them from suspend—an external watchdog will also help by restarting systems that are no longer responsive. There is also a need for qualified humans to file bug reports, respond quickly to them, and fix the underlying problems.

There are a number of challenges that are specific to CI testing for the kernel, Peres said. The first is that various kernels will not boot or function properly; it could be the network drivers, filesystem corruption, or something else that may be difficult to automatically diagnose. Getting tracing and log information out of the system, especially during suspend and resume failures or if there is a random crash while running the tests, can be difficult. Using pstore on EFI systems and serial consoles on the others will provide a way to get information out of a broken system. Note that memory corruption can lead to all sorts of nastiness, including trashing the partitions of the disk, so an automated way to re-deploy the system will be needed.
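
As a sketch of the pstore side of that, records written by a crashing kernel reappear as files after reboot and can be harvested before the next test run (the directory is parameterized here so the logic can be exercised anywhere; /sys/fs/pstore is the usual location):

```shell
# Copy out any crash records pstore preserved across the last reboot,
# then delete them so the limited backing store is free for the next
# crash. On real EFI systems the records live in /sys/fs/pstore.
PSTORE_DIR=${PSTORE_DIR:-/sys/fs/pstore}
OUT_DIR=${OUT_DIR:-./crash-logs}

collect_pstore() {
    mkdir -p "$OUT_DIR"
    found=0
    for f in "$PSTORE_DIR"/*; do
        [ -f "$f" ] || continue       # nothing was recorded
        cp "$f" "$OUT_DIR/"           # keep a copy for diagnosis
        rm -f "$f"                    # release the pstore slot
        found=$((found + 1))
    done
    echo "$found record(s) collected"
}
```

Deleting the records matters: the backing store (EFI variables, for instance) is small, and full slots mean the next crash leaves no trace.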

Slides [PDF] and a YouTube video of Peres's presentation are available for interested readers.

[I would like to thank the Linux Foundation and the X.Org Foundation for travel assistance to Los Angeles for LPC and Mountain View for XDC.]

Comments (33 posted)

What kind of cell phone would emerge from a concerted effort to design privacy in from the beginning, using free software as much as possible? Some answers are provided by a crowdfunding campaign launched in August by Purism SPC, which has used two such campaigns successfully in the past to build a business around secure laptops. The Librem 5, with a five-inch screen and radio chip for communicating with cell phone companies, represents Purism's hope to bring the same privacy-enhancing vision to the mobile space, which is much more demanding in its threats, technology components, and user experience.

The abuse of mobile phone data has become a matter of worldwide concern. The capture and sale of personal data by apps is so notorious that it has been covered in USA Today; concerns over snooping contribute to the appeal of WhatsApp (which has topped 1.3 billion users) and other encrypted and privacy-conscious apps. But apps are only one attack vector. I got in touch with Todd Weaver, founder and CEO of Purism, to find out what the company is doing to plug the leaks in mobile devices.

Many free operating systems have been developed for mobile devices; the best-known is probably the Ubuntu phone, which never really got off the ground. A combined approach with both hardware and software to maximize security, however, is a new idea. Purism is built on a philosophy of protecting its users, and has registered as a social purpose corporation to emphasize its commitment to benefiting customers. Although less than three weeks are left in the Librem 5 campaign, Weaver is confident that it will reach its goal. The company also recently announced that it will port the KDE Plasma framework to the phone, in addition to its support for GTK+ and GNOME.

The design principles for the Librem 5 provide a valuable model for building privacy by design into devices. This article looks at the various levels of privacy protection, from bottom to top.

Hardware protections

Weaver explained to me that the radio components that current phones use to communicate with the baseband provider (AT&T, for instance) share a chip with the phone's CPU. This gives the mobile provider—and anyone else who gets access to the chip, whether government agents with warrants or malicious intruders—complete access to everything the user does on the phone.

We don't know whether baseband providers exploit the unprecedented access they have to users' private data, but Weaver plans on offering more peace of mind by separating the CPU running apps from the CPU that communicates with the baseband provider. Thus, the provider has no access to app data unless the data is transmitted unencrypted. Interestingly, this architectural choice hearkens back to early cell phones, which also ran the apps and baseband on separate CPUs.

Numerous reports describe malware that secretly records user activity from the device's camera or GPS. There is little one can do to protect against such attacks on current devices. Some laptops contain physical, hardware switches that allow the user to turn off WiFi, but they are becoming less common. And even the laptops that offer such switches do a halfhearted job of disabling the device, simply setting a standard software bit that disables the connection between the WiFi device and the PCI bus. A malicious app might be able to turn access back on. In fact, a simple software bug can leave the hardware capabilities vulnerable.

Purism plans to add three or four physical kill switches to the side of the Librem 5 phone. The architecture envisioned by Weaver is simple: the switches will cut power to the devices, making it look as if the devices don't exist at all. No software can turn a device on once the user sets the switch. The user can verify this because the device disappears from the output of commands such as lspci and lsusb. There will be one switch to turn off WiFi and Bluetooth, another for the radio to the baseband provider, and a third for the camera. The camera switch is particularly useful on a mobile device because few people want to put tape over their cameras on such devices. A possible fourth switch will turn off GPS.
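
That verification is easy to script; a hedged sketch (the device names are examples, and the bus listing is passed in as text so the check is not tied to one tool):

```shell
# Check whether a device still appears in a bus listing such as the
# output of lsusb or lspci; a kill switch that truly cuts power makes
# the device disappear from enumeration entirely.
device_powered_off() {
    listing=$1    # e.g. "$(lsusb)" or "$(lspci)"
    pattern=$2    # e.g. "Camera" or a vendor:product ID
    if printf '%s\n' "$listing" | grep -qi -- "$pattern"; then
        return 1  # still enumerated: power is not cut
    fi
    return 0      # not enumerated: consistent with power being off
}
```

On a real device this would be invoked as, say, device_powered_off "$(lsusb)" Camera after flipping the camera switch.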

Purism will offer a high-resolution photo of the Librem 5 motherboard, so that a user can compare it to the motherboard in their device and catch attacks where someone substitutes a different motherboard. Although the Librem 5 is not open hardware, Purism may open the schematics of older models at some point.

Trusted systems are a double-edged sword, widely distrusted in the free software community because of their potential for heavy-handed copyright restrictions and disabling access to software that the manufacturer wants to suppress for any reason. Yet free software advocates understand that in the right hands, trusted systems offer protection from malicious apps. A Trusted Platform Module (TPM) is not part of Purism's current initiative, but it plans to support TPM in the future, while putting control over keys in the hands of the user.

Software protections

Purism's operating system, called PureOS, will be based on the Debian distribution. Purism's staff (which includes several Debian developers) will keep PureOS in sync with future releases of Debian. If the company makes any enhancements to the Linux kernel or the Debian distribution that would be of value to the community, Purism will contribute them back upstream. Except for WiFi and Bluetooth, which may require a binary driver for the form factor of the Librem 5 phone, Purism plans to use free software for all the devices on the phone. The company may also be able to reverse-engineer and free some drivers or firmware, as it did for its laptops.

Purism handles data security by making sure to store data in an encrypted format. For all communications, including phone calls, it provides the popular free Matrix software. Any two correspondents who use Matrix-based applications, such as the Riot chat tool, have strong privacy guarantees. These applications can recognize when you are communicating with a correspondent who doesn't use Matrix, and fall back on communicating in the clear. So you can still call a friend or company who doesn't have a secure client; you just don't get the protection of encryption. The Librem 5 could also potentially join a mesh network of secure devices that communicate without the need for centralized, proprietary network providers.

Although operating systems enforce isolation between processes, graphical user interfaces (GUIs) tend to be more lax, including that artifact of the permissive 1980s, the X Window System. One X client can easily view data passed to others. The free-software community, including GNOME and KDE, is therefore moving to a more secure display protocol called Wayland, whose compositors are more careful to check which window is meant to receive input and to ensure that the input goes only to that window. Practically speaking, app isolation means that an exploit in an app cannot compromise other apps. Thanks to Wayland, isolation is the default on the Librem 5.

Business model

Although Purism uses crowdfunding campaigns for major new ventures such as the Librem 5, it has developed a robust business plan that supports the continued maintenance and development of software through the sales of its hardware along with some angel investment (often from its own customers). The company is three years old and gives back money to important middleware components like GNOME as well as to app developers for apps requested by users.

Purism will set up its own software store, offering free software apps that have been vetted to ensure they uphold the company's commitment to privacy. Other app developers can offer apps outside the store, and users will still benefit from app isolation and the other protections in the Librem 5. There is a precedent for a secure app store in CopperheadOS, but it is based on Android, which does not contain the same protections that Purism plans to build into the Librem 5.

I asked Weaver whether he is worried about government interference. He pointed out that competing forces pull both governments and corporations in different directions: while some actors want to snoop on the public, others recognize the need to protect their own communications and the value of being part of a community that protects its data. It's worth remembering the role the Navy played, for instance, in the development of the Tor onion routing network—the Navy didn't create Tor, as is sometimes claimed, but did offer funding. Weaver is already working with sympathetic government agencies that want his equipment. We can expect a secure phone to be greeted enthusiastically from many quarters.

Comments (46 posted)

The GNU Privacy Guard (GnuPG) is one of the fundamental tools that allows a distributed group to have trust in its communications. Werner Koch, lead developer of GnuPG, spoke about it at Kernel Recipes: what's in the new 2.2 version, when older versions will reach their end of life, and how development will proceed going forward. He also spoke at some length on the issue of best-practice key management and how GnuPG is evolving to assist.

It is less than three years since attention was focused on the perilous position of GnuPG; because of the community's systematic failure to fund its development, Koch was considering packing it all in. The Snowden revelations persuaded him to keep going a little longer; then, in the wake of Heartbleed, there was a resurgent interest in funding the things we all rely on. Heartbleed led to the founding of the Core Infrastructure Initiative (CII). A grant from CII, commitments from several companies and other organizations, and an upsurge in community funding have put GnuPG on a more secure footing going forward.

GnuPG 2.1 has been around for nearly three years, so it's time for something new: GnuPG 2.2 was released in August, with the bug-fix 2.2.1 version following in mid-September. GnuPG 2.0 will reach its end of life around the end of 2017. Support for PGP-2 keys has been removed from GnuPG 2.2 because of their age and their heavy use of the MD5 hash. That said, the venerable GnuPG 1.4 will continue to be maintained for its portability to pre-POSIX systems and because people with data encrypted with old PGP-2 keys will want some way to decrypt that data going forward. The GnuPG 2.3 development branch will be opened, to go forward hand-in-hand with proposed improvements to the underlying standard, including a SHA-256-based fingerprint and new default algorithms.

Koch then moved into Elliptic Curve Cryptography (ECC), which he discussed at some length. RSA, he said, is not likely to stay secure for much longer without really large keys. Support for 4096-bit RSA keys has been in GnuPG for some time, but Koch contends that real security will require 16K-bit keys; that makes keys, fingerprints, and signatures all unusably long, particularly for embedded devices and hardware security modules (HSMs).

So instead he's moving toward ECC ciphers, which are well-researched — more so than RSA, according to Koch. Instead of key size, ECC applications speak of using different curves, and while for RSA the same implementation can be used for many key sizes, with ECC each curve needs to be implemented separately. Some of these curves (Koch named the NIST curves required for NSA Suite B) have a bad reputation, as described in a talk [PDF] by Daniel J. Bernstein and Tanja Lange, for example. Some, including the German government, would prefer the Brainpool curves [PDF] but these, too, are over a decade old. RFC 7748 is moving us toward the current best-available curves, he said, including Curve25519, Curve448 (called by its author the Goldilocks curve because of a deep connection to the golden ratio, φ), and some variants; GnuPG will be using these curves going forward.

Koch showed examples of digital signatures of comparable security, one made with RSA-4096 and one with Ed25519; the latter is about a third of the length of the former. Performance of these new ECC curves is also pretty good. HSM timing data showed that RSA is about 60 times slower than Ed25519 for signing. In response to a question from the audience, Koch agreed that since the RSA timings were from one HSM and the Ed25519 timings from another it is difficult to compare them directly. Nevertheless, the timings from the RSA HSM showed that a doubling of the key length increased the time required to sign nearly six-fold, and the time for verification even more. The timings from the Ed25519 HSM were agreeably small (all sub-50ms).

Security of the private key is clearly much on Koch's mind at the moment. GnuPG 2.2 mandates the use of gpg-agent because private key handling has moved entirely out of gpg itself and into the agent. Most of the complex stuff that gpg does with keys is unknown to the agent, as gpg-agent only performs private-key operations (which, incidentally, it can also do for SSH). This enables some clever tricks such as running gpg on a remote server, perhaps to sign a large tarball, but running gpg-agent locally on a desktop so that all private key material and operations are handled locally. Such an architecture also lends itself to the use of HSMs to securely store private key material.
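
One common recipe for that split (a sketch based on the standard gpg-agent socket layout; the host name is a placeholder and the socket paths vary by distribution and user ID) is to forward the agent's restricted "extra" socket over SSH, so that gpg on the server talks to gpg-agent on the desktop:

```
# ~/.ssh/config on the desktop; "buildserver" is a placeholder host.
# The remote path is where gpg on the server expects its agent socket;
# the local ".extra" socket exposes only a restricted command set.
Host buildserver
    RemoteForward /run/user/1000/gnupg/S.gpg-agent /run/user/1000/gnupg/S.gpg-agent.extra
```

The server's sshd also needs StreamLocalBindUnlink set to yes, so that a stale socket left behind by a previous connection is replaced rather than blocking the forward.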

Historically, using HSMs has meant smartcards, but Koch noted that even though these implement an open standard, they do it with proprietary code, leaving us no current choice but to evaluate the vendors for trustworthiness and then choose a product. Though this is what Koch himself does now, he'd like more freedom, and is working to bring to market the Gnuk token. It is based on an STM32 microcontroller running a fully-free stack and is designed to be the HSM of choice for GnuPG, but bringing it to the European market is proving troublesome. CE certification isn't too bad, he said, but WEEE compliance is proving painful; he's working with Kernel Concepts to try to turn the Gnuk token into something you can buy for €30-35 in its FLOSS-Shop store.

The management of public keys is much on his mind as well. If you want to be completely sure that a key belongs to a particular person, you get a copy of their public key directly from them, or you seek a copy signed by a trusted intermediary. If you don't do that, then right now your only option is the public keyservers. Koch notes we have a good network of these, but it's hard to have any kind of trust in keys on the public servers, absent a fortuitous intermediary signature on such a key or a key fingerprint provided out-of-band by your correspondent. If you search for him on the public servers, Koch said, you get back two keys, one of which is a deliberate fake (I tried on the MIT keyserver, and found three equally plausible-looking keys).

The problem is systemic: the web of trust, he feels, is inherently broken. It is only explicable to geeks, and not even to all of them; it publishes a global social graph, because signatures on keys imply physical meetings on known dates; and it doesn't scale. His preference for general public key handling is Trust On First Use (TOFU). It's easy to explain (we've all been doing it with SSH for years), and it doesn't require any central key-distribution authority, the existence of which he feels would be a betrayal of one of the core PGP principles. It isn't yet GnuPG's default trust model, but it may become so in the future.

If you accept that TOFU is a pragmatically acceptable practice, particularly for securely communicating with new correspondents, then can we do better than the public keyservers for getting their keys? Koch feels that the right place for on-demand key distribution is the email provider. Your email provider has some idea who you are, and you have some kind of contractual relationship with them, however tenuous. In turn, their control over your email address's domain name is reasonably easy to establish through pre-existing channels such as DNS (especially if it's all signed with DNSSEC). So your provider is in the right position to run a server that serves keys, but only for domain names for which they provide email, which preserves the decentralization that Koch finds so desirable. The nascent Web Key Service (WKS) protocol provides a mechanism for provisioning, populating, and querying such a server, and it is supported in GnuPG from 2.1.15 onward. There exist some nice HOWTOs for both setting up the server and using the service with one privacy-aware MUA. Readers keen to experiment with this, particularly at the key-publication end, will probably need to nag their email provider if they are not themselves lucky enough to control their email domain.

Quite a few of Koch's slides were given over to the new, scripting-friendly --quick-* options, but many of these were present in GnuPG 2.1 and so have been covered here before. A couple of his newer examples are worthy of closer examination. At the moment, if you have someone's public key and want to encrypt data using it, you must import it into your keyring, then encrypt. But often the only time I intend to use a particular key is for this one encryption; I'm not going to use it again and I don't want it rattling around in my keyring forever. The use of the -f FILE_WITH_KEY option to read a key from a file without adding it to a local keyring allows me to handle this gracefully, with:

gpg -f FILE_WITH_KEY -e DATAFILE

Despite his earlier comments about the web of trust, Koch also recommended some commands as being particularly useful for the aftermath of key-signing parties. This command will list the keys in a file without actually importing them:

gpg --import-options show-only --import FILE

Once the keys to be signed have been imported, the following command will sign a key without all of the usual interaction that key signing requires, so it is ideal for building a key-signing script:

gpg --quick-sign-key FINGERPRINT [NAMES]

In summary, Koch said that GnuPG 2.2 brings modern algorithms, better scriptability, and support for automated key discovery based on a correspondent's email address (if we talk to our providers). He added a couple of warnings (that Debian has a fairly old version, and that Ubuntu LTS has 2.1.11 where some things, including encryption using ECC keys, are broken).

The talk was followed by the traditional charity auction, which this year was in support of GnuPG. Michael Kerrisk, Aurélien Rougemont, Sylvain Baubeau, and Eric Leblond each donated a large sum of money in exchange for a 1.5l bottle of beer which all the speakers had signed. Peter Senna Tschudin went well into three figures for the 2l ornate flask of beer, also signed. All five deserve mention in dispatches; this will have to do.

[We would like to thank LWN's travel sponsor, The Linux Foundation, for assistance with travel funding for Kernel Recipes.]

Comments (15 posted)

An attacker who seeks to compromise a running kernel by overwriting kernel data structures or forcing a jump to specific kernel code must, in either case, have some idea of where the target objects are in memory. Techniques like kernel address-space layout randomization have been created in the hope of denying that knowledge, but that effort is wasted if the kernel leaks information about where it has been placed in memory. Developers have been plugging pointer leaks for years but, as a recent discussion shows, there is still some disagreement over the best way to prevent attackers from learning about the kernel's address-space layout.

There are a number of ways for a kernel pointer value to find its way out to user space, but the most common path by far is the printk() function. There are on the order of 50,000 printk() calls in the kernel, any of which might include the value of a kernel pointer. Other places in the kernel use the underlying vsprintf() mechanism to format data for virtual files; they, too, often leak pointer values. A blanket ban on printing pointer values could solve this problem — if it could be properly enforced — but it would also prevent printing such values when they are really needed. Debugging kernel problems is one obvious use case for printing pointers, but there are others.

The approach that has been taken in the kernel is to try to identify the places where kernel pointers are printed and, perhaps, censor that information on its way to user space. The special "%pK" formatting directive (added in 2011) should be used to print kernel pointers; the formatting code will, among other things, be sure to format them correctly regardless of the architecture the kernel is running on. This directive also interacts with the kptr_restrict sysctl knob, though. If that knob is set to zero (as it is by default), kernel addresses are printed unchanged. Setting it to one will cause kernel pointers to be printed as all zeroes unless the current process is running with privilege; setting it to two wipes all kernel addresses unconditionally.
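
The knob's effect on "%pK" can be summarized as a small decision table; this sketch just encodes the policy described above (the "privileged" flag stands in for the kernel's actual capability check, and this is illustrative shell, not kernel code):

```shell
# Whether a %pK pointer is shown or zeroed, as a function of the
# kptr_restrict setting and the privilege of the reading process.
kptr_shown() {
    knob=$1         # 0, 1, or 2
    privileged=$2   # yes or no
    case $knob in
        0) echo shown ;;    # no filtering at all (the default)
        1) if [ "$privileged" = yes ]; then
               echo shown    # privileged readers see real addresses
           else
               echo zeroed   # everyone else gets zeroes
           fi ;;
        *) echo zeroed ;;    # 2: hidden unconditionally
    esac
}
```

The value can be inspected and changed at runtime through /proc/sys/kernel/kptr_restrict.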

One can immediately pick out some shortcomings in this scheme. It is an opt-in mechanism that depends on all kernel developers properly marking the places where they print kernel pointers. It depends on the credentials of the running process; that makes sense for situations like reading a /proc file (which doesn't use printk() but does use the underlying formatting support), but it's less useful for the many places in the kernel that call printk() in response to an asynchronous event. It also can allow other types of possibly sensitive addresses (physical addresses, for example) to be exposed.

Tobin Harding recently tried to improve the situation with a patch set tightening up the printing of pointer values in general. It made a few specific changes:

- It adds two new values for kptr_restrict. A setting of three will prevent the printing of pointer values with the unadorned "%p" directive. In theory, no kernel pointers should be printed that way, but the real world is not so ideal. Setting kptr_restrict to four will also prevent the printing of physical address values (those printed with "%pa", "%pr", and "%pR").

- The default value of kptr_restrict is changed to four as a way of preventing address leaks during the early boot process.

- A new "%pP" directive indicates a pointer value that should always be printed regardless of the setting of kptr_restrict. The initial use for this directive is in the printing of stack traces.

- There is also a new unconditional version of "%pa" ("%paP", along with "%padP" for DMA addresses and "%papP" as a synonym for "%paP"). Some user-space UIO drivers need that information.

There were some immediate concerns about defaulting kptr_restrict to the most restrictive setting. It seems certain to make life difficult for developers trying to debug problems that show up early in the bootstrap process. As Linus Torvalds noted, that could lead to developers circumventing the mechanism entirely by using something like " %x " to print pointer values. Options like setting the default value in the kernel configuration or on the command line were discussed, but the discussion quickly took a different turn.

Torvalds also complained that the entire kptr_restrict mechanism is the wrong approach to the problem. The read-time capability test does not always make sense and, he said, a global switch is the wrong way to handle the problem. Attempts to make that switch more restrictive by default have run into trouble in the past and have been backed out as a result. The proper solution, Torvalds said, is to simply fix all of the places in the kernel that leak addresses.

That is, of course, easier said than done, as Jordan Glover remarked: "If we knew where those leaks are hiding they will be fixed already." It is better, he said, to assume that there will be leaks in the kernel and try to block them all at once. But Torvalds believes that the same effect as a restrictive kptr_restrict setting could be achieved by searching for (and fixing) every use of unadorned " %p " directives in the kernel. It would be a fair amount of work, but perhaps much of it could be scripted.
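A first pass at that kind of script could be a simple pattern match over the source tree. This sketch is a deliberately crude example; the directory is an illustrative choice, and a real audit would need to treat the many extended specifiers and multi-line format strings more carefully than a one-line grep can.

```shell
# Crude scan for unadorned "%p" in kernel source files. Requiring a
# non-alphanumeric character after "%p" skips extended specifiers
# like "%pK" and "%pS"; the directory is an illustrative choice.
grep -rnE '%p[^a-zA-Z0-9]' --include='*.c' drivers/ | head
```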

He also suggested that many of the problems could be found by searching for addresses showing up in actual log files. Companies like Google, he said, probably have a lot of kernel logs sitting around; searching them for addresses should quickly reveal the real problems, which can then be fixed. He demonstrated with the log from his own system, which included a physical address printed with " %x " and which, thus, would not have been redacted by the proposed patches. And there are even more paths for kernel addresses to leak; he mentioned a case where the netfilter code was using an address in a slab name, which then showed up in the kernel's slab statistics.
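A search of that sort might look like the sketch below, which pulls candidate x86-64 kernel addresses (canonical kernel-space values begin with ffff) out of a saved log; the log path and the pattern are illustrative assumptions, and other architectures would need different patterns.

```shell
# Extract strings shaped like x86-64 kernel addresses from a log.
# /var/log/kern.log is an illustrative path.
grep -E -o '\bffff[0-9a-f]{12}\b' /var/log/kern.log | sort -u
```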

Some patch submitters would have been dismayed by this response. Harding, instead, responded that this project "sounds like just the job for an upcoming kernel hacker, with a lot of time and not much experience, to do something laborious that no one else wants to do and learn a bunch about the kernel." It seems that he is that kernel hacker; he went on to propose dropping the patch set in favor of tracking down and fixing the actual leaks, and added "I'm super keen to work".

So that seems to be the likely direction for this work. Some of the existing patches will probably get into the kernel eventually, though; there is value in identifying the types of all addresses being printed. Maybe, someday, the bare " %p " directive will disappear. That will not happen in the immediate future, though; there are a lot of call sites to fix first.

Comments (6 posted)

The kernel's timer interface has been around for a long time, and its API shows it. Beyond a lack of conformance with current in-kernel interface patterns, the timer API is not as efficient as it could be and stands in the way of ongoing kernel-hardening efforts. A late addition to the 4.14 kernel paves the way toward a wholesale change of this API to address these problems.

It is worth noting that the kernel has two core timer mechanisms. One of those, the high-resolution timer (or "hrtimer") subsystem, is focused on near-term events where the timer is expected to run to completion. The other subsystem is just called "kernel timers"; it offers less precision but is more efficient in situations where the timer will probably be canceled before it fires. There are many places in the kernel where timers are used to detect when a device or a network peer has failed to respond within the expected time; when, as usual, the expected response does happen, the timer is canceled. Kernel timers are well suited to that kind of use. The work at hand focuses on that second type of timer.

Kernel timers are described by the timer_list structure, defined in <linux/timer.h>:

struct timer_list {
    unsigned long expires;
    void (*function)(unsigned long);
    unsigned long data;
    /* ... other stuff elided ... */
};

The expires field contains the expiration time of the timer (in jiffies); on expiration, function() will be called with the given data value. It is possible to fill in a timer_list structure manually, but it is more common to use the setup_timer() macro:

void setup_timer(timer, function, data);

There are a number of issues with this API, as argued by Kees Cook. The data field bloats the timer_list structure unnecessarily and, as an unadorned unsigned long value, it resists any sort of type checking. It is not uncommon for callers to cast pointer values to and from this value, for example. For these reasons, it is far more common in current kernel APIs to dispense with the data field and, instead, just pass a pointer to the relevant structure (the timer_list structure in this case) to the callback. Likely as not, that structure is embedded within a larger structure containing the information the callback needs anyway, so a simple container_of() call can replace the casting of the unsigned long value.

As might be expected, though, Cook has concerns about this API that go beyond matching the current kernel style. One of those is that a buffer overflow in the area of a timer_list structure may be able to overwrite both the function pointer and the data passed to the called function, allowing arbitrary calls within the kernel. That, naturally, makes timer_list structures interesting to attackers, and explains why Cook has been trying to harden timers for a while. The prototype of the timer callback, containing a single unsigned long argument, is also evidently an impediment to "future control flow integrity work". It would be better if the callback had a unique prototype that was visibly different from all of the other kernel functions taking an unsigned long argument.

Cook has been working on changes to the timer interface for a while in an attempt to address these issues. The core idea is simple: get rid of the data value and just pass the timer_list structure to the timeout function. The actual transition, though, is complicated by the existence of 800 or so setup_timer() call sites in the kernel now. Trying to change them all at once would not be anybody's idea of fun, so a phased approach is needed.

In this case, Cook has introduced a new function for the initialization of timers:

void timer_setup(struct timer_list *timer, void (*callback)(struct timer_list *), unsigned int flags);

For the time being, timer_setup() simply stores a pointer to timer in the data field. Note that the prototype of the callback has changed to expect the timer_list pointer.

With that function in place, calls to setup_timer() can be replaced at leisure, as long as each corresponding timer callback function is adjusted accordingly. For the most part, as can be seen in this example, the changes are trivial. Many timer callbacks already were casting the data value to a pointer to the structure they needed; they just need a one-line change to obtain that from the timer_list pointer instead. A new from_timer() macro has been added to make those conversions a bit less verbose.

The addition of timer_setup() was merged just prior to the 4.14-rc3 release — rather later in the release cycle than one would ordinarily expect to see the addition of new interfaces. The purpose of this timing was clear enough: it clears the way for the conversion of all of those setup_timer() calls, a task which, it is hoped, will be completed for the 4.15 kernel release. Once that is done, the underlying implementation can be changed to drop the data value and the setup_timer() interface can be removed entirely. At the end, the kernel will be equipped with a timer mechanism that is a little bit more efficient, more like other in-kernel APIs, and easier to secure.

Comments (15 posted)

While the 4.14 development cycle has not been the busiest ever (12,500 changesets merged as of this writing, slightly more than 4.13 at this stage of the cycle), it has been seen as a rougher experience than its predecessors. There are all kinds of reasons why one cycle might be smoother than another, but it is not unreasonable to wonder whether the fact that 4.14 is a long-term support (LTS) release has affected how this cycle has gone. Indeed, when he released 4.14-rc3, Linus Torvalds complained that this cycle was more painful than most, and suggested that the long-term support status may be a part of the problem. A couple of recent pulls into the mainline highlight the pressures that, increasingly, apply to LTS releases.

As was discussed in this article, the 4.14 kernel will include some changes to the kernel timer API aimed at making it more efficient, more like contemporary in-kernel APIs, and easier to harden. While API changes are normally confined to the merge window, this change was pulled into the mainline for the 4.14-rc3 release. The late merge has led to a small amount of grumbling in the community.

The problem isn't necessarily the addition of timer_setup() which, on its own, cannot really break anything. But that addition has been followed by a series of conversions to the new interfaces, which are being sent to the relevant maintainers. Accepting a timer_setup() conversion into a maintainer tree will only work if that tree has timer_setup() itself; that implies that the maintainer tree must be current with the mainline as recently as 4.14-rc3. Many subsystem maintainers branch from the mainline around -rc1 or -rc2, so they won't be able to apply the conversion patches unless they perform a separate merge first. The merge is not usually hard, but subsystem trees containing "back merges" with the mainline can run into trouble during the merge window, so maintainers have understandably become leery of them.

In this case, the grumbling is already done, and the conversion to the new timer API can be expected to be completed on schedule in 4.15. And, perhaps more to the point, those who want to backport a bunch of conversions to 4.14 (so as to have them in a long-term supported kernel that is likely to be shipped in many mobile devices) will have a much easier task of it. It was never explicitly said that 4.14, in particular, was an important target for this work, but it seems unlikely that it wasn't in developers' minds.

In another case, things were more explicit. Thomas Gleixner recently sent a pull request for a significant refactoring of the watchdog timer subsystem; it was a reworked version of a patch set that had been refused by Torvalds during the merge window. Part of the reasoning for requesting a pull this late in the development cycle was a desire to get the work into this release in particular:

As 4.14 is a long term stable kernel, I prefer to have working watchdog code in that and the lockdep issues resolved. I wouldn't ask you to pull if 4.14 wouldn't be a LTS kernel or if the solution would be easy to backport.

Stable kernel maintainer Greg Kroah-Hartman complained about that request: "This is exactly what I did _NOT_ want to ever see happen when I did the 'let's announce the LTS kernels ahead of time'". He suggested that perhaps future long-term support kernels will return to post-release announcements. That notwithstanding, Torvalds pulled the changes for 4.14-rc4 without comment.

This work almost certainly will not break the 4.14 kernel; it was essentially ready during the merge window. But it does show that the LTS release is motivating pull requests that might have otherwise waited another cycle. That is not how things were supposed to work; part of the idea behind a nine-week release cycle was that, since the cost of missing one cycle was minimal, there would no longer be any great incentive to hurry code into any particular release. It is clear, though, that this incentive has not entirely gone away; indeed, it may be getting stronger.

For those who are inside the kernel community, one development cycle looks much like the next. But, for those making use of the kernel, all kernel releases are decidedly not equal. The release that they actually plan to ship is the one that they care about. There has been a determined effort to encourage the industry to ship the LTS kernels in the hope of improving the support for deployed kernels in general. This effort has seen some success, which is a positive change, but it does tend to focus even more attention on the LTS releases. That can only result in more pressure to get features into those releases.

In a sense, the situation resembles how things worked before 2.6 came out: major kernel releases were separated by years, so there was immense pressure to get features in before the deadline. As the LTS kernels become more widely used, they start to look like the major releases of old. The LTS releases are the ones that everybody wants to get their features into, and they only happen once each year. Missing an LTS release means waiting a year for a feature to make it into the next LTS release and, probably, maintaining it out-of-tree for products shipped in the meantime. It's not surprising that the idea of getting code into the mainline sooner, even if it requires fixing later, has some appeal, but there is a cost to doing things that way. As Kroah-Hartman said: "We've been down this path before, and it was not good".

That said, the kernel development community has changed considerably since the adoption of the short release cycle. Code is generally of a much higher quality at the time it is merged into the mainline. So if a bit more of it is jostling to get into 4.14, the result may be a more turbulent development cycle. It should not, however, replicate the situation of 15-20 years ago, where a "stable" kernel release would require another year to truly stabilize. We are probably not at risk of repeating the misery of the early 2.x years.

In the cases described here, the quality of the code being merged is not in question. It is really just a matter of the timing, and the discussion wound down quickly. But this topic can be expected to return. Neither the pressure to get changes into LTS releases nor human desire to game the system will go away, even if the pre-announcement of LTS releases comes to an end.

Comments (13 posted)