This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.


A focus on privacy is a key feature being touted by a number of different projects these days—from KDE to Tails to Nextcloud. One of the biggest privacy leaks for most people is their phone, so it is no surprise that there are projects looking to address that as well. A new entrant in that category is eelo, a non-profit project aimed at producing not only a phone, but also a suite of web services. All of that could potentially replace the Google or Apple mothership, which tends to collect as much personal data as possible.

Eelo is the brainchild of Gaël Duval, who also founded Mandrake Linux and Ulteo. In a November 2017 blog post, he noted that he has exclusively used iPhones since 2007 and over the past few years he had migrated to macOS, while using Google services extensively on both. That didn't sit well with him:

But talking with friends this year, I realized that I had become lazy and that my data privacy had vanished. Not only I wasn't using Linux anymore as my main operating system, but I was also using a proprietary OS on my smartphone. And I was using Google more and more. Search of course, but also Google Mail, Google drive and Google docs. And Google Maps.

He looked at various existing free-software mobile phone options but, as is too often the case, it seems, found that nothing really fit his needs. For one thing, he is completely sold on the iOS-style interface, while many free-software choices lean more toward an Android-like interface. That led him to found eelo, which, based on his concerns about his privacy, would need to be far more than simply yet another mobile-phone user interface.

Trying to get iOS apps running on a non-Apple phone is obviously a non-starter, however. Trying to bootstrap an entire app ecosystem is likely somewhere between hard and impossible as well. The obvious choice is to support Android apps, which can be installed from alternative repositories (e.g. F-Droid and APKPure), rather than relying on the Google Play store (and the privacy and other implications that go along with that). Duval chose to base eelo on LineageOS, which is the open-source project that rose from the ashes of CyanogenMod. In addition, LineageOS has microG, which allows apps to use Google's Play Services API, but without the binary blob.

But if privacy is the goal, there is far more to it than supporting an app ecosystem (which may have numerous privacy-dubious apps in any case). Google is able to collect so much information from Android users because the apps communicate with the mothership: mail, contacts, search, browser auto-complete, maps, storage, and so on. Some of those services will be difficult to replicate in a privacy-preserving way, but Duval has ambitious plans. For the most part, eelo will use various free and open solutions, such as DuckDuckGo for search, OpenStreetMap, ONLYOFFICE, Nextcloud, and so on.

There is already some progress on the user-interface front. The default LineageOS interface was not to Duval's liking, so he and another developer have started working on the "BlissLauncher" with a far different look and feel; there are plans to add a notification scheme and a "control center" for device and app settings and controls.

But, of course, it is hard to bring up an ambitious project without some funding. To that end, Duval created a Kickstarter fundraising effort to raise a modest €25,000 for getting the project going. It hardly seems like enough for the goals he sets out:

We need to bootstrap the project and pay developers for initial development work to reach a first viable "privacy-enabled" eelo product that users will be able to install on their own phones or order on quantity-limited pre-installed eelo smartphones. This functioning product would include:

- a mobile operating system with new default applications and new user interface
- integrated basic web services (search, cloud storage, settings recovery)
- updates for 3 years or more, with full respect of user privacy.

Then we will be able to attract more contributors and the project will become naturally sustainable as it iterates to new releases.

That seems a tad optimistic, at best, but the Kickstarter is approaching triple the original goal with ten days to run as of this writing. The new goal is €100,000, which will allow more development effort and thus more first-release features. Even that seems like a tight budget to produce what Duval hopes for, but we will have to wait and see. Obviously, his open letter to Elon Musk is an attempt to change the funding situation in a big way.

For now, there are two phones being used for development, the LeEco Le2 and the Xiaomi Mi 5S, but others are on the horizon. Anything that LineageOS supports would seem to be fair game and Duval is taking suggestions.

The project is looking for volunteers, of course, but is also offering some paid positions. There are a number of different specialties listed, from Android developers to web developers to Git wranglers and Mono programmers. It is hard to say how long the positions might last (or what they will pay), but it is a bit different from most startup free-software projects.

There is another aspect to eelo, part of its mission as a "company in the public interest", which is to help users understand why privacy is important and, thus, why eelo is needed:

eelo's mission includes informing users about all the challenges behind their data privacy. We want to tell the story about what the web giants and state agencies are doing with your personal data - millions of dollars for shareholders, mass surveillance etc. - and why all that is a threat to democracy and peace. eelo.io is going to be a central place for information about user's data privacy.

It all sounds like a worthy set of goals; the project is, to the extent they fit Duval's sensibilities, reusing parts and pieces from other free-software projects. Could it be a viable competitor in the existing mobile-phone ecosystem? The track record of other efforts in this arena is not encouraging; none has truly been successful. Apple and Google are not likely to be shaking in their boots over eelo or any other alternative, sadly. The main impediment seems to be consumer interest: without users who care about protecting their privacy, the existing players will continue to dominate.


The Meltdown/Spectre debacle has, deservedly, reached the mainstream press and, likely, most of the public that has even a remote interest in computers and security. It only took a day or so from the accelerated disclosure date of January 3—it was originally scheduled for January 9—before the bugs were making big headlines. But Spectre has been known for at least six months and Meltdown for nearly as long—at least to some in the industry. Others that were affected were completely blindsided by the announcements and have joined the scramble to mitigate these hardware bugs before they bite users. Whatever else can be said about Meltdown and Spectre, the handling (or, in truth, mishandling) of this whole incident has been a horrific failure.

For those just tuning in, Meltdown and Spectre are two types of hardware bugs that affect most modern CPUs. They allow attackers to cause the CPU to do speculative execution of code, while timing memory accesses to deduce what has or has not been cached, to disclose the contents of memory. These disclosures can span various security boundaries such as between user space and the kernel or between guest operating systems running in virtual machines. For more information, see the LWN article on the flaws and the blog post by Raspberry Pi founder Eben Upton that well describes modern CPU architectures and speculative execution to explain why the Raspberry Pi is not affected.

Given the nature of the bugs, and that cache-timing side channels have been known for some time, it is a bit surprising that these longstanding flaws were all found relatively recently. In fact, it would appear that they were all found by independent teams (three for Meltdown and two for Spectre) at more or less the same time. That worryingly suggests that black-hat attackers and/or government spy agencies may have gotten there first—maybe by years. There is no evidence of that occurring, but attacks using the flaws would not be easy to detect, though use of the information disclosed might have set off some alarm bells.

Discovery and disclosure

Jann Horn of Google's Project Zero discovered both of the flaws as part of his research; he detailed them in a lengthy blog post. In that post, he noted that Spectre was disclosed to Intel, AMD, and ARM on June 1, but that Meltdown was only reported later. As Jan Wildeboer's extensive timeline shows, Meltdown was found by Horn (and two academic teams) sometime before the July 28 publication of "Negative Result: Reading Kernel Memory From User Mode" by Anders Fogh, which described the Meltdown technique. From the just-released Project Zero bug report, which has lots of good information and proof-of-concept code, we see that Horn refers to "variant 3" (which is Meltdown) on June 22.

After those disclosures, Intel (and perhaps the other CPU vendors) started alerting some of its customers and other interested parties under a non-disclosure agreement (NDA). For Ubuntu, that happened on November 9, according to its timeline. Other Linux distributions were also alerted, though it seems possible that some distributions and, probably, cloud providers were notified earlier. Apple and Microsoft may well have gotten earlier notice too; Apple released fixes for Meltdown on several of its operating systems in mid-December. Since there is, as yet, no timeline from Intel or the other CPU vendors, we are left guessing on who got notified and when.

We do know that many Linux distributions were left out in the cold, as was the BSD community. Smaller cloud companies and others with less clout were left out as well—multiple tier-2 cloud providers have formed a group to help cope with the fallout. As might be guessed, that has left some less than entirely impressed with the whole process.

KAISER and KPTI

The bugs were embargoed, but that started to break down as hints about the flaws were published. Before Fogh's post, Daniel Gruss and others at Graz University of Technology published a paper [PDF] that described ways to break kernel address-space layout randomization (KASLR). It also described a mechanism to avoid those problems, called KAISER, that would unmap kernel memory before entering user-space code. A patch set to implement KAISER was posted to the linux-kernel mailing list on October 31; it was based on the work by Gruss and his team and forward-ported by Dave Hansen.

Those patches were undoubtedly not meant as a Halloween gift to security researchers, but may well have served that purpose. Normally, a patch set that radically changed the memory layout of the running kernel at a fairly high performance cost—ostensibly simply to avoid KASLR breaks—would find tough sledding, but something was clearly different with KAISER. After all, it has never been particularly difficult for attackers to break KASLR and preventing that has never been a "drop everything and fix it" kind of problem. But KAISER was treated as an urgent fix.

The KAISER patches fairly quickly morphed into kernel page-table isolation (KPTI) and were merged for the 4.15-rc6 release, which is unprecedented in recent times; it was also picked up for the 4.14 stable kernel series. Clearly, something important was afoot. That led to lots of discussion about what the real bug is, here and elsewhere. A patch from Tom Lendacky to turn off KPTI for AMD processors led to speculation about speculation (that is, speculative execution). KPTI addressed the Meltdown vulnerability, but Spectre was "unknown" outside of certain rarefied circles until the dam broke on January 3.

When that happened, it could plainly be seen that the state of the mitigations for Spectre was far behind KPTI. There were multiple patches by authors at different companies, some that didn't apply to any known kernel tree, some that didn't compile, and so on. It was clearly the result of the embargo; instead of working together, various organizations were working on their own. The Spectre flaws are definitely harder to mitigate, but not coordinating the thinking and fixes certainly has not helped matters. So far, no fixes for Spectre have made it into the mainline. One would guess that will change shortly after the release of 4.15, which should come in mid-to-late January.

Grumbling

There has been a fair amount of grumbling about how this process has played out. Without pointing fingers, Greg Kroah-Hartman stated his opinion in a status update on the bug fixes:

As for how this was all handled by the companies involved, well this could be described as a textbook example of how NOT to interact with the Linux kernel community properly. The people and companies involved know what happened, and I'm sure it will all come out eventually, but right now we need to focus on fixing the issues involved, and not pointing blame, no matter how much we want to.

But it is clear that at least parts of the Linux kernel community were aware of Meltdown, at least, as far back as the original posting of the KAISER patches. Other open-source operating systems got no warning at all, other than perhaps some background rumbling because of KAISER/KPTI. As OpenBSD developer Philip Guenther put it:

We have received *no* non-public information. I've seen posts elsewhere by other *BSD people implying that they receive little or no prior warning, so I have no reason to believe this was specific to OpenBSD and/or our philosophy. Personally, I do find it....amusing? that public announcements were moved up after the issue was deduced from development discussions and commits to a different open source OS project. Aren't we all glad that this was under embargo and strongly believe in the future value of embargoes?

Meanwhile, Linux distributors have been working up their fixes, which typically need to be done to older kernels, especially for the enterprise distributions. Because of the lead time needed for QA and the like, enterprise distributions have backported the KAISER work into their kernels, rather than the more recent KPTI work. That led Kroah-Hartman to follow their lead for the 4.4 and 4.9 stable trees in order to try to be "bug for bug compatible" with the enterprise distributions. There are, it seems, lots of users of kernel series that are no longer supported out there. Meltdown and Spectre provide more evidence, if it really is needed, that continuing to run old, out-of-date kernels is a terrifically risky thing to do.

Essentially all of the work that has been merged so far is for the x86 architecture, though others are affected to various degrees. ARM64 is the most prominent of the affected architectures, though it is only affected by Meltdown in its high-end processors (AMD CPUs are not affected by Meltdown at all). Patches to fix the Meltdown problem on the ARM64 processors where it does exist are in progress, but will not be merged until the 4.16 merge window. That means they are not available to be added to stable trees currently, so Kroah-Hartman suggests the Common Android kernel tree, which has branches that incorporate the fixes. It is likely that the 4.4 and 4.9 stable trees will never get those fixes, he said:

For the 4.4 and 4.9 LTS kernels, odds are these patches will never get merged into them, due to the large number of prerequisite patches required. All of those prerequisite patches have been long merged and tested in the android-common kernels, so I think it is a better idea to just rely on those kernel branches instead of the LTS release for ARM systems at this point in time.

According to Kroah-Hartman, other operating systems (e.g. Windows, macOS) do not yet have full Spectre fixes either. Other architectures have not had fixes for either Meltdown or Spectre, though some are believed vulnerable. The name "Spectre" was chosen, in part, because the researchers believed "it will haunt us for quite some time". That looks prescient.

Lessons learned?

There has been additional speculation that other, related problems are either known or will be discovered before long. Regardless, it seems well-nigh certain that CPU security bugs of some kind will be found, so it behooves the industry to try to figure out a less chaotic, and more successful, strategy for handling these kinds of problems.

Much of the ire has been directed toward Intel, which seems to have taken the lead on the response to these bugs. In response to one of the first Spectre-fix postings, Linus Torvalds was characteristically blunt:

I think somebody inside of Intel needs to really take a long hard look at their CPU's, and actually admit that they have issues instead of writing PR blurbs that say that everything works as designed. .. and that really means that all these mitigation patches should be written with "not all CPU's are crap" in mind. Or is Intel basically saying "we are committed to selling you shit forever and ever, and never fixing anything"? Because if that's the case, maybe we should start looking towards the ARM64 people more.

Thomas Gleixner, who was singled out for praise by Torvalds for his work in getting the KPTI patches into shape for merging, also did not mince words. The Spectre bugs have been "fixed" in some distribution kernels, but Gleixner was not impressed with what was done there and is concerned that proper solutions are not being thought out in the name of expediency, which will have long-term implications:

We have neither the basic mitigations in place nor has anyone fully understood the implications and possible further issues. [...] I've seen the insanities which were crammed into the distro kernels, which have sysctls and whatever, but at the same time these kernels shipped in a haste do not even boot on a specific class of machines. Great engineering work.

He decried the "disgusting big corporate games" that required him and others to go into panic mode over the past few months, when the chip vendors knew about the problems months before the kernel community was engaged (to the extent it was). He noted that "brain" is not an acronym for "Big Revenue All Intelligence Nuked" but it certainly appears to him that it is being treated that way.

In general, the normal kernel review cycle is taking place for the Spectre fixes, but there are some wrinkles. Now that the vulnerabilities are public, there is increased pressure to get some kind of fix, even one with an enormous performance penalty, deployed in the near term. That runs against "normal" kernel thinking; Torvalds and others were not persuaded by a "security trumps performance" argument. Few, if any, disagree in the abstract, but Gleixner and others would like to roll out the expedient fixes; after that, optimizations can be examined:

The exploits are out in the wild and they are observed already, so we really have to make a decision whether we want to fix that in the most obvious ways even if it hurts performance right now and then take a break from all that hell and sit down and sort the performance mess or whether we want to discuss the right way to fix it for the next 3 month and leave all doors open until the last bit of performance is squeezed out.

He suggested that a "performance first" attitude might lead him to taking some overdue vacation time. But James Bottomley said that this is all part of the normal kernel review process, which generally leads to better and faster solutions. Alan Cox agreed, but feels the urgency for an immediate solution: "I'd just prefer most users machines are secure while we have the discussion and while we see what other architectural tricks people can come up with".

All of that discussion is pretty normal stuff for the kernel mailing list. The problem is that it is happening after the disclosure of the bugs, so there is a big (and highly public) clamor. Security bugs are frequently handled privately by the team behind security@kernel.org, but that mechanism apparently was not used here. As Kroah-Hartman outlined, that led to this sad state of affairs:

I will note that the "controlled disclosure" for this thing was a total and complete mess, and unlike any that I have ever seen in the past. The people involved in running it had no idea how to do it at all, and because of that, it failed miserably, despite being warned about it numerous times by numerous people. [...] Because that group was so small and isolated that they did not actually talk to anyone who could actually provide input to help deal with the bug. So we are stuck now with dealing with this "properly", which is fine, but please don't think that this is an excuse to blame "controlled disclosure". We know how to do that correctly, it did not happen in this case at all because of the people driving the problem refused to do it.

We will be digging out from Meltdown and Spectre for a long time. With luck, those responsible have learned that there are better ways to handle bugs of this nature (or any other) in the future. It's vanishingly unlikely that we won't hit this situation again—CPUs are extremely complex beasts and will only get more so as nearly anything gets sacrificed on the altar of performance. The CPU vendors—and any who enabled them (e.g. Google, Microsoft, Apple, Amazon, and so on)—should take a hard look at their practices and this incident in particular. Hopefully, they (and we) will learn something from it moving forward.

We also owe a huge debt of gratitude to all of the different folks who worked on getting us this far. Many of them worked over the holidays when they might have been doing something much more fun than cleaning up this mess. By delaying the involvement of the kernel team—and setting January 9 as the coordinated release date—whoever was driving the disclosure bus did no one any favors. Six months is an enormous window for an embargo; it is somewhat surprising that it held up as long as it did. In the end, though, that embargo length may well have given attackers a longer run time; it definitely turned the kernel-fixing piece into a fiasco. Some rethinking is clearly in order.


When the Meltdown and Spectre vulnerabilities were disclosed on January 3, attention quickly turned to mitigations. There was already a clear defense against Meltdown in the form of kernel page-table isolation (KPTI), but the defenses against the two Spectre variants had not been developed in public and still do not exist in the mainline kernel. Initial versions of proposed defenses have now been disclosed. The resulting picture shows what has been done to fend off Spectre-based attacks in the near future, but the situation remains chaotic, to put it mildly.

First, a couple of notes with regard to Meltdown. KPTI has been merged for the 4.15 release, followed by a steady trickle of fixes that is undoubtedly not yet finished. The X86_BUG_CPU_INSECURE processor bit is being renamed to X86_BUG_CPU_MELTDOWN now that the details are public; there will be bug flags for the other two variants added in the near future. 4.9.75 and 4.4.110 have been released with their own KPTI variants. The older kernels do not have mainline KPTI, though; instead, they have a backport of the older KAISER patches that more closely matches what distributors shipped. Those backports have not fully stabilized yet either. KPTI patches for ARM are circulating, but have not yet been merged.

Variant 1

The first Spectre vulnerability, known as "variant 1", "bounds-check bypass", or CVE-2017-5753, takes advantage of speculative execution to circumvent bounds checks. Consider the following pseudocode sequence:

    if (within_bounds(index)) {
        value = array[index];
        if (some_function_of(value))
            execute_externally_visible_action();
    }

The body of the outer if statement should only be executed if index is within bounds. But it is possible that this body will be executed speculatively before the bounds check completes. If index is controlled by an attacker, the result could be a reference far beyond the end of array. The resulting value will never be directly visible to the attacker, but if the target code performs some action based on the value, it may leave traces somewhere where the attacker can find them — by timing memory accesses to determine the state of the memory cache, for example.

The best solution here (and for the other variants too) would be for the processor to completely clean up the results of a failed speculation, but that's not in the cards anytime soon. So the approach being taken is to prevent speculative execution after important bounds tests in the kernel. An early patch, never posted for public review, created a new barrier macro called osb() and sprinkled calls to it in places where they appeared to be necessary. In the pseudocode above, the osb() call would be placed immediately after the first if statement.

It would appear that this is not the approach that will be taken in the mainline, though, judging from this patch set from Mark Rutland. Rather than place barriers after tests, this series creates a set of helper macros applied to the pointer and array references instead. The documentation describes them in detail. For the example above, the second line would become:

    int *element = nospec_array_ptr(array, index, array_size);
    if (element)
        value = *element;
    else
        /* Handle out-of-bounds index */

If the index is less than the given array_size, a pointer to the indicated value — &array[index] — will be returned; otherwise a null pointer is returned. The macro contains whatever architecture-specific magic is needed to prevent speculative execution of the pointer-dereferencing operation. This magic is supported by new directives being added to the GCC and LLVM compilers.

Earlier efforts had included a separate if_nospec macro that would replace the if statement directly. After discussion, though, its author (Dan Williams) decided to drop it and use the dereferencing macros instead.

These macros can protect against variant 1 — if they are placed in the correct locations. As Linus Torvalds noted, that is where things get a bit sticky:

I'm much less worried about these "nospec_load/if" macros, than I am about having a sane way to determine when they should be needed. Is there such a sane model right now, or are we talking "people will randomly add these based on strong feelings"?

Finding exploitable code sequences in the kernel is not an easy task; the kernel is large and makes use of a lot of values supplied by user space. It appears that speculative execution can proceed for sequences as long as "180 or so simple instructions", which means that the vulnerable test and subsequent reference can be far apart — even in different functions. Identifying such sequences is hard, and preventing the introduction of new ones in the future may even be harder.

It seems that the proprietary Coverity checker was used to find the spots for which there are patches to date. That is less than ideal going forward, since most developers do not have access to Coverity. The situation may not improve anytime soon, though. Some developers have suggested using Coccinelle, but Julia Lawall, the creator of Coccinelle, has concluded that the task is too complex for that tool.

One final area of concern regarding variant 1 is the BPF virtual machine. Since BPF allows user space to load (and execute) code in kernel space, it can be used to create vulnerable code patterns. The early patches added speculation barriers to the BPF interpreter and JIT compiler, but it appears that they are not enough to solve the problem. Instead, changes to BPF are being considered to prevent possibilities for speculative execution from being created.

Variant 2

Attacks using variant 1 depend on the existence of a vulnerable code sequence that is conveniently accessible from user space. Variant 2 (or "branch target injection", CVE-2017-5715), instead, depends on poisoning the processor's branch-prediction mechanism so that indirect jumps (calls via a function pointer, for example) will, under speculative execution, be redirected to an attacker-chosen location. As a result, a useful sequence of code (a "gadget") anywhere in the kernel can be made to run speculatively on demand. This attack can also be performed across processes in user space, meaning that it can be used to access data outside of a JavaScript sandbox in a web browser, for example.

There are two different variant-2 defenses in circulation, in multiple versions. Complete protection of systems will likely involve some combination of both, at least in the near future.

The first of those is a processor microcode update giving the operating system more control over the use of the branch-prediction buffer. The new feature is called IBRS, standing for "indirect branch restricted speculation". It takes the form of a new bit in a model-specific register (MSR) that, when written, effectively clears the buffer, preventing the poisoning attack. A patch set enabling IBRS usage in the kernel has been posted but, in an example of the rushed nature of much of this work, the patches did not compile and had clearly not been run in their posted form.

The alternative approach is a hackaround termed a "return trampoline" or "retpoline"; this mechanism is well described in this Google page (which also suggests that we should "imagine speculative execution as an overly energetic 7-year old that we must now build a warehouse of trampolines around"). A retpoline replaces an indirect jump or indirect function call with a sequence of operations that, in short, puts the target address onto the call stack, then uses a return instruction to "return" to the function to be called. This dance prevents speculative execution of the call; it's essentially a return-oriented programming attack against the branch predictor. The performance cost of using this mechanism is estimated at 0-1.5%.
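A commented sketch of a retpoline standing in for an indirect jump through %r11, modeled on the sequence in Google's published write-up (the label names here are illustrative), looks like this:

```asm
	call	.Lset_up_target		# pushes the address of .Lcapture_spec
.Lcapture_spec:
	pause				# speculative execution of the "return"
	lfence				# lands here and spins harmlessly
	jmp	.Lcapture_spec		# instead of a predictor-chosen target
.Lset_up_target:
	mov	%r11, (%rsp)		# overwrite the return address on the
					# stack with the real jump target
	ret				# the architectural "return" now goes
					# to the target in %r11
```

The return instruction's prediction comes from the return stack buffer, which points at the benign capture loop, while the architectural path follows the rewritten return address to the intended target.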

Naturally, these retpolines must be deployed to every indirect call in any program (the kernel or something else) that is to be protected. That is not a task that can reasonably be done by hand in non-trivial programs. But it is something that can be given over to a compiler to handle. LLVM patches have been posted to automate retpoline generation, but that is not particularly helpful for the kernel. GCC patches have not yet been circulated, but they can be found in this repository.

Several variants of the retpoline patches for the kernel have been posted by different authors who clearly were not always communicating as well as they could be. The current version, as of this writing, was posted by David Woodhouse. This series changes the kernel build system to use the new GCC option and includes manual conversions for indirect jumps made by assembly-language code. There is also a noretpoline command-line option which will patch out the retpolines entirely.

The retpoline implementation seems to be nearly stable and imposes a relatively small overhead overall. But there is still a lot of uncertainty around whether any given system should be using retpolines or IBRS — or a combination of the two. One might think that a hardware-based mechanism would be preferable, but the performance cost of IBRS is evidently quite high. So it seems that, as a general rule, retpolines are preferable to IBRS. But there are some exceptions.

One of those is that retpolines, it would seem, are not a complete defense on Skylake-generation Intel CPUs, which perform more aggressive speculative execution around return operations. Nobody has publicly demonstrated that this speculation can be exploited on Skylake processors, but some developers, at least, are nervous about leaving a possible vulnerability open. As Woodhouse said:

We had IBRS first, and especially on Broadwell and earlier, its performance really is painful. Then came retpoline, purely as an optimisation. A very *important* performance improvement, but an optimisation nonetheless. When looking at optimisations, it is rare for us to say "oh, well it opens up only a *small* theoretical security hole, but it's faster so that's OK".

So the more cautious administrators, at least, will probably want to stick with IBRS on Skylake processors. The good news is that IBRS performs better on those CPUs than it does on the earlier ones.

The other problem is that, even if the kernel can be built with retpolines, other code, such as system firmware, cannot be. Concerns about firmware surprised some developers, but it would seem that they are warranted. Quoting Woodhouse again:

In the ideal world, firmware exists to boot the kernel and then it gets out of the way, never to be thought of again. In the Intel world, firmware idiocy permeates everything and we sometimes end up making calls to it at runtime.

The firmware that runs in response to those calls is unlikely to be rebuilt with retpolines in the near future, so it may well contain vulnerabilities to variant-2 attacks. Thus the IBRS bit needs to be set before any such calls are made, regardless of whether IBRS is used by the kernel as a whole.

In summary

From all of the above, it's clear that the development community has not yet come close to settling on the best way to address the Spectre vulnerabilities. Much of what we have at the moment was the result of fire-drill development so that there would be something to ship when the disclosure happened. Moving the disclosure forward by six days at the last minute did not help the situation either.

It is going to take some time for everything to settle down — even if no other vulnerabilities crop up, which is not something that would be wise to count on. It's worth noting that, in the IBRS discussion, Tim Chen said that there are more speculation-related CPU features in the works at Intel. Perhaps those features will provide better defenses against the publicly known attacks. But even if no new vulnerabilities are about to jump out at us, it seems almost certain that more will be discovered at some point in the future.

Meanwhile, there is enough work to do just to get a proper handle on the current set of problems and to get acceptable solutions into the mainline kernel. It seems fair to say that these issues are going to distract the development community (for the kernel and beyond) for some time yet.

Comments (73 posted)

Polling a set of file descriptors to see which ones can perform I/O without blocking is a useful thing to do — so useful that the kernel provides three different system calls (select(), poll(), and epoll_wait() — plus some variants) to perform it. But sometimes three is not enough; there is now a proposal circulating for a fourth kernel polling interface. As is usually the case, the motivation for this change is performance.

On January 4, Christoph Hellwig posted a new polling API based on the asynchronous I/O (AIO) mechanism. This may come as a surprise to some, since AIO is not the most loved of kernel interfaces and it tends not to get a lot of attention. AIO allows for the submission of I/O operations without waiting for their completion; that waiting can be done at some other time if need be. The kernel has had AIO support since the 2.5 days, but it has always been somewhat incomplete. Direct file I/O (the original use case) works well, as does network I/O. Many other types of I/O are not supported for asynchronous use, though; attempts to use the AIO interface with them will yield synchronous behavior. In a sense, polling is a natural addition to AIO; the whole point of polling is usually to avoid waiting for operations to complete.

The patches add a new command (IOCB_CMD_POLL) that can be passed in an I/O control block (IOCB) to io_submit(), along with any of the usual POLL* flags describing the type of I/O that is desired — POLLIN for data available to read, for example. This command, like other AIO commands, will not (necessarily) complete before io_submit() returns. Instead, when the indicated file descriptor is ready for the requested type of I/O, a completion event will be queued. A subsequent call to io_getevents() (or the io_pgetevents() variant, added by the patch set, which blocks signals during the operation) will return that event, and the calling application will know that it can perform I/O on the indicated file descriptor. AIO poll operations always operate in "one-shot" mode; once a poll notification has been generated, a new IOCB_CMD_POLL IOCB must be submitted for that file descriptor if further notifications are needed.

Thus far, this interface sounds more difficult to use than the existing poll system calls. There is a payoff, though, that comes in the form of the AIO ring buffer. This poorly documented aspect of the AIO subsystem maps a circular buffer into the calling process's address space. That process can then consume notification events directly from the buffer rather than calling io_getevents(). Multiple notifications can be consumed without the need to enter the kernel at all, and polling for multiple file descriptors can be re-established with a single io_submit() call. The result, Hellwig said in the patch posting, is an up-to-10% improvement in the performance of the Seastar I/O framework. More recently, he noted that the improvement grows to 16% on kernels with page-table isolation turned on.

Internally to the kernel, any device driver (or other subsystem that exports a file_operations structure) can support the new poll interface, but some small changes will be required. It is not, however, necessary to support (or even know about) AIO in general. In current kernels, the polling system calls are all supported by the poll() method in struct file_operations:

int (*poll) (struct file *file, struct poll_table_struct *table);

This function must perform two actions: setting up notifications for when the underlying file is ready for I/O, and returning the types of I/O that could be performed without blocking now. The first is done by adding one or more wait queues to the provided table; the driver will perform a wakeup call on one of those queues when the state of the device changes. The current readiness state is the return value from the poll() method itself.

Supporting AIO-based polling requires splitting those two functions into separate file_operations methods. Thus, there are two new entries to that structure:

struct wait_queue_head *(*get_poll_head)(struct file *file, int mask);
int (*poll_mask) (struct file *file, int mask);

(The actual patches use the new typedef __poll_t for the mask, but that typedef isn't in the mainline kernel yet.) The polling subsystem will call get_poll_head() to obtain a pointer to the wait queue that will be notified when the device's I/O readiness state changes; poll_mask() will be called to get the current readiness state. A driver that implements these two operations need not (and probably should not) retain its implementation of the older poll() interface.

One potential limitation built into this API is that there can be only a single wait queue that receives notifications for a given file. The current interface, instead, allows multiple queues to be used, and a number of drivers take advantage of that fact to use, for example, different queues for read and write readiness. Contemporary wait queues offer enough flexibility that the use of multiple queues should not be necessary anymore. If a driver cannot be changed, Hellwig said, "the driver just won't support aio poll".

There have not been a lot of comments in response to the patch posting so far; many of the relevant developers have been preoccupied with other issues in the last week. It is hard to argue with a 10% performance improvement, though, so some form of this patch seems likely to get into the mainline sooner or later — interested parties can keep checking the mainline repository to see if it's there yet. Whether we'll see a fifth polling interface added in the future is anybody's guess, though.

Comments (38 posted)

The disclosure of the Meltdown and Spectre vulnerabilities has brought a new level of attention to the security bugs that can lurk at the hardware level. Massive amounts of work have gone into improving the (still poor) security of our software, but all of that is in vain if the hardware gives away the game. The CPUs that we run in our systems are highly proprietary and have been shown to contain unpleasant surprises (the Intel management engine, for example). It is thus natural to wonder whether it is time to make a move to open-source hardware, much like we have done with our software. Such a move may well be possible, and it would certainly offer some benefits, but it would be no panacea.

Given the complexity of modern CPUs and the fierceness of the market in which they are sold, it might be surprising to think that they could be developed in an open manner. But there are serious initiatives working in this area; the idea of an open CPU design is not pure fantasy. A quick look around turns up several efforts; the following list is necessarily incomplete.

What's out there

Consider, for example, the OpenPOWER effort, which is based on the POWER architecture. It is not a truly open-source effort, in that one has to join the club to play, but it is an example of making a processor design available for collaborative development. Products based on the (relatively) open designs are shipping. OpenPOWER is focused on the high end of the computing spectrum; chips based on this design are unlikely to appear in your handset or laptop in the near future.

Then there is OpenSPARC, wherein Sun Microsystems fully opened the designs of the SPARC T1 and T2 processors. A few projects tried to run with these designs, but it's not clear that anybody got all that far. At this point, the open SPARC designs are a decade old, and the future of SPARC in general is in doubt. Interesting things might happen if Oracle were to release the designs of current processors, but holding one's breath for that event is probably not the best of ideas.

OpenRISC is a fully open design for a processor aimed at embedded applications; it has one processor (the OpenRISC 1000) in a complete state. Some commercial versions of the OpenRISC 1000 have been produced, and reference implementations (such as the mor1kx) exist. The Linux kernel gained support for OpenRISC in the 3.1 release in 2011, and a Debian port showed up in 2014. The Debian work shut down in 2016, though. Activity around the kernel's OpenRISC code has slowed, though it did get SMP support in 2017. All told, OpenRISC appears to have lost much of the momentum it once had.

Much of the momentum these days, instead, appears to be associated with the RISC-V architecture. This project is primarily focused on the instruction-set architecture (ISA), rather than on specific implementations, but free hardware designs do exist. Western Digital recently announced that it will be using RISC-V processors in its storage products, a decision that could lead to the shipment of RISC-V by the billion. There is a development kit available for those who would like to play with this processor and a number of designs for cores are available.

Unlike OpenRISC, RISC-V is intended to be applicable to a wide range of use cases. The simple RISC architecture should be relatively easy to make fast, it is hoped. Meanwhile, for low-end applications, there is a compressed instruction-stream format intended to reduce both memory and energy needs. The ISA is designed to allow specific implementations to add extensions, making experimentation easier and facilitating the addition of hardware acceleration techniques.

The Linux support for RISC-V is quite new; indeed, it will only appear once the 4.15 release gets out the door. The development effort behind it appears to be quite active, and toolchain and library support are also landing in the appropriate projects. RISC-V seems to have quite a bit of commercial support behind it — the RISC-V Foundation has a long list of members. It seems likely that this architecture will continue to progress for some time.

A solution to the hardware problem?

In response to Meltdown and Spectre, the RISC-V Foundation put out a press release promoting the architecture as a more secure alternative. Current RISC-V processors are indeed not vulnerable to those attacks, by virtue of not performing any speculative memory accesses. But the Foundation says that RISC-V has advantages that go beyond any specific vulnerability: the openness of its development model, it claims, enables the quick incorporation of the best security ideas from a wide range of developers.

It has become increasingly clear that, while Linux may have won the battle at the kernel level, there is a whole layer of proprietary hardware and software running below the kernel over which we have no control. An open architecture like RISC-V is thus quite appealing; perhaps we can eventually claw some of that control back. This seems like a dream worth pursuing, but getting there involves some challenges that must be overcome first.

The first of these, of course, is that while compilers can be had for free, the same is not true of chip fabrication facilities, especially the expensive fabs needed to create high-end processors. If progress slows at the silicon level — as some say is already happening — and fabrication services become more available to small customers, then it may become practical for more of us to experiment with processor designs. It will never be as easy or as cheap as typing "make", though.

Until then, we're going to remain dependent on others to build our processors for us. That isn't necessarily bad; almost all of us depend on others to build most of our software for us as well. But a higher level of trust has to be placed in hardware. Getting reproducible builds working at the software level is a serious and ongoing challenge; it will be even harder at the hardware level. But without some way of verifying the underlying design of an actual piece of hardware, we'll never really know whether a given chip implements the design we're told it does.

Nothing about the RISC-V specification mandates that implementation designs must be made public. Even if RISC-V becomes successful in the marketplace, chances are good that the processors we can actually buy will not come with freely licensed designs. Large customers (those that build their own custom data centers) may well be able to insist on getting the designs too — or just create their own — but the rest of us will find ourselves in a rather weaker bargaining position.

Finally, even if we end up with entirely open processors, that will not bring an end to vulnerabilities at that level. We have a free kernel, but the kernel vulnerabilities come just the same. Open hardware may give us more confidence in the long term that we can retain control of our systems, but it is certainly not a magic wand that will wave our problems away.

None of this should prevent us from trying to bring more openness and freedom to the design of our hardware, though. Once upon a time, creating a free operating system seemed like an insurmountably difficult task, but we have done it, multiple times over. Moving away from proprietary hardware designs may be one of our best chances for keeping our freedom; it would be foolish not to try.

Comments (123 posted)