This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

A proposal to add Flatpak as an option for distributing desktop applications in Fedora 27 has recently made an appearance. It is meant as an experiment of sorts to see how well Flatpak and RPM will play together—and to fix any problems found. There is a view that containers are the future, on the desktop as well as the server; Flatpaks would provide Fedora one possible path toward that future. The proposal sparked a huge thread on the Fedora devel mailing list; while the proposal itself doesn't really change much for those uninterested in Flatpaks, some are concerned with where Fedora packaging might be headed once the experiment ends.

Flatpak, which was originally known as xdg-app, is both a packaging format and a mechanism to sandbox applications that is inspired, to some extent, by container technologies (e.g. Docker). It is meant to make it easier for users to install applications by bundling any needed dependencies (beyond the standard Flatpak runtime bundle) into the package. Flatpak would make it easier to get the "latest and greatest" version of an application or to run multiple versions side by side. The sandboxing features are targeted at providing secure compartmentalization so the applications cannot interfere with each other—or escape their sandbox if they get compromised. The vision is that projects can create a single Flatpak that could be installed on multiple distributions.

The GNOME community has been instrumental in developing Flatpak and many Fedora team members have been involved as well—not surprising given the overlap between the GNOME and Fedora teams. Owen Taylor is the owner of the proposal "to enable package maintainers to build Flatpaks of their applications and make those Flatpaks available for installation". The plan is fleshed out further on a Fedora wiki page. The idea is to make it possible for package maintainers to process the standard Fedora RPMs for their packages into Flatpaks.

In order to do that, there are two pieces that need to be built: a runtime and an application. The runtime is a collection of common libraries that Flatpaks can depend on; there would be versions for each Fedora release, which could coexist on a given system. The application piece consists of the program of interest, and any additional libraries it needs, bundled up in the Flatpak-specific format (which may require rebuilding the application and libraries with different build options). The target of the Fedora proposal is Flatpaks for graphical applications, so the runtime would be filled with GNOME-specific libraries.

Proposal

In early July, Fedora program manager Jaroslav Reznik posted the feature proposal as part of the normal review process for Fedora 27 features. The first response, perhaps predictably, came from Kevin Kofler, who asked a number of important questions before concluding: "I strongly oppose this change." The proposal says that Flatpaks will be built from RPMs, but those will not be the standard Fedora RPMs for the packages, as they will need to be relocated into the filesystem hierarchy used by Flatpak. Kofler asked if only Flatpaks would be shipped for these packages or, if not, which RPMs would be available.

Taylor responded that the rebuilt RPMs are not really useful outside of the Flatpaks, though they could still be downloaded from Koji. The regular RPMs would be available along with the Flatpaks, he said. He described his vision of where this might all be leading, which is part of what caused a bit of an uproar in the Fedora world. That vision was not part of the proposal, but suggested that over the following two releases (i.e. Fedora 28 and 29), graphical applications would fully move into Flatpaks and that standard RPMs might be dropped for them. He concluded:

But this is really highly dependent on how modularity work happens more widely in Fedora. "standard RPM packaging" assumes we still have a F tag in Koji where everything is built together with common coordinated dependencies. The Change proposal, in any case is really only about enabling this as an something that packagers may opt into if they want to.

Kofler's second set of questions had to do with the advantages of shipping Flatpaks for Fedora. The existing RPM-based distribution is working and Flatpaks have only downsides, he said:

I see only drawbacks compared to RPM, because everything not included in the base runtime must be bundled, so we have all the usual issues of bundled libraries: larger downloads, more disk consumption, more RAM consumption (shared system libraries are also shared in RAM), slower and less efficient delivery of security fixes, FHS [Filesystem hierarchy standard] noncompliance, etc. And the portability argument is moot when we are talking about delivering Fedora software to Fedora users.

Taylor said that he believed the proposal itself answered that question in its "Benefit to Fedora" section, which lists several benefits. The main ones seem to be that it allows application maintainers to choose their dependencies separately from the versions in the Fedora release, it provides a way of testing different versions of applications, and that it can sandbox some applications. Bastien Nocera disagreed with Kofler's assessment; he dismissed the FHS compliance question and questioned the assertion of slower security fixes, while also listing his set of "positive changes".

Sandboxes

Kofler pointed out the problems he sees with rebuilding the libraries for the Flatpak layout; he also described ways that the process of updating Flatpaks will lead to slower security fixes. But a good chunk of his response concerned sandboxing; he is not convinced that the "Flatpak way" is the right way forward. Michael Catanzaro acknowledged many of Kofler's points, but was enthusiastic about the Flatpak sandboxing. But, as Andy Lutomirski noted, there are two elements to Flatpak that aren't necessarily tied:

Flatpak provides two things that are very nearly orthogonal: packaging and sandboxing. Packaging is the system of bundles, apps, runtimes, etc that allows you to build a Flatpak, send it to a different machine, and run it there, even if the other machine runs a different distro. Sandboxing is Flatpak's system of portals, confinement, etc. Aside from the fact that both are based on namespaces, I see no reason at all that they need to be conflated. It should be entirely possible for Flatpak [to] run an "app" that is actually a conventional RPM installed on the host system using host libraries.

So, if sandboxing can be provided by other means, "what on earth is the point of forcing packagers to make Flatpaks?", Richard W. M. Jones asked. Others agreed with that assessment and wondered what Flatpaks provide that can't be solved with RPM packages. But it is a rare Fedora system that only has RPMs from Fedora repositories installed, as Bill Nottingham pointed out. Typically systems are built up from other sources as well: Coprs, RPMs from elsewhere, packages from language-specific repositories (e.g. PyPI), containers from various sources, packages retrieved using curl, and software built from tarballs. He continued:

If the only answer Fedora has for this is "convince everyone to only build RPMs using system [repo] components"... that's fighting a rear-guard battle that has already been lost. I don't think supporting Flatpak apps is necessarily any worse than what already has to happen with all of the above.

And Flatpaks do have some other advantages, as Taylor outlined:

There are no scriplets with [Flatpaks] - no arbitrary code execution at install time.

There is no ability for Flatpaks to drop arbitrary files at arbitrary locations on your system. Well, the nice thing is that: The idea is that you don't *have* to inspect a flatpak before installation to make sure that it's not dangerous.

Fedora == RPM?

But in another sub-thread, Kofler wondered why users would get Flatpaks from Fedora; why wouldn't they just get them from upstream? RPMs are an important feature of Fedora, he said:

The whole point of delivering software under the Fedora umbrella is to deliver it as RPMs. If there is no RPM, delivering through Fedora is completely useless.

Fedora project leader Matthew Miller took exception to that characterization: "I strongly dispute the idea that Fedora must be tied to a particular packaging technology." But Stephen J. Smoogen agreed with Kofler, at least from a branding standpoint: "RPM is part and parcel of what makes Fedora for most people." Others, including Miller, disagreed; various examples, counter-examples, and car analogies were offered up, but it seems there are some fundamentally different views of what Fedora is.

The "plan" that Taylor outlined is part of what has gotten some in the Fedora community riled. It posits a future without RPMs, at least for some packages, but it is only Taylor's vision, not something that is currently being considered. As he put it:

But I want to be clear that there is no *proposal* on the table to ship things Flatpak only, and *no proposed timescale*. And there won't be until we know how the tools work out for packagers, how Flatpak usage works out for users, and we have a significant body of Fedora packages built as Flatpaks to look at things like installed size and network usage. These are things we can only get to by building out the infrastructure so that packagers can start trying building Flatpaks and users can start trying installing them.

The intent seems to be to test out Flatpaks in a "real world" environment to see what advantages, problems, and downsides they have. Many, including Miller, believe it is in keeping with the nature of the distribution to do that kind of experiment. There are a number of ideas swirling around the industry these days, and containerization is one of them, so it makes sense for Fedora to explore that, Christian Schaller said:

Containers have caught on due to solving some important problems and thus people are looking at models for what the future operating system would look like where containers are the primary content delivery mechanism. In Fedora we have efforts [around] Docker/OCI containers and Flatpak containers and we are looking at image based OS installs with the Atomic and Atomic Workstation effort. The fact that we are developing stuff like this in Fedora is a good thing as it means that if it does turn out to be a better model we are well positioned to take advantage of the shift in the market. And if the scepticism some people have about containers turns out to be well founded we still have our RPM based OS to fall back on.

The Fedora Engineering Steering Committee (FESCo) took up the proposal at its July 21 meeting. As can be seen in the log (starting at 16:04), FESCo members noted the opposition, but found that it was mostly ideological differences over packaging formats. Adding Flatpaks in parallel to RPMs is not really harming anyone or anything. If the experiment is successful, perhaps there will be other proposals down the road that do change the picture with regard to RPM availability, but those can be dealt with then. In the end, FESCo unanimously approved the proposal, so Fedora 27 should be a good testbed for those who are interested in trying out Flatpaks.

Comments (69 posted)

Is it truly an efficient use of cloud computing resources to run traditional operating systems inside virtual machines? In many cases, it isn't. An interesting alternative is to bundle a program into a unikernel, which is a single-tasking library operating system made specifically for running a single application in the cloud. A unikernel packs everything needed to run an application into a tiny bundle and, in theory, this approach would save disk space, memory, and processor time compared to running a full traditional operating system. IncludeOS is such a unikernel; it was created to support C++ applications. Like other unikernels, it is designed for resource-efficiency on shared infrastructure, and is primarily meant to run on a hypervisor.

Frequently, virtual machines end up running a full server operating system, though the entire instance is devoted to running only a few applications or even just one. However, every running instance on a physical machine means a full set of services and binaries that's unnecessarily replicated. Unikernel developers take the opportunity to aggressively pare down the operating system to a bare minimum. Unikernels are at the extreme end of the possible answers to the question "how small can you make an operating system?" A unikernel is an instance of a single program "baked together" with a small library that provides the operating system and acts as an interface to the (virtual) hardware.

A history of unikernels

The idea of shrinking the operating system has its roots in microkernel research, which was spurred by monolithic kernels that were growing in size and complexity to unwieldy levels. A microkernel implements only a tiny amount of necessary functionality in privileged mode (such as interrupt handling, low-level memory management, and scheduling), with the rest being implemented as servers in user space. Exokernels, which were proposed by systems researchers at MIT in the 1990s, take the concept further by implementing most of the operating system as custom libraries linked to applications. This concept of library operating systems proved popular, and a number of projects were created around the concept, such as Nemesis from the University of Cambridge, and Drawbridge from Microsoft Research.

The term unikernel was proposed by a group of operating systems researchers in a paper [PDF] from 2013 that described their MirageOS project. While early projects included various drivers to support a multitude of hardware much like a traditional operating system, unikernels were designed to primarily run on virtual hardware, so they do not need as much driver support. Unikernels are also compiled with just enough of the library to support the application contained within it, and nothing more. The idea is that unikernels could be deployed side by side on a hypervisor, much like regular programs are run on a traditional operating system.

Unikernels address the use case of needing strong isolation for a user's application on shared infrastructure. Multi-tenancy on clouds means that every user's application is completely separated from those of others, but requiring each user to run a full operating system is wasteful. Unlike Linux containers, which run a single instance of the kernel that partitions users' applications using namespaces, control groups, and security policies, unikernels benefit from the stronger resource isolation of hypervisors. They get that isolation while being nearly as lightweight as a container. The drawback to unikernels is that users are constrained by what the unikernel library provides in terms of operating-system interfaces.

The choice of programming language to write a unikernel application in is also dependent on the underlying library support for it. IncludeOS supports C++, while MirageOS uses OCaml as its target programming language; other unikernel projects have been created that support languages like Haskell (HaLVM) and Erlang (LING). There is a collection of links to active unikernel projects found here.

IncludeOS

IncludeOS is a project to create a C++ API for the development of unikernel-based applications. When an application is built using IncludeOS, the development toolchain will link in the parts of the IncludeOS library required to run it and create a disk image with a bootloader attached. An IncludeOS image can be hundreds of times smaller than the Ubuntu system image for running an equivalent program. Start times for the images run in the hundreds of milliseconds, making it possible to spin up many such virtual machine images quickly.

When an IncludeOS image boots, it initializes the operating system by setting up memory, running global constructors, and registering drivers and interrupt handlers. In an IncludeOS unikernel, virtual memory is not enabled, and a single address space is used by both the application and the unikernel library. Therefore there is no concept of system calls or user space; all operating system services are called with a simple function call to the library and all run in privileged mode.

The unikernel is also single-threaded, and there is no preemption. Interrupts are deferred when they happen, and attended to at every iteration of the event loop. The design suggests user programs also be written to follow the asynchronous programming model, with callbacks installed to respond to operating system events. For example, a TCP socket can be set up in a user program and a callback inside the application handles the connection when a third party attempts to connect.

An advantage of IncludeOS's minimalist design is the reduction of the attack surface for the application. With a self-contained application appliance, there are no shells or other tools that would be helpful to an attacker if they manage to compromise the application. Additionally, the stack and heap locations are randomized to discourage attackers.

IncludeOS does not implement all of POSIX. The developers intend to implement only those parts of POSIX that are needed, as needs arise; full POSIX compliance is unlikely ever to be pursued as a goal. Currently, there are no blocking calls implemented in IncludeOS, since the event-loop model is the favored way to use it. IncludeOS also lacks a writable filesystem at this point.

There are plans in the pipeline to implement threads as fibers, which are a cooperative form of threading. Since there is no preemption in IncludeOS, fibers yield voluntarily to give other fibers a chance to run. Apart from some standard C++ library calls, a special IncludeOS API is used to help construct applications as unikernels.

Business model

IncludeOS started off as a university research project at Oslo and Akershus University College of Applied Sciences; it was developed by Alfred Bratterud and his associates. The project spun off into a startup, founded by Bratterud together with Per Buer. IncludeOS is distributed under the Apache 2.0 license, with the code available on GitHub. Outside of the company, there is a small community of voluntary contributors that numbers around a dozen people. Although most contributions from volunteers are small bug fixes, there have been some considerable contributions by IBM, which added support for running IncludeOS on ukvm.

As a company, IncludeOS is still in the early stages. According to Buer, most of the funding it has received is in the form of grants from the Norwegian government. The code for the IncludeOS unikernel is open source, but there is a plan to create proprietary enterprise management tools for running unikernels in large deployments in data centers and in the cloud. The company has acquired a customer for which it is adding features such as network load balancing, a firewall, and additional hardening of the codebase. Other missing features will be added as needed, driven primarily by the business needs of customers.

Trying it out

Currently, there are no IncludeOS packages for Linux, but there are instructions on how to create a unikernel from the source code. IncludeOS works on KVM/QEMU and VirtualBox; in theory it could also boot on bare-metal hardware, but this has not been verified.

Since the code is not yet meant for production, the results of following the instructions may vary. I tried multiple installations on different versions of Ubuntu and got as far as compiling a unikernel image and running the sample application, an HTTP server. However, the network bridging between the unikernel and its host was not set up correctly, so I could not connect to it from a web browser. Despite the helpful support from members of the developer community in the IncludeOS development chat room, something in my setup caused problems that could not be reproduced. The compilation and installation scripts are rough around the edges, so other users trying them out may face problems as well. Ubuntu users will need at least version 16.04 to build the latest version of IncludeOS.

Conclusion

Despite the popularity of cloud computing and virtualization, we are still trying to figure out the best ways to take advantage of the technology. Containers grew out of the desire for lightweight partitioning of guest applications, but unikernels appear to provide an even better option with stronger isolation. The downside of a completely new operating system and programming paradigm is that most legacy software will not work on it without significant modification. However, lightweight, virtualized, and isolated software appliances are a logical way to run applications in the cloud; as IncludeOS and other unikernels become more sophisticated, this approach may become the primary method of deploying such services. With several different competing unikernel projects taking off, it will be interesting to see how IncludeOS (and the unikernel paradigm itself) fares against more traditional operating systems. Unikernels are highly specialized, and it remains to be seen if the lightweight virtualization aspect of deployment is enough of an incentive for developers to invest time and resources into building applications in this manner.

[I would like to thank Per Buer and the rest of the IncludeOS development community for their feedback when writing this article.]

Comments (19 posted)

Improving the security of a system often involves tradeoffs, with the costs measured in terms of convenience and performance, among others. To their frustration, security-oriented developers often discover that the tolerance for these costs is quite low. Defenses against reference-count overflows have run into that sort of barrier, slowing their adoption considerably. Now, though, it would appear that a solution has been found to the performance cost imposed by reference-count hardening, clearing the way toward its adoption throughout the kernel.

Reference-count overflows typically come about as the result of a programming error. Code that increments the reference count on an object may neglect to decrement it in certain error paths, for example. Such errors can allow an attacker to repeatedly increment a counter until it overflows, at which point the object in question can be made to appear to be unused and freed while it is, in fact, still in use. The resulting use-after-free vulnerability is often exploitable to fully compromise the system.

The path toward protection against reference-count overflows in the kernel has been a long one. It started with code from the PaX/grsecurity patch set, but the initial approach of adding protection to the core atomic_t type ran into opposition and had to be changed. The next step was to introduce a new refcount_t type specifically for reference counts and to add the protections there. This type was merged for the 4.11 development cycle and various kernel subsystems were changed to use it, but refcount_t upset the networking developers, who were unwilling to pay the performance cost associated with it.

The networking layer is often where such patches run into trouble, but it was not the only place this time around. Andrew Morton recently complained about a refcount_t conversion in the interprocess communication (IPC) subsystem, for example, saying that there was no point in slowing down "simple, safe, old, well-tested code". It began to appear that, even if reference-count protection were added throughout the kernel, it would be disabled by distributors who feared the performance hit.

One of the core truths of secure-systems development is that disabled (or never implemented) protective measures are remarkably ineffective at stopping attackers. Another one is that "safe, old, well-tested" code may be merely old, as Ingo Molnar pointed out:

It's old, well-tested code _for existing, sane parameters_, until someone finds a decade old bug in one of these with an insane parameters no-one stumbled upon so far, and builds an exploit on top of it.

Truly protecting the kernel against reference-count overflows requires making the checks as universal as possible. That, in turn, requires either convincing developers to accept the performance cost of those checks or finding a way to reduce that cost to acceptable levels. The latter course is almost certainly the path of least resistance — if a solution to the performance cost can be found.

Update: the single instruction mentioned below has been claimed by the PaX Team as his work. The patch set remains Kees's.

With his fast refcount overflow protection patch set, Kees Cook would indeed appear to have found that solution. It works by adding a single instruction to the existing (highly optimized) atomic_t implementation that catches the case where the reference count goes negative (as happens when the counter overflows). The instruction is especially easy for the processor's branch-prediction logic to guess correctly, so it performs well, as demonstrated by microbenchmark results posted with the patch set. The standard atomic_t implementation ran the benchmark in 82.249 billion cycles; the new refcount_t code, instead, took 82.211 billion cycles — exactly the same within the margin of error, in other words. The older refcount_t implementation requires 144.8 billion cycles to run the test, for comparison.

The current patch set is for the x86 architecture only. Since assembly work is required, each of the other architectures will need to be added individually when somebody gets around to doing it. There do not appear to be significant obstacles to making this technique work on the other major architectures.

There is a cost to this change, relative to the full refcount_t implementation: it no longer detects the "increment from zero" case. If an object's reference count drops to zero, that object will normally be freed; a subsequent increment operation suggests that a reference still existed and the freed object may still be in use. This, obviously, would be a good situation to catch, but nobody has found a way to do so without adding to the expense of increment operations. Cook claimed in the patch set that the overflow case that the new refcount_t does catch is the most common, though, and cited two exploits published in 2016 (CVE-2014-2851 and CVE-2016-0728) that would have been blocked had that checking been in place.

There are still some developers who remain unenthusiastic about the refcount_t type; see this complaint from Eric Biederman (and Cook's response) for example. The remaining disagreements seemed to be based on a couple of arguments: (1) refcount_t doesn't fix all reference-count-related problems, and (2) using it implies a presumption of bugginess that some developers find hurtful to their pride. But, with the performance issue seemingly solved, those other complaints seem unlikely to block the implementation of reference-count hardening in most of the kernel. That can only be good news for those who are concerned about security.

Comments (34 posted)

There are a few reasons for wanting the ability to get proper stack traces out of the kernel, including profiling, tracing, and debugging kernel crashes. Historically, the kernel's tracebacks have been unreliable for a number of reasons, most of which have been fixed in recent years. Now it seems likely that the 4.14 kernel will include a new mechanism that should put our traceback problems behind us — for now.

The state of the kernel's call stack can be surprisingly hard to interpret. Normally, it is made up of normal C function calls, but then assembly-language code, interrupts, processor traps, etc. tend to confuse the picture. A confusing stack can, naturally, cause the "unwinder" code that tries to derive the current call chain from that stack's contents to do strange things; as a result, the kernel has long eschewed any sort of complicated unwinding code. For the most part, developers who deal with kernel tracebacks have learned to cope with occasional bad data.

The live patching effort, though, depends on accurate call-stack information for its consistency model; in short, it needs to be able to tell which functions appear in the call stack of any thread in the system. Getting there involved implementing the compile-time stack validation mechanism to ensure that all kernel code keeps the stack in reasonable condition at all times. The final step is a proper unwinder that uses this now-reliable stack information.

Last May, an attempt to add such an unwinder based on the DWARF debugging records emitted by the compiler ran into trouble when Linus Torvalds saw it. He noted that, the last time this experiment was tried, the unwinder ran into continual problems from changes to assembly-language code or problems with incorrect DWARF records and, as a result, proved to be unmaintainable. Thus, he said: "I do not ever again want to see fancy unwinders with complex state machine handling used by the oopsing code." So DWARF, which requires that sort of complexity, did not appear to be a good option.

That might have been the end of the story, given that Torvalds was firm in his position, but Josh Poimboeuf mentioned an idea he had been pondering for a bit. The objtool utility that performs stack validation at compile time builds a model of the state of the stack at every point in the built kernel. Perhaps, he thought, objtool could emit the debugging records to make that information available to the unwinder in a format rather simpler than DWARF. The result could be a more reliable unwinder using a more efficient data format that, importantly, is fully under the control of the kernel community and, one would hope, relatively unlikely to break.

Two months or so later, the result is the ORC unwinder. The name ostensibly stands for "oops rewind capability", though it's obviously a play on DWARF (which, in turn, is a play on the ELF executable format). The new ORC format is simple at its core; it is based on this structure:

    struct orc_entry {
        s16 sp_offset;
        s16 bp_offset;
        unsigned sp_reg:4;
        unsigned bp_reg:4;
        unsigned type:2;
    };

The purpose of an orc_entry structure is to tell the unwinder code how to orient itself on the stack. There is one of these structures associated with each executable address in the kernel, along with a simple data structure allowing the unwinder to find the correct entry given a program-counter address.

The interpretation of the structure depends on the type field. If it is ORC_TYPE_CALL, the code is running within a normal C-style call frame, and the beginning of that frame can be found by adding the sp_offset value to the value found in the register indicated by sp_reg. If, instead, type is ORC_TYPE_REGS, then that sum points to a pt_regs structure describing the processor (and stack) state prior to a system call. Finally, ORC_TYPE_REGS_IRET says that sp_reg and sp_offset can be used to find a return frame for a hardware interrupt. Those three possibilities appear to be enough to describe any situation that will be encountered, at least on the x86 architecture. (The bp_reg and bp_offset fields don't appear to have much use in the current implementation.)

The resulting mechanism is far simpler than the DWARF mechanism. Among other things, that means it's quite a bit faster — a factor of at least 20x is claimed. Unwinding performance may not matter much when responding to a kernel oops, but it's a big deal for function tracing and profiling. The ORC approach is also claimed to be more reliable than telling the compiler to use frame pointers, and it doesn't suffer from the significant performance hit that frame pointers bring with them. And, as noted above, the ORC format is entirely under the control of the kernel community, so it shouldn't break with new compiler versions and, if it does, kernel developers can fix it.

Of course, it's hard to predict just how creative the compiler developers of the future may be when it comes to breaking call-stack information. Poimboeuf acknowledges that risk in the patch posting, but notes that:

If newer versions of GCC come up with some optimizations which break objtool, we may need to revisit the current implementation. Some possible solutions would be asking GCC to make the optimizations more palatable, or having objtool use DWARF as an additional input, or creating a GCC plugin to assist objtool with its analysis.

The other disadvantage is that the ORC format takes more space than DWARF, occupying 1MB or so of extra memory. Poimboeuf suggested that the memory use could be reduced if it turns out to be a real problem. "However, it will probably require sacrificing some combination of speed and simplicity".

Torvalds has not yet made his feelings known regarding the ORC patches, though he had in the past indicated that he thought the combination of objtool and a simpler format might work. Ingo Molnar, meanwhile, has applied the patches to the tip tree, indicating that they are likely to show up in a 4.14 pull request. So, barring last-minute problems, the multi-year effort to get a reliable stack unwinder in the kernel may be close to completion.

Comments (14 posted)

membarrier()

The membarrier() system call is arguably one of the strangest offered by the Linux kernel. It expensively emulates an operation that can be performed by a single unprivileged barrier instruction, using an invocation of the kernel's read-copy-update (RCU) machinery — all in the name of performance. But, it would seem, even that is not fast enough, causing users to fall back to complex and brittle tricks. An attempt to fix the problem is now under discussion, but not everybody is convinced that the cure is better than the disease.

membarrier()

membarrier() was first discussed in 2010. The initial use case was to support user-space RCU, which uses a shared-memory variable to indicate that a thread is running in an RCU critical section. Changes to RCU-protected objects (and, in particular, the freeing of the old version of a changed object) cannot happen while any thread is in an RCU critical section, so code that performs such an operation must check this shared flag to ensure that the change is safe. This scheme can be thwarted, though, if the processor reorders operations, causing the object to be freed before the variable is checked.

Processors provide memory-barrier instructions so that this kind of scenario can be prevented. Unfortunately, these instructions are relatively slow, since they must serialize access across the entire machine. Memory barriers must also occur in pairs to function properly; in this case, one barrier would be needed whenever a thread sets the "in RCU critical section" flag, while the other would happen after that flag is checked, but before any subsequent action is taken. This symmetric pairing of barriers works well in many situations, but it is poorly suited to the RCU use case in particular.

The problem comes from the fact that entry into an RCU critical section is a frequent occurrence, while changes to RCU-protected objects can be quite rare. So it is possible that hundreds (or more) rcu_read_lock() calls will be made where no thread is trying to change the protected objects; in such cases, all of the overhead incurred by those memory barriers is wasted. In situations where this sort of asymmetrical access pattern pertains, it would be worthwhile to greatly increase the cost of a memory-barrier operation — if that cost could be moved entirely to the thread performing the change, allowing the read path to be fast.
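The asymmetric pattern described above can be sketched as follows; the in_critical_section flag and the function names are illustrative (a real user-space RCU library keeps per-thread state), and only the membarrier() system call itself is taken from the article:

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MEMBARRIER_CMD_SHARED
#define MEMBARRIER_CMD_SHARED (1 << 0)
#endif

/* Illustrative shared flag; real user-space RCU is per-thread. */
static _Atomic int in_critical_section;

/* Fast, frequent read path: only a compiler barrier, no fence instruction. */
static inline void reader_enter(void)
{
    atomic_store_explicit(&in_critical_section, 1, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);  /* compiler-only barrier */
}

static inline void reader_exit(void)
{
    atomic_store_explicit(&in_critical_section, 0, memory_order_release);
}

/* Rare, slow update path: membarrier() makes every running thread execute
 * a full barrier, so the readers never have to.  Returns -1 if the kernel
 * does not support the call. */
static inline long updater_heavy_barrier(void)
{
#ifdef __NR_membarrier
    return syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0);
#else
    return -1;
#endif
}
```

The entire cost of ordering is concentrated in updater_heavy_barrier(), which is exactly the asymmetry the RCU use case wants.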

That is where membarrier() comes in. The initial version simply sent an inter-processor interrupt (IPI) causing every processor to execute a memory-barrier instruction. That approach was not entirely popular, since the IPIs wake every processor on the system and can cause unexpected latencies for realtime threads. Subsequent discussion caused the implementation to shift to calling synchronize_sched(), a kernel function that, among other things, ensures that every processor will have executed a memory barrier. At the time, the patches included an "expedited" option that would use IPIs instead, but when membarrier() was merged (many years later, in 2015), that option was not included.

The expedited option

Recently, Paul McKenney posted a patch adding the expedited option back to membarrier(). This change raised some eyebrows, since the concerns about IPIs have not gone away. Mathieu Desnoyers, the original author of the membarrier() patch, asked how it was possible to offer the expedited option without impacting realtime processes, and Peter Zijlstra worried about the denial-of-service attack that can be carried out by code as simple as:

    for (;;)
        membarrier(MEMBARRIER_CMD_SHARED_EXPEDITED, 0);

At the moment, it would seem, there are no new answers to any of those questions, but there is a stronger incentive to add the expedited option, and it appears that this option does not create any problems that do not already exist.

As McKenney described it, there are a number of users who are finding that the existing membarrier() system call is too slow. That is perhaps unsurprising; synchronize_sched() will force the calling thread to block until every CPU in the system goes through an RCU grace period, so there is a certain amount of latency built in. These users have found a trick to get the desired behavior without calling membarrier(): they make a call to either mprotect() or munmap() instead. Either of those system calls will, on an x86 system, cause an IPI to be issued to ensure that the affected address ranges are removed from each translation lookaside buffer (TLB). They also cause a certain amount of useless memory-management overhead but, evidently, the end result is still faster than calling membarrier().

Besides its fundamental inelegance, this approach has a couple of problems. One is that it could easily break in future kernels or on future hardware if those system calls can be made to work without IPIs; if such an optimization opportunity presents itself, the kernel developers are highly likely to take it. In fact, the IPIs are not necessary on all current hardware, leading McKenney to note that this trick "has the slight disadvantage of not working at all on arm and arm64". Adding the IPI capability to membarrier() will allow for better performance on all architectures without the need to resort to tricks.

Since users can already create IPIs at will with the memory-management calls, McKenney does not believe that adding that ability to membarrier() will make things worse. But there are, he said, a few things that could be done to reduce the potential for abuse of the expedited option. These range from complete "defanging" by disabling expedited grace periods at boot time to limiting the number of expedited membarrier() calls that can be made in a given time period. Various approaches to limiting the IPIs to the processors that actually need to receive them (those processors actually running threads from the application calling membarrier()) are also under consideration. Providing a mechanism for expedited barriers will, at least, give the kernel community the possibility of handling any abuse.

This is a patch that is likely to go through further revisions and discussion before it makes it close to the mainline. Among other things, the people who have been calling for a faster membarrier() need to verify that the expedited option solves their problem. "Obviously, unless there are good test results and some level of user enthusiasm, this patch goes nowhere", McKenney said. The actual code, at the moment, fits on a single screen; the discussion around it seems unlikely to be so concise.

Comments (6 posted)

On July 21, Savoir-faire Linux (SFL) announced the release of version 1.0 of its Ring communication tool. It is a cross-platform (Linux, Android, macOS, and Windows) program for secure text, audio, and video communication. Beyond that, though, it is part of the GNU project and is licensed under the GPLv3. Given the announcement, it seemed like a quick trial was in order. While it looks like it has great promise, Ring 1.0 falls a bit short of expectations.

Privacy and security are two of the main attributes that Ring is striving for. To start with, Ring provides a peer-to-peer architecture that avoids a central server, which is done to maintain the privacy of the participants. The data is encrypted between the endpoints to thwart those in the middle who might want to listen in. Ring evolved from the SFLphone project, but moved away from SFLphone's centralized architecture, which is part of why the name has changed.

The network is coordinated via a distributed hash table (DHT) that provides distributed key-value data storage. Ring uses the OpenDHT library to implement its hash table, which can store signed and encrypted data using public-key cryptography. Operations like calling a user or listening for incoming calls are coordinated via entries into the DHT as described in the rather terse technical overview on the Ring wiki. In addition, there is more information about OpenDHT in an SFL blog post.

There is also an experimental blockchain-based name server. This "RingNS" server uses the Ethereum blockchain and maps a username to a RingID, which is what identifies a Ring user. The RingID is an SHA-1 fingerprint of the public key of the user. The RSA key pair for the user must be at least 4096 bits long. A bit more information about the use of the blockchain can be found in a blog post from November 2016. That was a busy month for the project, as it became an official GNU package and released its second beta version then.

The RingIDs are not public, so users must exchange them (or usernames associated with them) in order to communicate. The RingID provides anonymity, if desired, as well as privacy, since a user cannot be contacted without using that ID. For users that don't have (or don't want) usernames, the Android app offers a QR-code mechanism to avoid exchanging 40 hex digits. The QR code can be scanned by an associate or the ID can be entered by hand.

I tested the Android app with a certain grumpy editor that I know. The text messaging function worked well, if a bit slowly, once we had established connectivity via our usernames. Video and audio calling, on the other hand, were not functional at all—a bit of video or a still image would occasionally slip through, but audio never made it. The "1.0" version number may be a bit misleading at this point.

Contributions are welcome, of course. The source code is managed in a Gerrit instance, but is also mirrored in the SFL GitHub repositories. There is also a mailing list for those interested.

There are official downloads available for Linux and Android, though the Google Play Store (or F-Droid once it gets updated) may be simpler for Android. Packages for Debian 9, three Ubuntu releases (16.04, 17.04, and 17.10), and two Fedora releases (25 and 26) are available. The community has contributed packages for Arch Linux and openSUSE, as well. Beyond that, packages for Windows (7, 8, 8.1, and 10) and macOS (10.10 and higher) are available too. Notably, there is no iOS version, nor any mention of why; it may be due to the GPLv3 license not being particularly welcome in Apple's app store.

As with other communication (and social networking) applications, the network effect is an important consideration. If the person you are trying to reach is not using Ring, it will be impossible to do so securely using the app (though it does have unencrypted SIP capability). Ring is also fairly new and has not been studied thoroughly (yet, hopefully), so any privacy claims are premature. It is nice to see a free software, privacy-focused communication tool, however; it certainly has the potential to be an important piece of the free-software toolbox.

Comments (6 posted)