This edition contains the following feature content:

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.

Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

During KubeCon + CloudNativeCon Europe 2018, Justin Cormack and Nassim Eddequiouaq presented a proposal to simplify the setting of security parameters for containerized applications. Containers depend on a large set of intricate security primitives that can have weird interactions. Because they are so hard to use, people often just turn the whole thing off. The goal of the proposal is to make those controls easier to understand and use; it is partly inspired by mobile apps on iOS and Android platforms, an idea that trickled back into Microsoft and Apple desktops. The time seems ripe to improve the field of container security, which is in desperate need of simpler controls.

The problem with container security

Cormack first stated that container security is too complicated. His slides stated bluntly that "unusable security is not security" and he pleaded for simpler container security mechanisms with clear guarantees for users.

"Container security" is a catchphrase that actually includes all sorts of measures, some of which we have previously covered. Cormack presented an overview of those mechanisms, including capabilities, seccomp, AppArmor, SELinux, namespaces, control groups — the list goes on. He showed how docker run --help has a "ridiculously large number of options"; there are around one hundred on my machine, with about fifteen just for security mechanisms. He said that "most developers don't know how to actually apply those mechanisms to make sure their containers are secure". In the best-case scenario, some people may know what the options are, but in most cases people don't actually understand each mechanism in detail.

He gave the example of capabilities; there are about forty possible values that can be provided for the --cap-drop option, each with its own meaning. He described some capabilities as "understandable", but said that others end up in overly broad boxes. The kernel's data structure limits the system to a maximum of 64 capabilities, so a bunch of functionality was lumped together into CAP_SYS_ADMIN , he said.

Cormack also talked about namespaces and seccomp. While there are fewer namespaces than capabilities, he said that "it's very unclear for a general user what their security properties are". For example, "some combinations of capabilities and namespaces will let you escape from a container, and other ones don't". He also described seccomp configuration as a "long JSON file", since that is how Kubernetes configures it; even though those files could "usefully be even more complicated", he said, they are already "very difficult to write".

Cormack stopped his enumeration there, but the same applies to the other mechanisms. He said that while developers could sit down and write those policies for their application by hand, it's a real mess and makes their heads explode. So instead developers run their containers in --privileged mode. It works, but it disables all the nice security mechanisms that the container abstraction provides. This is why "containers do not contain", as Dan Walsh famously quipped.

Introducing entitlements

There must be a better way. Eddequiouaq proposed this simple idea: "provide something humans can actually understand without diving into code or possibly even without reading documentation". The solution proposed by the Docker security team is "entitlements": the ability for users to choose simple permissions on the command line. Eddequiouaq said that application users and developers alike don't need to understand the low-level security mechanisms or how they interact within the kernel; "people don't care about that, they want to make sure their app is secure."

Entitlements divide resources into meaningful domains like "network", "security", or "host resources" (like devices). Behind the scenes, Docker translates those into whatever security mechanisms are available. This implies that the actual mechanism deployed will vary between runtimes, depending on the implementation. For example, a "confined" network access might mean a seccomp filter blocking all networking-related system calls except socket(AF_UNIX|AF_LOCAL), along with dropping network-related capabilities. On some platforms AppArmor would deny network access, while on others SELinux would perform similar enforcement.

Eddequiouaq said the complexity of implementing those mechanisms is the responsibility of platform developers. Image developers can ship entitlement lists along with container images created with a regular docker build , and sign the whole bundle with docker trust . Because entitlements do not specify explicit low-level mechanisms, the resulting image is portable to different runtimes without change. Such portability helps Kubernetes on non-Linux platforms do its job.

Entitlements shift the responsibility for configuring sandboxing environments to image developers, but also empower them to deliver security mechanisms directly to end users. Developers are the ones with the best knowledge about what their applications should or should not be doing. Image end-users, in turn, benefit from verifiable security properties delivered by the bundles and the expertise of image developers when they docker pull and run those images.

Eddequiouaq gave a demo of the community's nemesis: Docker inside Docker (DinD). He picked that use case because it requires a lot of privileges, which usually means using the dreaded --privileged flag. With the entitlements patch, he was able to run DinD with network.admin , security.admin , and host.devices.admin , which looks like --privileged , but actually means some protections are still in place. According to Eddequiouaq, "everything works and we didn't have to disable all the seccomp and AppArmor profiles". He also gave a demo of how to build an image and demonstrated how docker inspect shows the entitlements bundled inside the image. With such an image, docker run starts a DinD image without any special flags. That requires a way to trust the content publisher because suddenly images can elevate their own privileges without the caller specifying anything on the Docker command line.

Goals and future

The specification aims to provide the best user experience possible, so that people actually start using the security mechanisms provided by the platforms instead of opting out of security configurations when they get a "permission denied" error. Eddequiouaq said that Docker eventually wants to "ditch the --privileged flag because it is really a bad habit". Instead, applications should run with the least privileges they need. He said that "this is not the case; currently, everyone works with defaults that work with 95% of the applications out there." Those Docker defaults, he said, provide a "way too big attack surface".

Eddequiouaq opened the door for developers to define custom entitlements because "it's hard to come up with a set that will cover all needs". One way the team thought of dealing with that uncertainty is to have versions of the specification but it is unclear how that would work in practice. Would the version be in the entitlement labels (e.g. network-v1.admin ), or out of band?

Another feature proposed is the control of API access and service-to-service communication in the security profile. This is something that's actually available on phones, where an app can only talk with a specific set of services. But that is also relevant to containers in Kubernetes clusters as administrators often need to restrict network access with more granularity than the "open/filter/close" options. An example of such policy could allow the "web" container to talk with the "database" container, although it might be difficult to specify such high-level policies in practice.

While entitlements are now implemented in Docker as a proof of concept, Kubernetes has the same usability issues as Docker so the ultimate goal is to get entitlements working in Kubernetes runtimes directly. Indeed, its PodSecurityPolicy maps (almost) one-to-one with the Docker security flags. But as we have previously reported, another challenge in Kubernetes security is that the security models of Kubernetes and Docker are not exactly identical.

Eddequiouaq said that entitlements could help share best security policies for a pod in Kubernetes. He proposed that such configuration would happen through the SecurityContext object. Another way would be an admission controller that would avoid conflicts between the entitlements in the image and existing SecurityContext profiles already configured in the cluster. There are two possible approaches in that case: the rules from the entitlements could expand the existing configuration or restrict it where the existing configuration becomes a default. The problem here is that the pod's SecurityContext already provides a widely deployed way to configure security mechanisms, even if it's not portable or easy to share, so the proposal shouldn't break existing configurations. There is work in progress in Docker to allow inheriting entitlements within a Dockerfile. Eddequiouaq proposed that Kubernetes should implement a simple mechanism to inherit entitlements from images in the admission controller.

The Docker security team wants to create a "widely adopted standard" supported by Docker swarm, Kubernetes, or any container scheduler. But it's still unclear how deep into the Kubernetes stack entitlements belong. In the team's current implementation, Docker translates entitlements into the security mechanisms right before calling its runtime (containerd), but it might be possible to push the entitlements concept straight into the runtime itself, as it knows best how the platform operates.

Some readers might also notice fundamental similarities between this and other mechanisms such as OpenBSD's pledge() , which made me wonder if entitlements belong in user space in the first place. Cormack observed that seccomp was such a "pain to work with to do complicated policies". He said that having eBPF seccomp filters would make it easier to deal with conflicts between policies and also mentioned the work done on the Checmate and Landlock security modules as interesting avenues to explore. It seems that none of those kernel mechanisms are ready for prime time, at least not to the point that Docker can use them in production. Eddequiouaq said that the proposal was open to changes and discussion so this is all work in progress at this stage. The next steps are to make a proposal to the Kubernetes community before working on an actual implementation outside of Docker.

I find the core idea of shielding users from the complicated details of container security an interesting one. It is a recurring theme in container security; we have previously discussed proposals to add container identifiers directly to the kernel, for example. Everyone knows security is sensitive and important in Kubernetes, yet doing it correctly is hard. This is a recipe for disaster, which has struck in high-profile cases recently. Hopefully having such easier and cleaner mechanisms will help users, developers, and administrators alike.

A YouTube video and slides [PDF] of the talk are available.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the event.]

Comments (8 posted)

In February, the bpfilter mechanism was first posted to the mailing lists. Bpfilter is meant to be a replacement for the current in-kernel firewall/packet-filtering code. It provides little functionality itself; instead, it creates a set of hooks that can run BPF programs to make the packet-filtering decisions. A version of that patch set has been merged into the net-next tree for 4.18. It will not be replacing any existing packet filters in its current form, but it does feature a significant change to one of its more controversial features: the new user-mode helper mechanism.

The core motivation behind bpfilter is performance. An in-kernel, general-purpose packet filter must necessarily offer a wide range of features; any given site will probably only use a small subset of those features. The result is a lot of unused code and time spent checking for whether a given feature is in use, slowing the whole thing down. A packet-filtering configuration expressed as a BPF program, instead, contains only the code needed to implement the desired policy. Once that code is translated to native code by the just-in-time compiler, it should be both compact and fast. The networking developers hope that it will be fast enough to win back some of the users who have moved to proprietary user-space filtering implementations.

If bpfilter is to replace netfilter, though, it must provide ABI compatibility so that existing configurations continue to work. To that end, the bpfilter developers intend to implement the current netfilter configuration protocol; bpfilter will accept iptables rules and compile them to BPF transparently. That compilation is not a trivial task, though, and one that could present some security challenges, so the desire is to do it in user space, but under kernel control.

To make that possible, the initial proposal included a new type of kernel module. Rather than containing kernel code, it contained a normal ELF executable that would be run as a special type of kernel thread. Using the module mechanism allowed this code to be packaged and built with the rest of the kernel; user-mode modules could also be subjected to the same signing rules. There were a number of concerns about how these modules worked, though, which led to some significant changes this time around.

For example, the user-mode helper code is no longer packaged as a module. It is, instead, a blob of code that is built into a normal kernel subsystem (which may be built into the kernel image or packaged as a module). In the sample code, the user-mode component is built as a separate program then, in a process involving "quite a bit of objcopy and Makefile magic", it is turned into an ordinary object file that can be linked into the bpfilter.ko kernel module.

Kernel code that wants to run a blob of code in user space will do so using the new helper code. That is done by calling:

int fork_usermode_blob(void *data, size_t len, struct umh_info *info);

where data points to the code to be run, and len is the length of that code in bytes. The info structure is defined as:

struct umh_info {
    struct file *pipe_to_umh;
    struct file *pipe_from_umh;
    pid_t pid;
};

Assuming the user-space process is successfully created, the kernel will place its process ID into pid . The kernel will also create a pair of pipes for communicating with the new process; they will be stored in pipe_to_umh (for writing) and pipe_from_umh (for reading). The code itself is copied into a tmpfs file and executed from there; that allows it to be paged if needed. If the built-in copy of the code is marked as "initdata" (and thus placed in a different section), the caller can free it once the helper is running.

Kernel code that creates this type of helper process must take care to clean it up when the time comes. The process ID can be used to kill the process, and the pipes need to be closed.

The bpfilter module itself, as found in 4.18, does not actually do much. It creates the helper process and can pass a couple of no-op commands to it, but there is no packet-filtering machinery in place yet. That code exists (and has been posted recently) but has evidently been held back to give the user-mode helper mechanism a cycle to stabilize. Bpfilter is thus starting small, but it may have a big impact in the end; perhaps that's why Dave Miller said "let the madness begin" as he merged the code for 4.18.

The replacement of netfilter, even if it happens as expected, will take years to play out, but we may see a number of interesting uses of the new user-mode helper mechanism before then. The kernel has just gained a way to easily sandbox code that is carrying out complex tasks and which does not need to be running in a privileged mode; it doesn't take much effort to think of other settings where this ability could be used to isolate scary code. Just be careful not to call the result a "microkernel" or people might get upset.

Comments (18 posted)

The advent of user namespaces and container technology has made it possible to extend more root-like powers to unprivileged users in a (we hope) safe way. One remaining sticking point is the mounting of filesystems, which has long been fraught with security problems. Work has been proceeding to allow such mounts for years, and it has gotten a little closer with the posting of a patch series intended for the 4.18 kernel. But, as an unrelated discussion has made clear, truly safe unprivileged filesystem mounting is still a rather distant prospect — at least, if one wants to do it in the kernel.

Attempts to make the mount operation safe for ordinary users are nothing new; LWN covered one patch set back in 2008. That work was never merged, but the effort to allow unprivileged mounts picked up in 2015, when Eric Biederman (along with others, Seth Forshee in particular) got serious about allowing user namespaces to perform filesystem mounts. The initial work was merged in 2016 for the 4.8 kernel, but it was known to not be a complete solution to the problem, so most filesystems can still only be mounted by users who are privileged in the initial namespace.

Biederman has recently posted a new patch set "wrapping up" support for unprivileged mounts. It takes care of a number of details, such as allowing the creation of device nodes on filesystems mounted in user namespaces — an action that is deemed to be safe because the kernel will not recognize device nodes on such filesystems. He clearly thinks that this feature is getting closer to being ready for more general use.

The plan is not to allow the unprivileged mounting of any filesystem, though. Only filesystem types that have been explicitly marked as being safe for mounting in this mode will be allowed. The intended use case is evidently to allow mounting of filesystems via the FUSE mechanism, meaning that the actual implementation will be running in user space. That should shield the kernel from vulnerabilities in the filesystem code itself, which turns out to be a good thing.

In a separate discussion, the "syzbot" fuzzing project recently reported a problem with the XFS filesystem; syzbot has been doing some fuzzing of on-disk data and a number of bugs have turned up as a result. In this case, though, XFS developer Dave Chinner explained that the problem would not be fixed. It is a known problem that only affects an older ("version 4") on-disk format and which can only be defended against at the cost of breaking an unknown (but large) number of otherwise working filesystems. Beyond that, XFS development is focused on the version 5 format, which has checksumming and other mechanisms that catch most metadata corruption problems.

There was an extensive discussion over whether the XFS developers are taking the right approach, but it took a bit of a diversion after Eric Sandeen complained about bugs that involve "merely mounting a crafted filesystem that in reality would never (until the heat death of the universe) corrupt itself into that state on its own". Ted Ts'o pointed out that such filesystems (and the associated crashes) can indeed come about in real life if an attacker creates one and somehow convinces the system to mount it. He named Fedora and Chrome OS as two systems that facilitate this kind of attack by automatically mounting filesystems found on removable media — USB devices, for example.

There is a certain class of user that enjoys the convenience of automatically mounted filesystems, of course. There is also the container use case, where there are good reasons for allowing unprivileged users to mount filesystems on their own. So, one might think, it is important to fix all of the bugs associated with on-disk format corruption to make this safe. Chinner has bad news for anybody who is waiting for that to happen, though:

There's little we can do to prevent people from exploiting flaws in the filesystem's on-disk format. No filesystem has robust, exhaustive verification of all it's metadata, nor is that something we can really check at runtime due to the complexity and overhead of runtime checking.

Many types of corruption can be caught with checksums and such. Other types are more subtle, though; Chinner mentioned linking important metadata blocks into an ordinary file as an example. Defending the system fully against such attacks would be difficult to do, to say the least, and would likely slow the filesystem to a crawl. That said, Chinner doesn't expect distributors like Fedora to stop mounting filesystems automatically: "They'll do that when we provide them with a safe, easy to use solution to the problem. This is our problem to solve, not blame-shift it away." That, obviously, leaves open the question of how to solve a problem that has just been described as unsolvable.

To Chinner, the answer is clear, at least in general terms: "We've learnt this lesson the hard way over and over again: don't parse untrusted input in privileged contexts". The meaning is that, if the contents of a particular filesystem image are not trusted (they come from an unprivileged user, for example), that filesystem should not be managed in kernel space. In other words, FUSE should be the mechanism of choice for any sort of unprivileged mount operation.

Ts'o protested that FUSE is "a pretty terrible security boundary" and that it lacks support for many important filesystem types. But FUSE is what we have for now, and it does move the handling of untrusted filesystems out of the kernel. The fusefs-lkl module (which seems to lack a web site of its own, but is built using the Linux kernel library project) makes any kernel-supported filesystem accessible via FUSE.

When asked (by Ts'o) about making unprivileged filesystem mounts safe, Biederman made it clear that he, too, doesn't expect most kernel filesystems to be safe to use in this mode anytime soon:

Right now my practical goal is to be able to say: "Go run your filesystem in userspace with fuse if you want stronger security guarantees." I think that will be enough to make removable media reasonably safe from privilege escalation attacks.

It would thus seem that there is a reasonably well understood path toward finally allowing unprivileged users to mount filesystems without threatening the integrity of the system as a whole. There is clearly some work yet to be done to fit all of the pieces together. Once that is done, we may finally have a solution to a problem that developers have been working on for at least a decade.

Comments (22 posted)

Suppose you have a program running on your system that you don't quite trust. Maybe it's a program submitted by a student to an automated grading system. Or maybe it's a QEMU device model running in a Xen control domain ("domain 0" or "dom0"), and you want to make sure that even if an attacker from a rogue virtual machine manages to take over the QEMU process, they can't do any further harm. There are many things you might want to do to restrict its ability to do mischief. But one thing in particular you probably want is to be able to reliably kill the process once you think it should be done. This turns out to be quite a bit trickier than you'd think.

Avoiding kill with fork

So here's our puzzle. Suppose we have a process that we've run with its own individual user ID (UID), which we want to kill. But the code in the process is currently controlled by an attacker who doesn't want it to be killed.

We obviously know the process ID (PID) of the initial process we forked, so we could just use the kill() system call:

kill(pid, 9);

So how can an attacker avoid this? It turns out to be pretty simple:

while (1) {
    if (fork())
        _exit(0);
}

This simple snippet of code will repeatedly call fork() . As you probably know, fork() returns twice: once in the existing parent process (returning the PID of the newly-created child), and once in a newly-created child process (returning 0 ). In the loop above, the parent will always call _exit() , and the child will call fork() again. The result is that the program races through the process ID space as fast as the kernel will let it. These types of programs are often called "fork bombs". [The author disagrees with this characterization, which was added by an editor late in the publication process.]

I encourage you to run the above code snippet (preferably in a virtual machine), and see what it looks like. It's not even very noticeable. Running top shows a system load of about 50% (in my virtual machine anyway), but there's not obviously any particular process contributing to that load; everything is still responsive and functional. If you didn't know about it, you might never notice it was there.

Now try killing it. You can run killall to try to kill the process by name, but it will frequently fail with "no process killed"; even when it succeeds, it often turns out that you've killed the parent process after the fork() but before the _exit() , so the rogue forking process is still going strong. Even determining whether you've managed to kill the process or not is a challenge.

The basic problem here is a race condition. What killall does is:

1. Read the list of processes, looking for ones with the specified name
2. Call kill(pid, sig) on each one found

In between 1 and each instance of 2, the kernel tasklist lock is released (since it has to return from the system call), giving the rogue process a chance to fork. Indeed, it has many chances; since the second step takes a non-negligible amount of time, by the time you manage to find the rogue process, it's likely already forked, and perhaps even exited.

It's true, if we ran killall 1000 times, the rogue process would very likely end up dead; and if we ran ps 1000 times, and found no trace of the process, we might be pretty sure that it was gone. On the other hand, that assumes that the "race" is fair, and that the attacker hasn't discovered some way of making sure that the race ends up going their way. It would be best if we didn't rely on these sorts of probabilistic calculations to clean things up.

Better mousetraps?

One thing to do, of course, would be to try to prevent the process from executing fork() in the first place. This could be done on Linux using the seccomp() call; but it's Linux-specific. (Xen, for example, wants to be able to support NetBSD and FreeBSD control domains, so it can't rely on this for correctness.) Another would be to use the setrlimit() system call to set RLIMIT_NPROC to 0 . This should, in theory, prevent the process from calling fork() (since by definition there would already be one process with its user ID running).

But RLIMIT_NPROC has had its own set of issues in the past. Setting it to 0 would also break a lot of perfectly legitimate code. Surely there must be a way to kill a process in a way that it can't evade, without relying on being able to take away fork() . Looking more closely at the kill() man page, it turns out that the pid argument can be interpreted in four possible ways:

pid > 0: the PID of a single process to kill

pid < -1: the negative of the ID of a process group (pgid) to kill

pid == 0: kill every process in my current process group

pid == -1: kill every process that I'm allowed to kill

At first glance it seems like killing by pgid might do what we want. To run our untrusted process, set the pgid and the user ID; to kill it, we call kill(-pgid, 9) .

Unfortunately, unlike the user ID, the pgid can be changed by unprivileged processes. So our attacker could simply run something like the following to avoid being killed in the same way:

while (1) {
    if (fork())
        _exit(0);
    setpgid(0, 0);
}

In this case, the child process changes its pgid to match its PID as soon as it forks, making kill(-pgid) as racy as kill(pid) .

A better mousetrap: kill -1

What about the last one — "kill every process I'm allowed to kill"? Well we obviously don't want to run that as root unless we want to nuke the entire system; we want to limit "all processes I'm allowed to kill" to the particular user ID we've given to the rogue process.

In general, processes are allowed to kill other processes with their own UID; so what about something like the following?

setuid(uid);
kill(-1, 9);

(Note that for simplicity error handling is omitted in these examples; but when playing with kill() you should certainly make sure that you did switch your UID.)

The kill() system call, when called with -1 , will loop over the entire task list, attempting to send the signal to each process except the one making the system call. The tasklist lock is held for the entire loop, so the rogue process cannot complete a fork() ; since the UIDs match, it will be killed.

Done, right? Not quite. If we simply call setuid() , then not only can we kill the rogue process, but the rogue process can also kill us:

while (1) {
    if (fork())
        _exit(0);
    kill(-1, 9);
    setpgid(0, 0);
}

If the rogue process manages to get its own kill(-1) in after we've called setuid() but before we've called kill() ourselves, we will be the ones to disappear. So to successfully kill the rogue process, we still need to win a race — something we'd rather not rely on.

A better mousetrap: exploiting asymmetry

If we want to reliably kill the other process without putting ourselves at risk of being killed, we must find an asymmetry that allows the "reaper" process to do so. If we look carefully at the kill() man page, we find:

For a process to have permission to send a signal, it must either be privileged (under Linux: have the CAP_KILL capability in the user namespace of the target process), or the real or effective user ID of the sending process must equal the real or saved set-user-ID of the target process.

So there is an asymmetry. Each process has an effective UID ( euid ), real UID ( ruid ), and saved UID ( suid ). For process A to kill process B, A's ruid or euid must match one of B's ruid or suid .

When we started our target process, we set all of its UIDs to a specific value ( target_uid ). Can we construct a <euid, ruid, suid> tuple for our "reaper" process to use that will allow it to kill the rogue process, and no other processes, but not be able to be killed by the rogue process?

It turns out that we can. If we create a new reaper_uid , and set its <euid, ruid, suid> to <target_uid, reaper_uid, X> (where X can be anything as long as it's not target_uid ), then:

The reaper process can kill the target process, since its effective UID is equal to the target process's real UID.

But the target process cannot kill the reaper, since neither its real nor its effective UID matches the real or saved UID of the reaper process.

So the following code will safely kill all processes of target_uid in a race-free way:

setresuid(reaper_uid, target_uid, reaper_uid);
kill(-1, 9);

Note that this reaper_uid must have no other running processes when we call kill() , or they will be killed as well. In practice this means either setting aside a single reaper_uid (and using a lock to make sure only one reaper process runs at a time) or having a separate reaper_uid per target_uid .

Proof-of-concept code for both the rogue process and the reaper process can be found in this GitHub repository.

No POSIX-compliant mousetraps?

The setresuid() system call is implemented by both Linux and FreeBSD. It is not currently implemented by NetBSD, but implementing it seems like a pretty straightforward exercise (and certainly a lot simpler than implementing seccomp()). NetBSD does implement RLIMIT_NPROC, which should also be helpful in preventing our process from executing fork().

On the other hand, neither setresuid() nor RLIMIT_NPROC is in the current POSIX specification. It seems impossible to get a process to have the required tuple using only the current POSIX interfaces (namely setuid() and setreuid(), without recourse to setresuid() or Linux's CAP_SETUID); the assumption seems to be that euid must always be set to either ruid or suid. So there would seem to be no way within that specification to safely prevent a potentially rogue process from using fork() to evade kill().

Acknowledgments

Thanks to Ian Jackson for doing the analysis to discover the appropriate <euid, ruid, suid> tuple, as well as confirming my assessment that there is no way to set that tuple using current POSIX interfaces.

Comments (50 posted)

Stratis is a new local storage-management solution for Linux. It can be compared to ZFS, Btrfs, or LVM. Its focus is on simplicity of concepts and ease of use, while giving users access to advanced storage features. Internally, Stratis's implementation favors tight integration of existing components over the fully integrated, in-kernel approach that ZFS and Btrfs use. This has benefits and drawbacks for Stratis, but it also greatly decreases the overall time needed to develop a useful and stable initial version, which can then be a base for further improvement in later versions. As the Stratis team lead at Red Hat, I am hoping to raise the profile of the project a bit so that more people in our community will have it as an option.

Why make Stratis instead of working on ZFS or Btrfs?

A version of ZFS, originally developed by Sun Microsystems for Solaris (now owned by Oracle), was forked for use on other platforms, including Linux (OpenZFS). However, its CDDL-licensed code cannot be merged into the GPLv2-licensed Linux source tree. Whether the CDDL and GPLv2 are truly incompatible is a continuing subject for debate, but the uncertainty is enough to make some enterprise Linux vendors unwilling to adopt and support it.

Btrfs is also well-established, and has no licensing issues. It was the anointed Chosen One for years (and years) for many people, but it is a large project that duplicates existing functionality, with a potentially high cost to complete and support over the long term.

Red Hat ultimately made a choice to instead explore the proposed Stratis solution.

How Stratis is different

Both ZFS and Btrfs can be called "volume-managing filesystems" (VMFs). These combine the filesystem and volume-management layers into one. VMFs focus on managing a pool of storage created from one or more block devices, and allowing the creation of multiple filesystems whose data resides in the pool. This model of management has proven attractive to users, since it makes storage easier to use not only for basic tasks, but also for more advanced features that would otherwise be challenging to set up.

Stratis is also a VMF, but unlike the others, it is not implemented entirely as an in-kernel filesystem. Instead, Stratis is a daemon that manages existing layers of functionality in Linux — the device-mapper (DM) subsystem and the XFS non-VMF filesystem — to achieve a similar result. While these components are not part of Stratis per se (and can indeed be used directly or via LVM), Stratis takes on the entire responsibility for configuring, maintaining, and monitoring the pool's layers on behalf of the user.

Although there are drawbacks to forgoing total integration, there are also benefits. The primary benefit is that Stratis doesn't need to independently develop and debug the many features a VMF is expected to have. It may also be easier to incorporate new capabilities quickly as they become available. Finally, as a new consumer of these components, Stratis can participate in their common upstream development, sharing mutual benefit with the components' other users.

In addition to this main implementation difference, Stratis also makes some different design choices, based on the current state of technology. First, the widespread use of SSDs minimizes the importance of optimizing for access times on rotational media. If performance is important, SSDs should be used either as primary storage, or as a caching tier for the spinning disks. Assuming this is the case lets Stratis focus more on other requirements in the data storage tier. Second, embedded use and automated deployments are now the norm. A new implementation should include an API from the start, so other programs can also configure it easily. Lastly, storage is starting to become commoditized: big enough for most uses, and perhaps no longer something users want to actively manage. Stratis should account for this by being easy to use. Many people will only interact with Stratis when a problem arises. Poor usability feels even worse when the user is responding to a rare storage alert, and also may be worried about losing data.

Implementation

Stratis is implemented as a user-space daemon, written in the Rust language. It uses D-Bus to present a language-neutral API, and also includes a command-line tool written in Python. The API and command-line interface are focused around the three concepts that a user must be familiar with — blockdevs, pools, and filesystems. A pool is made up of one or more blockdevs (block devices, such as a disk or a disk partition), and then once a pool is created, one or more filesystems can be created from the pool. While the pool has a total size, each filesystem does not have a fixed size.

Under the hood

Although how the pool works internally is not supposed to be a concern of the user, let's look at how the pool is constructed.

Starting from the bottom of the diagram on the "internal view" side, the layers that manage block devices and add value to them are called the Backstore, which is in turn divided into data and cache tiers. Stratis 1.0 will support a basic set of layers; additional optional layers that add more capabilities are planned for later integration.

The lowest layer of the data tier is the blockdev layer, which is responsible for initializing and maintaining on-disk metadata regions that are created when a block device is added to a pool. Above that can sit additional layers, such as data-corruption detection (dm-integrity) and data redundancy (dm-raid), which, when used in tandem, can also correct data corruption. This is also where support for compression and deduplication, via the recently open-sourced (but not yet upstream) dm-vdo target, would sit. Since these reduce the total available capacity of the pool and may affect performance, their use will be configurable at pool-creation time.

Above this is the cache tier. This tier manages its own set of higher-performance block devices, to act as a non-volatile cache for the data tier. It uses the dm-cache target, but its internal management of blockdevs used for cache is similar to the data tier's management of blockdevs.

On top of the Backstore sits the Thinpool, which encompasses the data and metadata required for the thin-provisioned storage pool that individual filesystems are created from. Using dm-thin, Stratis creates thin volumes with a large virtual size and formats them with XFS. Since storage blocks are only used as needed, the actual size of a filesystem grows as data is stored on it. If this data's size approaches the filesystem's virtual size, Stratis grows the thin volume and the filesystem automatically. Stratis 1.0 will also periodically reclaim no-longer-used space from filesystems using fstrim , so it can be reused by the Thinpool.

Along with setting up the pool, Stratis continually monitors and maintains it. This includes watching for DM events such as the Thinpool approaching capacity, as well as udev events, for the possible arrival of new blockdevs. Finally, Stratis responds to incoming calls to its D-Bus API. Monitoring is critical because thin-provisioned storage is sensitive to running out of backing storage, and relieving this condition requires intervention from the user, either by adding more storage to the pool or by reducing the total data stored.

Challenges so far

Since Stratis reuses existing kernel components, the Stratis development team's two primary challenges have been determining exactly how to use them, and handling the issues that arise when components are used in new ways. For example, in implementing the cache tier using dm-cache, the team had to figure out how to use the DM target so that the cache device could be extended if new storage was added. Another example: snapshotting XFS on a thin volume is fast, but giving the snapshot a new UUID so it can be mounted causes the XFS log to be cleared, which increases the amount of data written.

Both of these were development hurdles, but also mostly expected, given the chosen approach. In the future, when Stratis has proven its worth and has more users and contributors, Stratis developers could also work more with upstream projects to implement and test features that Stratis could then support.

Current Status and How to Get Involved

Stratis version 0.5 was recently released; it added support for snapshots and the cache tier, and is available now for early testing in Fedora 28. Stratis 1.0, targeted for release in September 2018, will be the first version suitable for users; its on-disk metadata format will be supported by future versions.

Stratis started as a Red Hat engineering project, but has started to attract community involvement, and hopes to attract more. If you're interested in development, testing, or offering other feedback on Stratis, please join us. For learning more about Stratis's current technical details, check out our Design document [PDF] on the web site. There is also a development mailing list.

Development is on GitHub, both for the daemon and the command-line tool. This is also where bugs should be filed. IRC users will find the team on the Freenode network, on channel #stratis-storage. For periodic news, follow StratisStorage on Twitter.

Conclusion

Stratis is a new approach to constructing a volume-managing filesystem whose primary innovation is — ironically — reusing existing components. This accelerates its development timeline, at the cost of foregoing the potential benefits of committing "rampant layering violations". Do the benefits ascribed to ZFS and Btrfs require this integration-of-implementation approach, or are these benefits also possible with integration at a slightly higher level? Stratis aims to answer this question, with the goal of providing a useful new option for local storage management to the Linux community.

Comments (33 posted)

The second Operating-System-Directed Power-Management (OSPM18) Summit took place at the ReTiS Lab of the Scuola Superiore Sant'Anna in Pisa between April 16 and April 18, 2018. Like last year, the summit was organized as a collection of collaborative sessions focused on trying to improve how operating-system-directed power management and the kernel's task scheduler work together to achieve the goal of reducing energy consumption while still meeting performance and latency requirements.

This extensive summary of the event was written by Juri Lelli, Rafael Wysocki, Dietmar Eggemann, Vincent Guittot, Ulf Hansson, Daniel Lezcano, Dhaval Giani, Georgi Djakov, and Joel Fernandes. Editing by Jonathan Corbet.

The count of summit attendees almost doubled from last year (almost maxing out venue limits). The summit again brought together people from academia, open-source maintainership, and industry. The relaxed atmosphere of the venue and the manageable group of attendees allowed for intense debate and face-to-face discussions about current problems. Several proposals or proof-of-concept solutions were presented during the sessions.

The following sections, clustered by topic area, summarize the discussions of the individual sessions. Videos of the talks, along with the slides used, can all be found on the OSPM Summit web site.

Software Frameworks

HERCULES: a realtime architecture for low-power embedded systems. Tomasz Kloda presented an overview of the HERCULES project, which UNIMORE is coordinating (the other partners are CTU Prague, ETH Zurich, Evidence Srl, AIRBUS, PITOM, and Magneti Marelli). The project is studying how to use commercial, off-the-shelf (COTS), multi-core platforms for realtime embedded systems (such as medical devices or autonomous-driving systems). The biggest issue the project faces is that COTS systems are designed with little or no attention to worst-case behavior, which makes guaranteeing timeliness difficult (if not unachievable). Tomasz’s group is, in particular, looking at how to avoid memory contention, which can cause unexpected latencies. To verify the proposed solution, a Jetson TX1 board was chosen as the reference COTS platform and the ERIKA Enterprise RTOS was ported to it. To partition hardware resources, the team decided to use the Jailhouse hypervisor (which has been ported as well). The proposed solution is an implementation of Pellizzoni’s Predictable Execution Model [PDF]; it uses both cache coloring and preventive cache invalidation to achieve the desired determinism.

SCHED_DEADLINE: The Android Audio Pipeline Case Study. Alessio Balsini presented the work he has been doing at ARM as an intern. The question he tried to answer is whether the deadline scheduler (SCHED_DEADLINE) might help improve the performance of the Android audio pipeline. He started his talk with an introduction to the pipeline, describing how sound gets from an app to the speaker in Android. His work focused on the Fast Mixer, which is used to service applications with strict latency requirements, which include:

Power efficiency: always running at maximum frequency is bad for battery life, so the lowest sufficient clock frequency has to be selected.

Low latency: smaller buffer sizes give lower latencies, but might introduce glitches.

Reactivity to workload changes: virtual instruments, for example, where a variable number of notes might need to be synthesized at any point in time.

He reviewed current and alternative solutions, then stated that using SCHED_DEADLINE with an adaptive bandwidth mechanism, implemented by the user-space runtime system, showed the best results. While the periodicity and deadlines of the tasks composing the pipeline are easy to get right, correctly dimensioning the requested CPU time can be tricky. For this reason, an adaptive bandwidth mechanism was implemented that varies the runtime using information from running statistics and application hints (e.g., the number of notes the virtual instrument will have to synthesize in the next cycle).

Workload consolidation by dynamic adaptation. Rohit Jain and Dhaval Giani started the discussion by introducing the issues that the Oracle Database multi-tenancy option faces in trying to partition hardware resources: consolidation of "virtual databases" on common hardware suffers from performance interference. To solve this problem, the presenters stated that the use of both (work-conserving) control-group regulation and exclusive cpusets (to ensure cache affinity) would be needed. It would be best to move toward a more dynamic solution incorporating the good properties of the two mechanisms: share CPUs opportunistically and "re-home" tasks to assigned CPUs when the load increases.

The performance results of a solution based on modifying sched_setaffinity() to accept a "soft affinity" flag were then presented; they showed a promising trend, but were not ideal. The discussion focused on alternative ways to fix the issues; the group suggested that Rohit and Dhaval use (and possibly improve) automatic NUMA balancing, which should do the job already (in conjunction with mbind() ). Rohit and Dhaval will go back and do their experiments again to see if this suggestion would indeed help their case.

Why Android phones don't run mainline Linux (yet). Todd Kjos kindly agreed to cover Chris Redpath’s discussion slot about why Android phones can’t run mainline Linux kernels. Using a set of slides from Google Bootcamp 2018, Todd started the session highlighting the fact that, since last year, the Android Common Kernel is a properly maintained (upstream merges, CI testing, etc.) downstream of the long-term support (LTS) kernels. There are, however, three classes of out-of-tree patches that have to be maintained in Android Common:

Android-specific features, such as the interactive cpufreq governor, which is no longer actually used since Android has switched to schedutil.

Features that were proposed for upstream, but were rejected (they are going to be reevaluated later this year).

Features that are ready for Android, but still under discussion for mainline inclusion (the energy-aware scheduler — EAS — for example) and are expected to be merged relatively soon.

A diffstat-like slide, relative to the long-term support (LTS) kernels, was then presented; it showed that the biggest downstream feature is EAS, but the hope is that next year this figure will change as more and more EAS pieces are merged. Networking comes next, followed by filesystem and audio-related changes, and several additional bits and pieces (architecture support, drivers, debugging, and emulators). But these out-of-tree patches aren't the main inhibitor to running mainline Linux on a phone.

The branching strategy was then presented. At the top of the hierarchy is the LTS kernel, which merges down into the Android Common Kernel (this year's generation of phones runs 4.9). Silicon vendors create their kernels from the Android Common Kernel when starting a development cycle for a new device, then put all SoC-specific code (plus debugging and instrumentation) on top of it. This process takes about one year; the result is then used by partners and OEMs, who contribute additional changes before releasing the resulting kernel, running on phones, another six months later. Considering that the normal support length for an LTS kernel is two years, we are left with only six months of support after the initial phones based on an SoC hit the market. There has been an effort to evangelize the idea that SoC/OEM patches have to be easily rebasable on top of a newer LTS; however, there is still a lot of work to be done to accomplish this goal. The general situation has improved a lot in the last couple of years with efforts like Project Treble.

Energy-aware realtime scheduling: Parallel or sequential? From analysis to implementation. Houssam-Eddine Zahaf presented the work he has been doing (while working at the Université de Lille) on energy-awareness tradeoffs in realtime systems. He introduced the topic by giving the audience two examples of common pitfalls (and possible solutions) when running realtime tasks on SMP platforms or on platforms that implement voltage- and frequency-scaling techniques. The former can leverage task migrations to achieve higher system utilization (while still meeting deadlines); the latter can save energy (on certain hardware) by executing workloads at a lower clock frequency.

He then moved on by stating that the very first thing to do when trying to intelligently save energy is to derive an energy model for the platform of interest. This model might easily get quite complex, however, as it depends in general on the specific application, memory usage, type of operations performed, and task composition. He concluded by showing details of a possible formalization of the problem that accepts task parameters, the desired parallelization level, and an energy model as inputs, and provides thread allocation across the different cores and frequency-selection hints as output.

FAIR and RT Scheduler

What is still missing in load tracking? Vincent Guittot presented the evolution of the load tracking mechanism in the Linux scheduler and what should be the next steps. The session was split into three parts. The first part showed the improvements made in scheduler load tracking since last OSPM summit and listed the features that have already been merged. The audience agreed that new load tracking was far more accurate, stable, and helpful in scheduler load balancing.

Vincent then described what still remains to be fixed, like the case of realtime tasks preempting ordinary tasks. There is also a desire to remove the current rt_avg mechanism and to replace it with the new load-tracking information. Based on this use case, the discussion extended to the definition of CPU utilization and what is needed to get a complete view. We already track ordinary task utilization, and we had seen with the previous use case that tracking realtime utilization is beneficial. The audience agreed that we should extend that to account for interrupt pressure and SCHED_DEADLINE usage to get a complete view of the utilization.

Then, we discussed the load-tracking mechanism; the current implementation is simple and efficient, but has some drawbacks, including capping the value to the current capacity of the CPU, which makes the utilization not fully invariant, as shown in some examples. After describing the desired load-tracking behavior, Vincent raised the question of what we really want to track. It’s not really the running time (even after the latter has been scaled); instead, we are more interested in the amount of work that has been executed (how many instructions). Some people from Intel said that they have an implementation of scaling invariance, but that it has problems with the current scaling-invariance implementation: when using APERF/MPERF, arch_scale_freq is often lower than 1024 (the maximum value), which caps the current utilization and decreases the targeted frequency. As a result, the utilization enters a decreasing feedback loop. Tracking the work done and removing the capping effect should help to fix this kind of problem. ARM developers mentioned that they would be interested too, because they have some new performance counters that could be used for tracking the utilization of CPUs.

The last part of the session discussed the performance impact of load tracking. A short and hackish test showed an impact of 5% on the sched pipe benchmark result. The audience agreed that 5% is not negligible, but that this overhead buys a benefit in the form of better load balancing. The result was presented to open the discussion of whether this is a real problem and whether we should look at optimizing PELT.

Status update on Utilization Clamping. Patrick Bellasi gave a status update about his proposal for implementing utilization clamping. The proposal is not new and has been extensively discussed at last year's OSPM and Linux Plumbers Conference; however the implementation has changed considerably with respect to Patrick’s first shot at it.

The discussion started by recalling that there exist systems that are managed by informed runtimes, Android and ChromeOS in particular, that have a lot of information about which applications are running and how to properly tune the system to best suit their needs by, for example, trading off performance against energy consumption. Transient optimizations are also possible, depending on the state an application is in: whether it is running in the foreground or background on Android systems, for example.

Considering that task utilization already drives clock-frequency selection in mainline kernels via the schedutil cpufreq governor, and that EAS further extends its usage to drive load-balancing decisions, the ability to clamp that signal from user space or middleware is thought to be a powerful means of performing fine-grained dynamic optimizations.

The latest util_clamp implementation provides both a per-task and a control-group interface. The per-task API extends the sched_attr structure with a couple of parameters ( util_min and util_max ) and can be used via the existing sched_setattr() syscall. Scheduler maintainer Peter Zijlstra was positive about the viability of such an extension.

The control-group interface is still an extension of the CPU controller, as in the previous proposals, but it’s now considered a secondary interface, as requested by Tejun Heo. Since Tejun was not fully convinced by the names proposed for the new attributes, it was suggested that Patrick mention all the possible usages and link this proposal to some possible future uses, such as better support for task placement. A brief discussion also concerned how to properly aggregate clamped utilization values coming from different scheduling classes. The realtime scheduling class might benefit from such a solution, especially on battery-constrained devices. The general consensus was that aggregation should at least be consistent with how it is going to be used by the load balancer.

arm64 topology representation. Morten Rasmussen started his session by saying that the discussion he wanted to have was driven by challenges and issues he has had to deal with on ARM64 platforms. Before getting into the details, however, he gave a gentle introduction to the current Linux topology ("scheduling domains") setup code and described the hierarchical nature of this topology representation. Each topology level has flags associated with it, and this information influences scheduler decisions. This all works well for ARM64 mobile systems.

Problems arise with ARM64 servers, though. Such systems have lots of clusters in a single package and, without changes to the topology setup, task wakeups will be confined to single clusters instead of being potentially spread to the package. The good news is that packages have a shared L3 cache, so changing how flags are attributed to the domains might easily fix the problems. Even for systems that don’t come with a physically shared L3 cache, it might still be worthwhile to balance across the whole package, as the cache-coherence interconnect makes data sharing fast. So, setting or not setting the desired flags seems to be up to where one wants it to "hurt" (as commented by Peter Zijlstra). In any case, Peter was adamant in stating that "if there is an L3, then the multi-core scheduling-domain level must span the L3".

A discussion followed about how load balancing is using the flags at the moment and what kind of topology setup is used to build the hierarchy. At this point Morten noted that there is one addition for the ARM64 world based on PPTT ACPI tables (currently driven towards mainline adoption by Jeremy Linton): a tree representation of caches and associated flags. The end goal in this case would be to end up with the same topology if one describes the same platform either using PPTT or a device tree.

Whichever form the final solution takes, it was noticed that being backward compatible might be important. Having some sort of flag for deciding whether to go "the old way" or to adopt the new solution(s) might be worthwhile to avoid any sort of backward-compatibility problems.

EAS in Mainline: where we are. Quentin Perret and Dietmar Eggemann presented the latest energy-aware scheduling patch set that has been posted on the linux-kernel list. A previous version, sent out a couple of weeks earlier, was already covered by LWN. There was agreement that starting with the simplest energy model, representing only the active power of CPUs at all available P-states, is the right thing to do.

The design decision to only enable energy-aware scheduling on systems with asymmetric CPU capacity seems to be correct. Another important design factor is that the energy model can assume that all CPUs of a frequency domain have the same micro-architecture. Rafael Wysocki and Peter Zijlstra required that the energy model must be an independent framework which can be used on all architectures and by multiple users (such as the scheduler, cpufreq or a thermal cooling device). Therefore the current coupling of the energy model with the PM_OPP library has to be abandoned, since there are platforms which don’t expose their P-states.

The fact that the energy-aware scheduler iterates over every CPU in the system is acceptable as long as the implementation will warn or bail in case the number of CPUs and frequency domains exceeds a certain threshold (probably eight or 16).

The idea that energy-aware scheduling only makes sense if the system is not over-utilized, along with the corresponding implementation, found agreement, although it is not clear whether the current per-scheduling-domain approach is really beneficial or whether the simpler system-wide implementation should be preferred.

Power-aware and capacity-aware migrations for realtime tasks. Luca Abeni made use of the time allocated for his slot to review and discuss the main issues he found while trying to make SCHED_DEADLINE aware of CPU compute capacities, something similar to what the mainline completely fair scheduling (CFS) load balancer already does. After giving the audience a brief introduction to SMP scheduling on realtime systems, he stated that SMP global invariants might be wrong when CPUs don’t have the same computational power (capacity). This issue can come up on big.LITTLE systems or traditional SMP systems with CPUs running at different clock frequencies.

He aims to modify the SCHED_DEADLINE code that controls task migrations to make it aware of both CPU-capacity and operating-frequency differences. Existing theory is, however, of little help: the global earliest-deadline-first scheduling algorithm doesn't take CPU utilization into account, and theoretical algorithms may lead to an unacceptable number of task migrations. He thus decided to follow a more practical approach (leaving theory as a second step). The idea would be to modify the cpudl_find() function (responsible for finding a potential destination when migrating a task) to reject CPUs without enough spare capacity. He has already implemented a further simplification of the solution that, for the time being, looks for completely free CPUs with enough capacity (when running at maximum frequency) to accommodate a task's needs.

At this point an interesting discussion started, focusing on which kinds of parameters should be considered when evaluating a task's needs. One option would be static utilization (given at admission-control time), which never changes but might be pessimistic; the alternative is dynamic utilization, which takes into account a task's execution history before the migration decision happens. A decision has not been made, and discussion has been postponed until actual patches hit the list, but a point worth noting from the discussion is that rq_clock() can be considered to be in sync across all CPUs and thus could potentially be used to implement the latter (dynamic) solution.

Towards a constant time search for wake up fast path. Dhaval Giani and Rohit Jain discussed a task wakeup fast-path optimization to make it more scalable. The goal is to make the time needed to find a CPU to run a new task on independent of the system size (or O(1) ideally). The approach explored so far is based on counting the threads running on each CPU core to pick the least loaded one. Alternatively, a limit on the search space could be introduced that would make it a constant-time search. The recommendation from the audience was to post the alternative patch for review and run some tests to compare the two solutions.

System Frameworks

Towards a tighter integration of the system-wide and runtime PM of devices. Rafael Wysocki started his talk by giving the audience an introduction to device power management in Linux. In order of complexity, he reviewed working versus sleeping system states, device runtime power management, and the system suspend, hibernation, and restore control flows. He then moved on to look at how the power-management code is organized; it consists of several layers of software including devices (with drivers), bus types, and power-management domains. This complexity is needed to make devices work on different bus types and across power-management-domain topologies, implementing power management on different platforms while avoiding code duplication.

While most cases are already handled correctly, some questions still remain open:

Can devices in runtime suspend be left suspended during system suspend (or hibernation)?

Can devices be left in suspend during system resume then be managed by runtime power management in the working state?

Can runtime power-management callbacks be used for system suspend and resume and hibernation?

After presenting the existing solutions, previous attempts, and new ideas for these problems, he asked the audience for further ideas. Peter Zijlstra proposed to implement a "unified" state machine to reduce complexity, which might be considered as a solution even if it looks like a difficult one to get right. A need to support new bus standards (e.g., mesh topologies) was also mentioned and discussed.

Integration of system-wide and runtime device power management: what are the requirements for a common solution? This session, led by Ulf Hansson, was a continuation of Rafael’s session; the first part focused on the issues related to device wakeup handling. Generally, device wakeup is about powering up the device to handle external events when the device is in a low-power state, the entire system is in a sleep state, or both (in the majority of cases, devices are in low-power states if the system is in a sleep state). That can be done in a few different ways: through in-band device interrupts, via a special signal line or GPIO pin (out of band), or through a standard wakeup signaling means like PME# in PCI (also regarded as in-band). There are devices where two or more methods are available and each of them may require a different configuration. Moreover, the set of configurations in which a device can signal wakeup may be limited (for example, in-band wakeup signaling may not be available in all of the possible low-power states of the device) and that may depend on the physical design of the board (for example, it may depend on whether or not the device's wakeup GPIO line is present).

The runtime power-management framework assumes that device wakeup will be set up for all devices in runtime suspend. The device driver and middle-layer code (e.g. bus type or power-management domain) involved are then expected to figure out the most suitable configuration to put the device into on runtime suspend. For system sleep states, there is another bit of information to take into account because user space may disallow devices from waking up the entire system, via sysfs. Still, all of that doesn't tell the driver and the middle layer which wakeup signaling method is to be used and which methods are actually viable.

On systems with more complex platform firmware, like ACPI or PSCI, this mostly is handled by that firmware, or at least the firmware can tell the kernel what to do. However, on systems relying on a device tree only, there is no common way to describe the possible device wakeup settings. The response to that from the session audience was that this seems to be a device-tree issue and it should be addressed by creating appropriate bindings for device wakeup. Still, though, the kernel is currently missing a common framework to use the firmware-provided information on device wakeup in a generic way, independent of what the platform firmware is. Whether or not having such a framework would be useful and to what extent is an open question at this time.

The second part of Ulf's session focused on the generic power domains (genpd) framework and the situations in which it would be useful to put devices into multiple domains at the same time. There is a design limitation in the driver core by which a device can only be part of one power-management domain, which is related to the way device callbacks are invoked by the power-management core. The design of genpd assumes that the more complex cases should be covered by domain hierarchies. That does not seem to be sufficient to cover some use cases, however, and some ideas on addressing them are floating around.

Notifying devices of performance state changes of their master PM domain. Vincent Guittot led this session on behalf of Viresh Kumar, who was not able to attend the summit. The goal was to discuss the prospect of adding a notification mechanism for when the performance state of a genpd changes, in order to optimize a device's resource configuration (its frequency, for example). Vincent presented the use case that raised the need for such a mechanism: a DynamIQ system with shared voltage between the DynamIQ shared unit and some cores. He also mentioned that DynamIQ is not the only configuration that can take advantage of this mechanism, since shared voltage rails are common on embedded SoCs. He wanted to know where the best place to implement it would be: directly in genpd, or in a core framework so that the mechanism could be extended to other features. The feature was well received by the audience. The best place for it was discussed, and some people mentioned other platforms where the GPU and CPU cores share a voltage domain. It was decided that an implementation in genpd would be the best starting point; extension to other frameworks can be considered later, once a real use case arises.
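The notification mechanism under discussion follows the familiar notifier-chain pattern: consumers register a callback with the domain and are invoked on every performance-state transition. The toy class below sketches that pattern; the names are invented for illustration and bear no relation to the eventual genpd API.

```python
class PerfDomain:
    """Toy performance-domain notifier: listeners register a
    callback and are invoked whenever the domain's performance
    state changes, receiving the old and new states."""
    def __init__(self):
        self.state = 0
        self._listeners = []

    def register(self, cb):
        self._listeners.append(cb)

    def set_state(self, new):
        old, self.state = self.state, new
        for cb in self._listeners:
            cb(old, new)   # e.g. a device re-tuning its frequency

events = []
pd = PerfDomain()
pd.register(lambda old, new: events.append((old, new)))
pd.set_state(3)
print(events)   # → [(0, 3)]
```

In the shared-voltage use case, a device's callback would react to a domain voltage raise by reconfiguring its own clocks to exploit (or tolerate) the new operating point.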

CPU Idle Loop Ordering Problem. Rafael Wysocki and Thomas Ilsche discussed problems related to the ordering of the CPU idle loop in Linux and the solutions that were merged into Linux 4.17-rc1. Rafael started the session with a high-level introduction to CPU idle states and the CPU idle-time management code in the kernel. That code is executed by logical CPUs when there are no tasks to run; it is responsible for putting the CPUs into an idle state in which they draw much less power. If the CPUs support idle states, they will invoke the CPU idle governor to select an appropriate idle state and the CPU idle driver to make the CPU actually enter that state. The CPU will stay in the idle state until it is woken up by an interrupt. This sounds straightforward enough, but the idle-state selection in the cpuidle governor is based on predicting the time the CPU will be idle (idle duration) and that is not deterministic, which leads to problems.

Next, Thomas presented some measurement results from his laboratory, clearly showing that CPUs might be put into insufficiently deep idle states by the CPU idle-time management code and might stay in those states for too long, which was related to the non-deterministic nature of the governor's idle-duration prediction. The problem was only triggered by specific task-activity patterns that caused the idle governor to mispredict the idle duration, but it happened in practice and could be reproduced on demand.

Rafael took over at this point and went into some details on how the CPU idle loop works. He said that, before the changes made in Linux 4.17-rc1, the idle loop would try to stop the scheduler tick before invoking the cpuidle governor. That made the governor’s job simpler, because the next timer event time was known after that and the governor needed to know that time to make the idle duration prediction, but it also was problematic if the predicted idle duration turned out to be shorter than the scheduler tick period. In that case either the tick did not need to be stopped (accurate idle duration prediction), so the overhead related to stopping it was unnecessary, or the idle state selected for the CPU would be too shallow (misprediction of idle duration) and the CPU would stay in it for too long.

The ordering of the idle loop was changed in 4.17-rc1 so that the decision on whether or not to stop the scheduler tick is made after invoking the idle governor to select the idle state for the CPU. This way, if the idle duration predicted by the governor is shorter than the scheduler tick period, the tick need not be stopped and the overhead related to stopping it is avoided. Moreover, if the predicted idle duration turns out to be too short, in which case the CPU might have gone into a deeper idle state, the CPU will be woken up by the scheduler tick timer, so the time it can spend in an insufficiently deep idle state is now limited by the length of the tick period.

Of course, there are a few complications with this approach. For example, the idle governor generally needs to know the next timer-event time to predict the idle duration, so it needs to be told when the next timer event will happen in two different cases, depending on whether or not the scheduler tick is stopped. It needs to take that information into account and, in addition to the idle-state selection, it has to give its caller (the idle loop) a hint on whether or not to stop the scheduler tick. Also, the governor needs to treat the cases in which the CPU was woken up by the scheduler tick in a special way when collecting idle-duration statistics (used for the idle-duration prediction going forward), because in at least some of these cases the CPU should have been sleeping much longer than it actually did (due to the tick wakeup). Fortunately, all of those complications could be taken care of.
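The reordered decision can be summarized in a few lines: pick the deepest state whose target residency fits the predicted idle duration, then hint whether stopping the tick is worthwhile. This is a deliberately simplified Python sketch of the logic described above; real governors weigh exit latency, measured statistics, and QoS constraints as well.

```python
def select_idle_state(states, predicted_us, tick_us):
    """Pick the deepest idle state whose target residency fits the
    predicted idle duration, and hint whether the scheduler tick
    should be stopped. 'states' is a list of
    (name, target_residency_us) pairs, shallowest first."""
    chosen = states[0]                 # fall back to the shallowest state
    for s in states:
        if s[1] <= predicted_us:
            chosen = s                 # deeper state still pays off
    stop_tick = predicted_us >= tick_us
    return chosen[0], stop_tick

states = [("C1", 2), ("C3", 100), ("C6", 600)]
print(select_idle_state(states, predicted_us=150, tick_us=1000))
# → ('C3', False): the tick keeps running, so a misprediction costs
#   at most one tick period in a too-shallow state
```

The second return value captures the key 4.17 change: when the predicted idle duration is shorter than the tick period, the tick is left running, both saving the stop/restart overhead and bounding the time spent in a mispredicted state.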

The session concluded with a presentation of some test data to illustrate the impact of the idle loop changes. First, Rafael showed a number of graphs from the Intel Open Source Technology Center Server Power Lab demonstrating significant improvements in idle power consumption on several different server systems with Intel CPUs. Graphs produced by Doug Smythies during his testing of the kernel changes were also shown, illustrating significant reduction of power draw with no performance impact in some non-idle workloads. Next, Thomas showed his results demonstrating that he was not able to reproduce the original problem described previously with the modified kernel.

Finally, Rafael said that in his opinion the work on the CPU idle loop was a good example of effective collaboration between the code developer (himself), reviewers (Peter Zijlstra and Frederic Weisbecker) and testers (Thomas Ilsche and Doug Smythies) allowing a sensitive part of the kernel to be significantly improved in a relatively short time (several weeks).

Do more with less. Daniel Lezcano led a session on the increasingly fast evolution of the SoC market. With vendors developing ever more powerful hardware, thermal constraints, with large temperature changes and fast transitions, become challenging. The presentation introduced a new passive cooling device, complementing the existing one, that works by injecting idle cycles of a fixed duration at a variable period. By synchronizing threads to force the idle duration, based on the play_idle() API, the kernel is able to power down the CPUs belonging to the same cluster, as well as the cluster component itself. Even if that adds latency to the system, it has the benefit of reducing static leakage.

The presentation showed how the running duration is computed relative to the idle duration, then presented experiments confirming the theory. The third part of the presentation described an improvement to idle injection: combining the existing cooling device, which changes the operating performance points (OPPs), with idle-injection cycles until the capacity equivalent of the next lower OPP is reached.
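The run/idle arithmetic is simple to state. If the CPU runs at an OPP with capacity C_high and the goal is an average delivered capacity equal to a lower OPP's C_low, the fraction of each injection period spent idle must be 1 − C_low/C_high. The helper below is an illustrative sketch of that relation, not code from the proposed driver.

```python
def idle_duration_us(period_us, cap_high, cap_low):
    """Idle time per injection period so that the average delivered
    capacity at the high OPP equals the lower OPP's capacity:
    idle = period * (1 - cap_low / cap_high)."""
    return period_us * (1 - cap_low / cap_high)

# Emulate a 600-capacity OPP while running at a 1000-capacity OPP,
# with a 10 ms injection period:
print(idle_duration_us(10_000, 1000, 600))   # → 4000.0 µs idle per period
```

Beyond this break-even point, further idle injection cools more than simply dropping to the lower OPP would, which is what makes the combined approach attractive.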

The feedback from the audience was:

The existing CPU cooling device was initially planned to be a combination cooling device, involving both the CPU frequency governor and cpuidle. There is some skepticism about such a combination device: it can be interesting when there is a small number of OPPs with big gaps between them, but nowadays that is no longer the common case.

Concerning idle injection, there is no reluctance to merge it. However, the code must be organized differently. Intel has a power-clamping driver doing roughly the same thing, so Rafael Wysocki (the kernel's power-management maintainer) would like to add an idle-injection framework in the drivers/powercap directory and propose an API to be used by both the intel_powerclamp driver and the CPU idle cooling device.

Eduardo Valentin (maintainer of the SoC thermal framework) and Javi Merino (co-maintainer of the CPU cooling device) think that, in any case, idle injection makes sense because, if the existing CPU-cooling device is unable to mitigate the temperature, there is no other way to stop the temperature increase. The more options we have, the better.

Next steps in CPU cluster idling. Ulf Hansson presented the latest updates around the CPU cluster idling series. The cpuidle subsystem handles each CPU individually when deciding which state to select for an idle CPU. It does not take into account the behavior of a cluster (a group of CPUs sharing resources such as caches) or consider power management of other shared resources (interrupt controllers, floating-point units, CoreSight, etc.). In principle, these are not problems on an SoC where the policy for the cluster(s) is decided solely by firmware, outside the knowledge of Linux. However, for many ARM systems (legacy and new), particularly those targeted at embedded, battery-driven devices, these problems do exist.

In the session, Ulf provided a brief overview of the significantly reworked version of the series, covering how the CPU topology is parsed from the device tree and modeled as power-management domains via the genpd infrastructure, as well as how the PSCI firmware driver comes into play during idle. Some concerns were raised around the new genpd CPU governor code being introduced as a part of the series: currently it means that, in parallel, the cpuidle governor decides about the CPU while the genpd CPU governor decides about the cluster. More importantly, these decisions are made without any information being shared between the two during the idle-state selection process, which could lead to problems in the long run, for example in cases where a CPU supports more than one idle state besides WFI.

Discussions moved along to more general thoughts about the future of cpuidle governors. In particular, interest was expressed in continuing to explore the option of adding a new governor based on interrupt-prediction heuristics, thus providing, as a first step, something that people could experiment with and report their experiences on.

Advanced Architectures

Security and performance trade offs in power interfaces. Charles García-Tobin used the time assigned to his slot to talk about the latest design choices regarding power interfaces in the ARM world, with the intent of gathering feedback from kernel developers and possibly steering the course of future decisions. He introduced the topic by stating that ARM systems have an embedded legacy which has led to designs where several assumptions have been made: the kernel has full and exclusive control of power of the platform, interfaces can be as low-level as needed, and only the operating system is "running". However, it has been discovered with time that low-level interfaces can be dangerous (e.g., CLKscrew) and that it is very hard to create a kernel that controls every single kind of system and can solve every power problem (especially if considering that the kernel might be too slow to respond in cases where power delivery or thermal capping need quick modulation). Therefore, ARM has come to the realization that more abstract interfaces are needed (and ACPI is one method of abstraction, but it’s not suitable everywhere).

ACPI provides a feature called "collaborative processor performance control" (CPPC). This is an abstract interface where firmware and the kernel can collaborate on performance control; it is abstract enough that it can be used on almost any system with a power controller. This makes it relatively straightforward to adopt in ARM-based enterprise systems. However, it needs feedback counters to give the kernel an idea of the per-CPU delivered performance versus the requested one. For the embedded world, ARM is also providing the System Control and Management Interface (SCMI), an extensible, standard platform controller interface to cover power and system management functions. SCMI will be driven directly from the kernel through SCMI drivers. ACPI-based kernels can also drive SCMI indirectly through ACPI-interpreted code. ACPI CPPC and SCMI implementations can coexist in the same power controller. As mentioned above, CPPC requires feedback signals to be used properly by the kernel; ARM is solving this problem by providing activity monitors (introduced in ARM v8.4), which are constantly running, read-only event counters that can be used, among other possible applications, to monitor delivered performance.
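The CPPC feedback calculation itself is straightforward: delivered performance over a window is roughly the reference performance scaled by the ratio of the delivered counter's advance to the reference (constant-rate) counter's advance. The sketch below illustrates the arithmetic only; variable names are illustrative.

```python
def delivered_perf(ref_perf, d_delivered, d_reference):
    """Delivered performance over a measurement window, computed
    from the deltas of the delivered and reference counters.
    The reference counter ticks at a fixed rate corresponding to
    'ref_perf'."""
    return ref_perf * d_delivered / d_reference

# A CPU with reference performance 100 whose delivered counter
# advanced half as fast as the reference counter was effectively
# running at performance 50 (e.g. due to frequency capping):
print(delivered_perf(100, 5_000_000, 10_000_000))   # → 50.0
```

It is the gap between this delivered value and what the kernel requested that lets cpufreq-style governors detect firmware-side capping, one of the uses Charles asked the audience about.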

Charles concluded his session by asking the audience whether ARM has come up with the right kind of feedback signals and what the kernel might do with this information. It was suggested that they might still be useful even if the kernel is not going to act on them, for debugging purposes, for example. They might also help to implement proactive policies when the CPU is in danger of hitting thermal-capping situations (it was noted, however, that this might not be easy to achieve because the system might still be too slow to react). Charles also wanted to understand whether people had experience with similar mechanisms. The audience mentioned RAPL as an always-useful tool to have, and that being able to request delivered performance for remote CPUs might be handy (when migrating tasks around, for example).

Scaling Interconnect bus. Georgi Djakov and Vincent Guittot led a session on addressing use cases with quality-of-service (QoS) constraints on the performance of the interconnect between different components of the system (or SoC). In the presence of such constraints, the interconnect's clock rate has to be sufficiently high to allow those constraints to be met, and there needs to be a way to make that happen when necessary. This will help the system choose between high performance and low power consumption, and keep the optimal power profile for the workload. That is currently done in vendor kernels shipping with Android devices, for example, and basically every vendor uses a different custom approach. A generalization is needed in order to add support for this to the mainline kernel.

The idea is to add a new framework for representing interconnects along the lines of the clock and regulator frameworks, following the consumer-provider model. Developers would implement platform-specific provider drivers that understand the topology and teach device drivers to be their consumers. Various approaches have been discussed and existing kernel infrastructure was mentioned — Devfreq, power-management QoS, the common clock framework, genpd, etc. None of these seem suitable to configure complex, multi-tiered bus topologies and aggregate constraints provided by drivers. People agreed that having another framework in the kernel is fine if the use-case is distinct enough and encouraged other vendors to join.
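A central job of such a framework is aggregating the constraints that multiple consumers place on a shared path. One plausible policy, sketched below purely for illustration (the actual proposal's rules were not spelled out in the session), is to sum the average-bandwidth requests while taking the maximum of the peak-bandwidth requests:

```python
def aggregate(requests):
    """Aggregate consumer constraints on one interconnect path.
    Each request is an (avg_bw, peak_bw) pair; sustained average
    bandwidths add up, while the peak requirement is the maximum
    any single consumer may burst to."""
    total_avg = sum(avg for avg, peak in requests)
    max_peak = max((peak for avg, peak in requests), default=0)
    return total_avg, max_peak

# Two consumers (say, display and camera) sharing a path, in MB/s:
print(aggregate([(200, 800), (300, 500)]))   # → (500, 800)
```

The provider driver would then translate the aggregated pair into a clock rate for each bus segment along the path, which is where the platform-specific topology knowledge comes in.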

An important point was raised about the cases when platform firmware takes care of tuning the interconnects. It was argued that the firmware has to be involved in the interconnect handling for the same reasons that were brought up during the "Security and performance trade offs in power interfaces" session. In such cases, in order to not clash with the firmware, the interconnect driver implementation could interact with the firmware and just provide hints about the current bandwidth needs, instead of controlling the hardware directly.

Another scenario that was discussed involved multiple CPUs or DSPs coexisting in an SoC and using shared interconnects. It is not clear exactly how to represent this topology in the kernel, as each CPU or DSP may have different constraints depending on whether it is active or sleeping. The current proposal is simply to duplicate the topology for both states, collect constraints from drivers for each state, and switch the configuration when the state changes. This is still an open question, as more feedback is needed.

Tools

eBPF super powers on arm64. The BPF compiler collection (BCC) is a suite of kernel tracing tools that allows systems engineers to efficiently and safely gain a deep understanding of the inner workings of a Linux system. Because these tools can't crash the kernel, they are safer than kernel modules and can be used in production.

In his talk, Joel Fernandes went through solutions, such as BPFd, for getting BCC working on embedded systems. He then went through several demonstrations, showing other tools for detecting performance issues. In the last part of the talk, a new tool being written to detect lock contention by monitoring the futex() system call was discussed. One of the issues is that it is difficult to distinguish between futexes used for locking and those used for other purposes; because of this, one may get false positives. Solutions to this problem were discussed. Peter Zijlstra suggested using FUTEX_LOCK_PI from user space, which is a locking-specific futex() command. Other than this, there is no easy way to solve the problem, it seems.
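Peter's suggestion works because a tracer can decode the command out of the futex() op argument and match on the PI-locking commands unambiguously. The constants below come from the kernel's <linux/futex.h> UAPI header; the filtering helper itself is an illustrative sketch of what the tracing tool might do, not Joel's actual code.

```python
# Constants as defined in <linux/futex.h>:
FUTEX_LOCK_PI = 6
FUTEX_PRIVATE_FLAG = 128
FUTEX_CLOCK_REALTIME = 256
FUTEX_CMD_MASK = ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)

def is_pi_lock(op):
    """True if a traced futex() op is the PI-locking command,
    regardless of the PRIVATE / CLOCK_REALTIME modifier flags
    that may be OR'ed into the op word."""
    return (op & FUTEX_CMD_MASK) == FUTEX_LOCK_PI

print(is_pi_lock(FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG))   # → True
print(is_pi_lock(1))                                    # FUTEX_WAKE → False
```

A FUTEX_WAIT, by contrast, could equally be a condition variable or an event flag, which is precisely why the generic case produces false positives.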

The audience also discussed ideas for how eBPF can be useful for power testing. One of the ideas is to use the energy model available on some platforms, together with cpufreq residency information, to calculate approximate power numbers. This could enable the writing of tools like powertop. Another idea that was brought up pertained to the use of eBPF to monitor scheduler behaviors and verify that certain behaviors are working as intended in the scheduler. Lastly, the crowd talked about using eBPF for workload characterization and discussed an existing tool that does this, called sched-time and developed by Josef Bacik. However, Joel mentioned that sched-time might need more work, as it wasn't working properly in his tests. It was agreed that it would be nice to fix it and use it to characterize workloads.
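The energy-model idea amounts to a weighted sum: time spent at each OPP, multiplied by the model's power figure for that OPP. The sketch below is a hypothetical illustration of that arithmetic, not an existing tool.

```python
def estimate_energy_mj(residency_ms, power_mw):
    """Approximate energy (in mJ) from per-OPP residency times (ms)
    and an energy model giving the power draw (mW) at each OPP:
    E = sum over OPPs of residency * power."""
    return sum(residency_ms[opp] * power_mw[opp] / 1000
               for opp in residency_ms)

residency = {"1.0GHz": 600, "2.0GHz": 400}   # ms spent in each OPP
power = {"1.0GHz": 200, "2.0GHz": 700}       # model's power at each OPP
print(estimate_energy_mj(residency, power))  # → 400.0 mJ over one second
```

An eBPF program would gather the residency side of this at low overhead by hooking cpufreq transition tracepoints, leaving the per-OPP power figures to the platform's energy model.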

Scheduler unit tests. Dhaval Giani brought up the topic of using rtapp traces for regression testing, which was originally raised at the Linux Plumbers Conference last year. Patrick Bellasi pointed out that he uses rtapp for functional testing as opposed to performance testing. Peter Zijlstra also mentioned that he was never able to get Facebook's rtapp traces running on a system that was very similar to the Facebook setup; he remained skeptical of running traces on systems different from those where they were generated. Many in the audience were also unsure whether rtapp could be used to model memory, interrupts, and other conditions affecting the workload. It was mentioned that rtapp could already do some of that and could be extended further.

The discussion moved toward trying to make rtapp test cases in use available more widely, and trying to write more functional test cases. Rafael suggested having test cases testing only one functionality, which is what Patrick confirmed their test cases did. Charles Garcia-Tobin mentioned that he would like to describe the scheduler as a set of invariants and use rtapp to test that those invariants are not broken. Most folks agreed on the idea of having a set of core invariants with each user describing their own requirements as test cases. The discussion ended with lunch approaching and Dhaval agreeing to look into this idea of invariants.

Comments (none posted)