Kernel development news

Capsicum is a process sandboxing framework that was originally developed for FreeBSD; it was covered here in early 2012. The beginnings of Capsicum support for Linux have now been posted by David Drysdale for review; that provides a good opportunity to look at how this mechanism might fit into the Linux kernel. This work might be a hard sell in the kernel community, but Capsicum might just provide a sufficiently useful set of features to make the trouble worthwhile.

Capsicum is built around a concept called "capabilities," which, naturally enough, is entirely different from Linux capabilities (and from POSIX capabilities as well). In the Capsicum world, capabilities are attached to file descriptors to regulate which operations can be performed on those descriptors. So, for example, a file descriptor can only be read if the CAP_READ capability is present. Access to lseek() is controlled by CAP_SEEK, memory mapping has a set of capabilities (CAP_MMAP_W to create a writable mapping, for example), truncation is controlled by CAP_FTRUNCATE, and so on. There are two special capabilities for ioctl() and fcntl() that restrict those calls to specific subcommands.

By default, open file descriptors are unrestricted. The normal mode of operation is that a process will apply restrictions to itself using the new cap_rights_limit() system call:

int cap_rights_limit(unsigned int fd, struct cap_rights *new_rights, unsigned int new_fcntls, int nioctls, unsigned int *new_ioctls);

After this call, operations on fd will be limited to those listed in new_rights. If those rights include CAP_FCNTL, then new_fcntls limits the set of fcntl() commands available. Similarly, if the capabilities on the file descriptor include CAP_IOCTL, the new_ioctls array (of length nioctls) provides the set of allowed ioctl() commands. Multiple calls to cap_rights_limit() can be made for the same file descriptor, but those calls can only remove capabilities, never add them.
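As a simple illustration, a process that needs only to read from an already-open descriptor might restrict it as in the sketch below. The cap_rights_limit() prototype comes from the patch description above; the cap_rights_init() helper and the exact contents of struct cap_rights follow FreeBSD's Capsicum library and are assumptions here.

    struct cap_rights rights;

    /* Allow only read() and lseek() on this descriptor; the helper used
       to fill in struct cap_rights is assumed, following FreeBSD. */
    cap_rights_init(&rights, CAP_READ, CAP_SEEK);

    /* No fcntl() or ioctl() subcommands are granted, so those arguments
       are left empty. */
    if (cap_rights_limit(fd, &rights, 0, 0, NULL) < 0)
            perror("cap_rights_limit");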

There is also a cap_rights_get() call to query the set of capabilities attached to a given file descriptor.

Needless to say, restrictions on file descriptors are of limited value if the constrained process can simply open new descriptors on the same objects. To prevent that from happening, Capsicum implements a "capability mode" entered via cap_enter(). Once that mode has been entered, access to most global namespaces is curtailed, preventing the opening of new files. A process can still open a file with openat() if it has a directory file descriptor (and, of course, the relevant capabilities are present). Such files are constrained to be underneath that directory, though — use of absolute pathnames or "../" is not allowed.

(As an aside, the "can only open files below this directory" functionality was deemed to be sufficiently useful that David pulled it out into a separate patch and made it available independently of Capsicum. This patch adds a new O_BENEATH_ONLY flag for calls like openat(). Once a directory has been opened with this option, the resulting file descriptor can only be used to open files that exist below that directory in the filesystem hierarchy.)
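In rough terms, and following the description above (the flag exists only in the posted patch, and the paths here are purely illustrative), usage would look something like:

    /* Open a directory that will serve as the boundary for later lookups. */
    int dirfd = open("/srv/data", O_RDONLY | O_DIRECTORY | O_BENEATH_ONLY);

    /* Opening a file below that directory is allowed... */
    int ok = openat(dirfd, "logs/current", O_RDONLY);

    /* ...while attempts to reach outside of it should fail. */
    int bad = openat(dirfd, "../etc/shadow", O_RDONLY);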

That said, the patch set as posted does not provide an implementation of cap_enter(). Also missing is the entire "process capabilities" mechanism, which represents specific processes as file descriptors so that the relevant system calls (wait(), kill(), etc.) can be controlled. The patch set is described as being "part 1," so, one assumes, the remaining pieces will come later.

Within the kernel, system call implementations typically start by converting passed-in file descriptors to struct file pointers with calls to fdget(). This is the point where David decided to apply the capability checks. When file descriptors are restricted with Capsicum, the normal file structure is replaced by a wrapper structure containing the rights information. Every fdget() call in the kernel (there are about 100 of them) must be replaced with a call to:

struct file *fdgetr(unsigned int fd, int caps ...);

Where caps is a variable-length list of capabilities that must be present for the operation to succeed. Callers must also be changed to deal with an "error pointer" return value; fdget() in current kernels can return NULL but not a specific error value. The result is that the patch set is somewhat invasive; that may be a cause of resistance should the patch set reach a point where it is being seriously proposed for inclusion.
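A simplified, hypothetical call site (not taken from the patch set, and following the article's description of the return-value change) conveys the flavor of that conversion:

    struct file *f;

    /* Current kernels: no rights are checked when the descriptor is
       looked up. */
    f = fdget(fd);
    if (!f)
            return -EBADF;

    /* With the Capsicum patches: the required rights are named at the
       lookup site, and failure comes back as an error pointer. */
    f = fdgetr(fd, CAP_READ, CAP_SEEK);
    if (IS_ERR(f))
            return PTR_ERR(f);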

The patch set currently works by creating a pair of new Linux security module (LSM) hooks to do the actual capability checks. David wonders, though, whether that is the right approach, since Capsicum is not a complete security module. If the kernel implemented stacked security modules, perhaps Capsicum could be run in this mode alongside another, more complete module. But stacking does not look like it will be supported in the kernel anytime soon. So Capsicum may well be better off implemented outside of the LSM framework.

There is another question that is worth considering here. The kernel's secure computing (seccomp) subsystem allows the loading of programs (written for the BPF virtual machine) that can, in theory, implement all of the restrictions found in Capsicum, especially if the recently proposed BPF changes are merged. It might not be easy, but it should be possible. Somebody is bound to ask whether the kernel needs another sandbox-creation mechanism with similar capabilities.

In general, the addition of new security-related subsystems can be a hard sell; many developers see little value for a lot of cost in these subsystems. But there is value in the ability to reduce the damage that can be done by a compromised process, and FreeBSD's use of Capsicum means that some applications have already had the necessary code added. Adding the same API to Linux would allow that work to be reused. So Capsicum seems worth considering, even if it will likely have some obstacles to overcome before merging is a possibility.

Comments (none posted)

The Berkeley Packet Filter, or BPF, is a special-purpose virtual machine that was originally developed to support applications that wanted to quickly filter packets out of a stream. Over the years, its use in Linux has grown; back in May, LWN characterized BPF as "the universal in-kernel virtual machine." Development on BPF continues; a new patch set adds some interesting capabilities and demonstrates some of what developer Alexei Starovoitov has in mind for this subsystem.

The first thing this patch series does is to move the BPF interpreter out of the networking subsystem. BPF can already be used with non-networking parts of the kernel, and the plans are for such uses to grow over time. So the BPF support code will move into a new subdirectory (kernel/bpf) and be maintained independently from the networking code.

Over the past few development cycles, Alexei has introduced a variant of BPF called "extended BPF" (eBPF) which adds a number of capabilities and performance improvements. Thus far, though, eBPF has only been used within the kernel itself; the existing BPF users load "classic" BPF programs into the kernel which are then translated to eBPF prior to execution. With this patch series, though, eBPF will be made available for direct use from user space. Among other things, that means that the eBPF instruction set will, once users pick it up, become difficult to change. There has been relatively little review of the instruction-set changes so far; anybody who has an interest in how this (significant) addition to the kernel's user-space ABI is defined might want to take a close look in the near future.

Loading programs

The patch series adds a new system call named, simply, bpf(); it is a multiplexor for a range of different operations. Alexei also supplies a wrapper library to present those operations as a set of independent functions. Multiplexed system calls have not always been popular with reviewers in the past; if that pattern holds, we may see the multiplexed interface taken out and the various functions implemented directly as separate system calls.

So, for example, user space can load an eBPF program into the kernel with a call to:

int bpf(BPF_PROG_LOAD, int prog_id, enum bpf_prog_type type, struct nlattr *prog, int len);

Or, using the wrapper function:

int bpf_prog_load(int prog_id, enum bpf_prog_type prog_type, struct sock_filter_int *insns, int prog_len, const char *license);

In either case, prog_id is a number used to identify the program; these numbers exist in a single, global namespace. There is currently only one possible value (BPF_PROG_TYPE_UNSPEC) for type. In the actual system call, the BPF program is found in prog; the networking roots of BPF show here, where a netlink attribute is used to hold the code. The length of the attribute array is passed in len. The wrapper, instead, hides the nlattr structure, but exposes a struct sock_filter_int structure (which will likely be renamed in the future) to hold the program. The license parameter will be discussed below.
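As a minimal, hedged sketch, loading a do-nothing program through the wrapper might look like the code below. The instruction-building macros are the kernel's internal eBPF helpers of the time, and the interpretation of prog_len as a size in bytes is an assumption:

    /* A trivial eBPF program that just returns 0. */
    struct sock_filter_int insns[] = {
            BPF_MOV64_IMM(BPF_REG_0, 0),    /* r0 = 0 */
            BPF_EXIT_INSN(),                /* return r0 */
    };

    /* Program ID 1 is an arbitrary choice within the global namespace. */
    int err = bpf_prog_load(1, BPF_PROG_TYPE_UNSPEC, insns,
                            sizeof(insns), "GPL");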

Naturally, adding the ability to run programs within the kernel brings up a number of interesting security issues. So it is not surprising that the biggest part of the patch set is a "verifier" that attempts to ensure that eBPF programs cannot harm the running system. The verifier simulates the execution of the program, looking for problematic behavior. Should something suspect turn up, the program will not be loaded.

The verifier looks for a number of things. It tracks the state of every eBPF register and will not allow their values to be read if they have not been set. To the extent possible, the type of the value stored in each register is also tracked. Load and store instructions can only operate with registers containing the right type of data (a "context" pointer, for example), and all operations are bounds-checked. The verifier also disallows any program containing loops, thus ensuring that all programs will terminate.

In this patch set, the CAP_SYS_ADMIN capability is required to use any of the bpf() system call functions. That restriction may limit interesting future uses of eBPF, but there are a number of potential issues (such as the single global ID namespace and resource use limits) that would have to be dealt with before that restriction could be lifted.

Licensing issues

The bpf_prog_load() wrapper also has a license parameter; the value passed there is stored in the nlattr array prior to the bpf() call. It is used to provide a string specifying the license that applies to the eBPF program to be loaded; if that license is not GPL-compatible, the kernel will refuse to load the program. This behavior already appears to be somewhat controversial; reviewers noted that full-blown kernel modules can be loaded (albeit with reduced access) without a GPL-compatible license declaration. It strikes some of them as strange to apply stricter rules to eBPF programs. In response, Alexei has said that future revisions might move to a module-like scheme where any program can be loaded but access to some functions might be restricted to GPL-compatible programs.

There could be some interesting implications from this type of restriction. BPF programs are often generated by other programs; the original BPF, after all, was meant to be emitted by the tcpdump tool. One might well wonder what the "source" of such a program actually is. If the Chromium browser generates an eBPF script to define a sandbox for a plugin module, which parts of Chromium, if any, are part of the source for that script? One can imagine that the discussion of this issue could go on for a long time indeed.

Maps

The other significant addition in this patch set is "maps." A map is a simple key/value data store that can be shared between user space and eBPF scripts and is persistent within the kernel. As an example of their use, consider this simple program included with the patch set. It creates a map with two entries, indexed by IP protocol type; an eBPF script then inspects passing packets and increments the appropriate entry for each. The program in user space can then query those entries to get a sense for what kind of traffic is passing through the system.

Maps can only be created or deleted from user space; eBPF programs do not have that capability. Maps are created and deleted with:

int bpf_create_map(int map_id, int key_size, int value_size, int max_entries);
int bpf_delete_map(int map_id);

As with program IDs, the namespace for the map_id is shared across the entire system; there is no mechanism to specify which maps a given eBPF (or user-space) program may access. To store values into and retrieve values from maps, user space can call:

int bpf_update_elem(int map_id, void *key, void *value);
int bpf_lookup_elem(int map_id, void *key, void *value);
int bpf_delete_elem(int map_id, void *key);
int bpf_get_next_key(int map_id, void *key, void *next_key);

Once again, these are the wrapper functions; the actual operations are done with the bpf() system call. On the eBPF side, access to maps is provided with a set of external functions. Interestingly, each place where use of eBPF programs is enabled (see below) must explicitly set up access to the map functions; this access is not provided to eBPF programs by default. Maps, in the end, function both as a persistent data store for eBPF programs and a means for communication with user space.
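Following the protocol-counting example mentioned above, the user-space side might look roughly like the sketch below; the map ID and the key and value sizes are illustrative choices, not values taken from the patch set.

    /* Create a two-entry map from 4-byte protocol numbers to 8-byte
       counters; map ID 1 is an arbitrary choice in the global namespace. */
    if (bpf_create_map(1, sizeof(int), sizeof(long), 2) < 0)
            return;

    /* ... load and attach the eBPF program that updates the counters ... */

    /* Later, read back the count of TCP packets seen so far. */
    int key = IPPROTO_TCP;
    long value = 0;
    if (bpf_lookup_elem(1, &key, &value) == 0)
            printf("TCP packets seen: %ld\n", value);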

Running eBPF programs

There is one operation that is conspicuous by its absence in the discussion thus far: the ability to actually run an eBPF program. There is little point in running an eBPF program on demand from user space; there is not much that it could do that couldn't be more easily accomplished directly. Instead, eBPF programs are meant to respond to events within the kernel.

One common event, of course, is the receipt of a packet from the net. The patch set adds a new form of access to the socket filtering mechanism, allowing a program to directly attach an eBPF program to an open socket:

setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER_EBPF, &prog_id, sizeof(prog_id));

Here, prog_id must be the ID number of a program previously loaded into the kernel with the bpf() system call.

eBPF programs can also be attached to tracepoints; such a program will be run every time that the tracepoint fires. This attachment is done by writing the string "bpf_ID" to the appropriate filter file in the tracing debugfs filesystem; once again, ID is the ID number of a loaded eBPF program. Missing, thus far, is a way to use eBPF directly with the secure computing (seccomp) mechanism; one assumes that will follow at some point.
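Done from C, with the tracepoint path chosen purely for illustration, the attachment might look like:

    /* Attach the program with ID prog_id to a tracepoint's filter file. */
    FILE *f = fopen("/sys/kernel/debug/tracing/events/net/netif_receive_skb/filter", "w");
    if (f) {
            fprintf(f, "bpf_%d", prog_id);  /* "bpf_ID" names a loaded program */
            fclose(f);
    }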

All told, this patch set represents a significant addition to the BPF virtual machine. It is also a large addition to the kernel's user-space ABI; that suggests that it needs a rather higher level of review than it has seen so far. Once that review happens, the shape of the patch set may well change from what has been described here. But there seems to be little disagreement that the kernel can benefit from a more capable virtual machine that can be used in a number of contexts. So, sooner or later, some version of these patches will probably go in.

Comments (6 posted)

While it may not be the most controversial feature ever added to Linux, there is little difficulty in finding mailing lists or Internet forums containing heated arguments about the merits of control groups — or even downright denials that the feature has any merit at all. Being bereft of a personal agenda on the matter or any deep understanding of the issues, I find it very hard to choose a side in these debates, which seriously lessens the enjoyment I can receive from them. As synthesizing a deep understanding is, I find, much more noble than synthesizing a personal agenda, and as having a discerning audience is an excellent motivation for thorough research, these articles are intended to help me and, hopefully, other readers to develop the deep understanding necessary to truly enjoy an informed debate on Linux control groups, which are also known as "cgroups".

To gain this understanding we will need both a broad perspective and some detailed analysis. The first two articles in this series will try to provide some perspective by first exploring the history of Unix to see what questions it raises about process groups, and then looking at hierarchies, both within and without the Unix family, to give us some yardsticks to measure the hierarchical aspects of cgroups.

Subsequent articles will then delve into the nitty gritty details of cgroups and its various control subsystems and attempt to relate what we find to the questions and metrics the broader perspective gave us.

Sixth Edition Unix

Unix has some history with process groupings and, more significantly, some evolution. Observing this change can help us to see important details. While it would be nice to start at the very beginning, a more practical starting point is the Sixth Edition of Unix, known hereafter as "V6 Unix".

V6 Unix dates from the mid-1970s and was the first edition to get much exposure outside of Bell Labs. It supports two different groupings of processes, though to justify that we should first clarify what we mean by "a grouping of processes".

As in number theory, not every set is a group. The set of processes with a prime identifying number, for example, is certainly a set. However there is no mechanism in Unix (then or now) to distinguish these processes in any way from those with composite ID numbers. The remaining set of processes, with neither a prime nor a composite ID number, does have a distinctive behavior. As it contains only PID 1, though, it is hardly worth considering as a group.

A number-theoretic group includes an operation that operates on members of the group with particular rules for what an "operation" is. For process groups, we will accept a much more vague concept and a different role for an "operation", but still there must be some operation within Unix which can affect, or be affected by, a particular process group.

A less facetious set than the "prime PID" set would be the set of processes owned by a given user ID (or "UID"). We won't consider this to be a group in V6 Unix because while there are operations (e.g. kill) that will affect processes in one group differently from processes in another group, there is no way to interact with the group as a whole.

The first set that really forms a meaningful group is the set of children of some given process. The only operation in V6 Unix which recognizes this group is the wait() system call and it can only detect if the group is empty or not empty. If wait() returns with error ECHILD, then the group is empty. If it returns without an error, or doesn't return, then the set wasn't empty when the call was made (though it might be empty when the call completes).

The same operation could be interpreted as applying to the set of descendants of the given process — that is, the children and any children of those children, etc. ECHILD is returned if and only if this set is empty too. This group has a significantly different behavior, though. In the group of children, a process cannot escape the group except by exiting. In the group of descendants, a process can escape, if it is not an immediate child, when any ancestor of it exits.

Whether the ability to escape is a valuable property of groups or not depends, somewhat, on use-cases and expectations. In V6 Unix, the descendants of PID 1 (that set with a unity ID number) cannot escape but descendants of any other process can. This remained the case for variants of Unix and into Linux until Linux 3.4, when the PR_SET_CHILD_SUBREAPER option for prctl() was added. This allows a process to declare its group of descendant processes as closed so processes cannot escape. If any descendant dies, then all its children are inherited by the process which set this option.
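For reference, a process opts into this behavior with a single prctl() call:

    #include <sys/prctl.h>

    /* Descendants that are orphaned from now on will be reparented to
       this process rather than escaping to init. */
    prctl(PR_SET_CHILD_SUBREAPER, 1);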

The other, possibly more interesting, process grouping present in V6 Unix is determined by the p_ttyp field in the process structure (defined in proc.h), which is described as the "controlling tty". Whenever a process opens a "tty" device (see dhopen() in dh.c), which would be a serial data connection to a teletype or similar terminal, then if this field is not already set, it will be set to point to the newly opened device. The field is also inherited over a fork() or exec(), so once a process gained a controlling tty, that would continue to apply to the process and all of its future descendants.

One effect of p_ttyp is that any I/O to /dev/tty will go to the controlling tty, but this doesn't really qualify as a "group" operation, as it affects individual processes separately. The "group" operations for controlling ttys involve the delivery of signals (see signal() in sig.c). If a DEL or FS (control-\) character is typed on a tty, then the signal SIGINT or SIGQIT is sent to all processes in the group that have that tty as their controlling tty. Similarly if a disconnect event is detected (like a modem hanging up), a SIGHUP is sent to the same group of processes. Signals can also be sent with the kill() system call. An attempt to send a signal to PID 0 will send it to every process with the same controlling tty as the sending process.

It is quite reasonable to think of this grouping as a prototype of cgroups. It is clearly about the grouping of processes and clearly about controlling those processes — though only through sending a signal. These groups are created automatically, based on behavior, and are permanent — once in a group, the process cannot escape. It appears that they were not perfect, though. The next edition brought changes.

Seventh Edition Unix

While V6 Unix supported process groups, it did not use that terminology. V7 Unix did, and had a richer concept of group. The p_ttyp field still exists, though its role was restricted to managing /dev/tty access. It was renamed to u_ttyp and moved to struct user (user.h) — a structure that could be swapped out to disk with the rest of the process. struct proc (proc.h) instead had a new p_pgrp field to manage process groups. It was set on the first open() of a tty and used for delivering SIGINT, SIGQUIT (which has now gained a 'U'), and SIGHUP, and for delivering signals sent to PID 0. But V7 also brought more flexibility.

The key change was that process groups now had an independent identity and an independent name — independent of the tty, at least. When a process without a controlling tty first opened a tty, a new process group would be created with an ID number matching the process ID number of that process. Though the ID was copied, it really was a new ID for a new object. The group can continue to exist even if the original process exits. Any remaining children will keep the group active and prevent the ID number from being reused, either as a process-group ID or as a process ID.

One consequence of this is that if you log off a tty and log back on again, you get a new process group, and the t_pgrp field in the struct tty structure will be changed. Unlike the situation in V6 Unix, a signal sent to a process group will never go to a process from a previous login on the same tty.

Another consequence is that process groups could be used for more than just ttys. Seventh Edition Unix had a "multiplexor driver" (mpxchan in mx1.c and mx2.c) which, though short-lived, still leaves a legacy in the current stat() manual page:

3000   S_IFMPC   030000   multiplexed character special (V7)
    [...]
7000   S_IFMPB   070000   multiplexed block special (V7)

The multiplexor worked a little bit like a socket interface and allowed different processes to connect to each other. An interface was available to form a separate process group for several interconnecting processes, so the master could send a signal to all other members of the group.

V7 Unix process groups were still closed, with processes generally unable to leave them. mpxchan does appear to allow a process to leave its original process group to join a group for a multiplexed channel, but it isn't clear that this was an intended consequence.

Fourth Berkeley Software Distribution

It is a bit of a large jump from V7 to 4BSD, passing over at least Unix/32V and 3BSD along the way. But this is, to some extent, a personal journey, and 4.3BSD was the next release that I used.

In 4BSD, we find that a lot has happened with process groups. In 4.3BSD, the set of processes with the same UID has become a group, in that a signal can be sent to all processes in that set (see kill() in kern_sig.c). Sending a signal to a PID of -1 will deliver it to all processes with the same UID as the sending process (though, if sent from a privileged process, the signal will be sent to every process regardless of UID). More significant is that by 4.4BSD there was now a limited hierarchical structure to process groups.

One of the many innovations in the Berkeley versions of Unix was "job control". A "job" here refers to one or more processes working together on a particular task. Unix already had the ability to put some jobs in the "background", but it was implemented in a fairly ad hoc manner. Such processes would be told to ignore any signals from the user (SIGINT and SIGQUIT would both be set to SIG_IGN before starting the process) and the shell would simply not wait for those processes to finish. This mostly worked well, but once a job was in the background, it had to stay there. Also, if such a process wrote to the terminal, its output could get mingled with output from foreground processes, resulting in a mess.

With BSD "job control", each job is placed into its own process group and the shell can tell the terminal to change its idea of which is the current foreground job (and so would receive signals and input and could generate output), and which jobs are in the background so they should be isolated.

The pre-existing concept that process groups were essentially per-login was still important, if only to provide a degree of compatibility with "System V" Unix, AT&T's separate line of Unix development. In 4.4BSD, these per-login process groups were re-introduced as "sessions". Each process (proc.h) was (potentially) a member of a process group. Each process group was a member of a session. Each terminal (tty.h) had a foreground process group, t_pgrp, and a controlling session, t_session.

Sessions were, and are, much like the V7 Unix process groups, though there are differences. One is that it is not possible to send a signal to all processes in a given session: that functionality only works for process groups, which are now per-job. Another is that a process can leave its session and create a new one by simply calling the setsid() system call.

Either of these is sufficient to frustrate the task of killing all processes at logout — as local policy required in student labs a long time ago in a career far away. That frustration was, at the time, unfixable due to a dependence on closed-source kernels.

On a modern, windowed desktop, these sessions and process groups are still present, but don't mean quite what they once meant. It is fairly easy to see how session IDs and process-group IDs are assigned by displaying the sess and pgrp fields with ps, as follows:

ps -axgo comm,sess,pgrp

There is no longer a well defined process grouping for a login session. Instead, each terminal window gets its own session, as do various other applications if they were written to request one. Each job started from the shell prompt still gets its own process group, but there is much less need to start and stop these jobs — rather than suspending the currently running job in one terminal window, it is just as easy to pop up another window and run some new command there.

To properly represent the groupings of processes relevant for a modern desktop, we really need a deeper hierarchy. One level would represent the login sessions, one would represent the applications running in those sessions, and one could be used for jobs within an application. The sessions and process groups that Linux inherits from 4.4BSD can give us only two of those levels. Maybe we can look to cgroups for the third.

Issues

Reflecting on these changes and experiences with process groups, there are a number of issues that may be worth considering when trying to form an opinion on the more modern form of cgroups:

Names for groups: In V6 Unix the only name was that of the associated resource: a tty. This changed to be an ID number in the same namespace as process ID numbers. In retrospect this sharing of namespaces might seem a little clumsy, though it was clearly convenient. As the kernel was solely responsible for allocating names (another noteworthy feature), any clumsiness remained safely inside that kernel.

Overlapping uses: The same mechanism was originally used to guide both the delivery of signals and the processing of I/O to /dev/tty. These were quickly separated since, while clearly related, they are not identical.

Should a process be able to escape its containing group? We have seen a progression in the answer to this, from "no" to "yes". Having the flexibility can be useful in some cases, but having control can be useful in others. Being able to enter a different job under the same session is easy to defend. Being able to create a new session is not so obviously useful for an unprivileged process.

What role does a hierarchy play? Process groups have only gained even a limited hierarchy toward the end of their development. Is this important? How can it be used?

That last point, hierarchy, certainly is important. A lot of the recent changes in cgroups, and a significant part of the disagreements, relate to hierarchy. While the history of process groups has given us a glimpse of hierarchy it is not enough to develop any real understanding. For that we will need to look elsewhere. In the next installment we will examine a few different "elsewheres" to develop a perspective on hierarchy that we will then take to the inner details of cgroups to see if the former can help us to better understand the latter.

Comments (13 posted)