Making containers safer

On day one of the Linux Security Summit North America (LSS-NA), Stéphane Graber and Christian Brauner gave a presentation on the current state and the future of container security. They both work for Canonical on the LXD project; Graber is the project lead and Brauner is the maintainer. They looked at the different kernel mechanisms that can be used to make containers more secure and provided some recommendations based on what they have learned along the way.

Caring about container safety

Graber began by asking "why do we care about safe containers?" Not everyone does, he said, but the Linux containers project, which LXD and LXC are part of, has been working on containers for over ten years. LXC and LXD are used to create "system containers", which run unmodified Linux distributions, not "application containers" like those created using Docker. The idea is that LXD users will use the same primitives as they would if they were running the distribution in a virtual machine (VM); that they are actually running it in a container is not meant to be visible to them.

Administrators of these system containers will often give SSH access to the "host" to their users, who will run whatever they want on them. That is one of the reasons the project cares a lot about security. It uses every trick available, he said, to secure those containers: namespaces, control groups, seccomp filters, Linux security modules (LSMs), and more. The goal is to use these containers just like VMs.

Since the project targets system containers, it builds images for 18 distributions with 77 different versions every day, Graber said. That includes some less-popular distributions in addition to the bigger names; it also builds Android images. Beyond that, LXD is being used as part of the recent Linux desktop on Chromebooks feature of Chrome OS. There are per-user VMs in Chrome OS, but the Linux desktop distribution runs in a container with some persistent storage, he said. It has GPU passthrough and other features to make the desktop seamlessly integrate with Chrome OS.

All of the users of those distribution images built by the project can run any code they want inside those containers, which means that the Linux containers project needs to care a lot about security, Graber said.

Privileged versus unprivileged

There are two main types of containers, he said: privileged and unprivileged. But the Linux kernel has no notion of containers; they are purely a user-space construct built from the tools provided by the kernel. Privileged containers are those where root inside the container is the same user as root outside the container (i.e. UID 0). That is not true for unprivileged containers because UID 0 inside the container is mapped to some other, unprivileged user outside of the container via user namespaces.
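As a rough illustration of that mapping: the kernel exposes it in /proc/PID/uid_map, where each line holds the first ID inside the namespace, the first ID outside, and the length of the range. The sketch below parses that format and translates a container UID to the host's view. The unprivileged map shown (container root mapped to host UID 100000) is an illustrative assumption, not a value from the talk.

```python
from typing import List, Optional, Tuple

# One uid_map entry: (first ID inside the namespace, first ID outside, count).
IdMap = List[Tuple[int, int, int]]


def parse_id_map(text: str) -> IdMap:
    """Parse the contents of a uid_map or gid_map file."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def to_host_id(id_map: IdMap, container_id: int) -> Optional[int]:
    """Translate an ID seen inside the container to the host's view.

    Returns None when the ID has no mapping.
    """
    for inside, outside, count in id_map:
        if inside <= container_id < inside + count:
            return outside + (container_id - inside)
    return None


# Hypothetical unprivileged container: 65536 IDs starting at host UID 100000.
UNPRIVILEGED = parse_id_map("         0     100000      65536\n")
# A privileged container sees the identity mapping of the initial namespace.
PRIVILEGED = parse_id_map("         0          0 4294967295\n")

print(to_host_id(UNPRIVILEGED, 0))  # 100000: container root is unprivileged on the host
print(to_host_id(PRIVILEGED, 0))    # 0: container root is real root
```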

Sadly, he said, the vast majority of containers that are run today are privileged containers. That includes most Docker containers and most of the containers that are run with Kubernetes. The main problem is that an attacker who can break out of the container now has root privileges on the host; the whole system is compromised. The security of those containers depends on LSMs, Linux capabilities, and seccomp filters; the container's privileges are not isolated enough and the policies for the various security mechanisms tend to "fail open".

The LXD project does not consider privileged containers to be safe to run; it is not a configuration that is supported. The project does what it can to close any of the holes it knows about, but strongly recommends against using privileged containers.

For unprivileged containers, since root in the container does not map to UID 0 in the host system, a container breakout is still serious, but not as damaging as it is for a privileged container. There is also a mode where each LXD container in a system will have its own non-overlapping UID and GID ranges in the host, which limits the damage even further. Any breakout will result in a process with a UID and GID that is not shared with any other process in any other container (or the host system itself).
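The non-overlapping mode can be pictured as handing each container its own slice of the host's ID space. The following is only a sketch of that idea; the pool start, the slice size, and the function names are assumptions made for illustration and do not describe how LXD actually allocates ranges.

```python
from typing import Dict, Iterable, Tuple

RANGE_SIZE = 65536         # assumed number of IDs given to each container
FIRST_HOST_ID = 1_000_000  # assumed start of the host-side pool

# A range is (first host ID, number of IDs).
Range = Tuple[int, int]


def allocate_ranges(names: Iterable[str]) -> Dict[str, Range]:
    """Give every container a disjoint slice of host IDs."""
    return {
        name: (FIRST_HOST_ID + index * RANGE_SIZE, RANGE_SIZE)
        for index, name in enumerate(names)
    }


def host_id(ranges: Dict[str, Range], name: str, container_id: int) -> int:
    """Host-side ID of a container-side ID for the named container."""
    base, size = ranges[name]
    if not 0 <= container_id < size:
        raise ValueError("ID is not mapped in this container")
    return base + container_id


def ranges_overlap(a: Range, b: Range) -> bool:
    """True if two (base, size) ranges share at least one ID."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


ranges = allocate_ranges(["web", "db"])
# Root in each container lands on a different, unprivileged host UID.
print(host_id(ranges, "web", 0), host_id(ranges, "db", 0))
```

With disjoint slices, a process that escapes one container holds a host UID that owns nothing in any other container.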

User namespaces have been around since the 3.12 kernel, but few other container management systems use the feature to isolate their containers. Part of the reason for that is the difficulty in sharing files between containers because of the UID mapping. LXD is currently using shiftfs on Ubuntu systems to translate UIDs between containers and the parts of the host filesystem that are shared with it. Shiftfs is not upstream, however; there are hopes to add similar functionality to the new mount API before long.

The perils of privileges

After that, Graber turned the floor over to Brauner, who started by rhetorically asking "are privileged containers really that unsafe?" His answer was an unequivocal "yes"; he listed a half-dozen or so "pretty bad" CVEs that have affected privileged containers over the last few years. That list included CVE-2019-5736, which was the runc container-confinement breakout that was disclosed in February; it was a bad way to start the year in terms of container security. As far as he can tell, none of those CVEs would affect unprivileged containers like those created by LXD.

It should be fairly trivial to use all of the available security mechanisms, but it turns out not to be. It is often the case that there is some way to block the problem behavior, but it is not used by the container managers for a variety of reasons. Some of those technologies may not be well documented, he said, which is a problem that the kernel developers should fix.

He began with namespaces, which are not used enough in his view. In the application container world, too few of the namespaces are used, typically just the mount namespace. All of them have some security benefit by isolating some resource from the rest of the system. The most obviously useful namespace is the user namespace, which isolates privileges between containers.

Namespaces have a "clunky API for sure", he said. Kernel developers should find a way to make it "nicer in some way". Properly ordering the creation of the namespaces at container startup time is important. In addition, there is no way to atomically setns() into all of the namespaces for a process. Brauner said he has some ideas on how to make that work better.

Next up was seccomp filters, which are "essential for privileged containers", Brauner said. Allowing privileged containers to call open_by_handle_at(), for example, will lead directly to a compromise. Seccomp filters provide a "useful safety net" for unprivileged containers, but are not truly required. Typically, unprivileged containers can maintain a blacklist of system calls that cannot be called, while privileged containers will need to create a whitelist of safe system calls.
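The difference between the two list styles shows up when a system call appears that the policy author never considered. The sketch below is a conceptual model only: real seccomp filters are BPF programs matching on system call numbers, and the tiny lists here are illustrative assumptions rather than a recommended policy.

```python
from enum import Enum
from typing import Callable, Iterable


class Action(Enum):
    ALLOW = "allow"
    DENY = "deny"


Policy = Callable[[str], Action]


def blacklist_policy(denied: Iterable[str]) -> Policy:
    """Deny only the listed calls; anything unknown is allowed (fails open)."""
    denied_set = frozenset(denied)

    def decide(syscall: str) -> Action:
        return Action.DENY if syscall in denied_set else Action.ALLOW

    return decide


def whitelist_policy(allowed: Iterable[str]) -> Policy:
    """Allow only the listed calls; anything unknown is denied (fails closed)."""
    allowed_set = frozenset(allowed)

    def decide(syscall: str) -> Action:
        return Action.ALLOW if syscall in allowed_set else Action.DENY

    return decide


# Deliberately tiny, illustrative lists.
unprivileged = blacklist_policy({"open_by_handle_at"})
privileged = whitelist_policy({"read", "write", "openat", "close"})

for name in ("read", "open_by_handle_at", "some_future_syscall"):
    print(name, unprivileged(name).value, privileged(name).value)
```

The hypothetical some_future_syscall is allowed by the blacklist and denied by the whitelist, which is why the stricter style is the one suggested for privileged containers.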

LSM support is also essential for privileged containers, he said. Access to various files in procfs and sysfs must be blocked or the container can be compromised. The LSMs most frequently used by container managers are SELinux and AppArmor, but other "minor" LSMs (which can stack) are also added into the mix sometimes.

Recent and future features

Brauner then described some security features that had landed in the kernel recently, as well as some that may be coming or are wished for. The ability to defer seccomp filter decisions to user space was added for the 5.0 kernel. It allows user space to inspect the arguments to the system call in a race-free way, so things like path names can be inspected. LXD uses that new feature to allow the distributions in its containers to successfully call mknod() for certain devices (e.g. /dev/null) but not others that are dangerous to have in the container. The old way of handling that was to bind mount the safe devices from the host filesystem.
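The decision a user-space handler has to make for an intercepted mknod() can be sketched as a check on the requested file type and device number. This shows only the policy half; receiving the notification from the kernel and performing the call on the container's behalf are left out. The set of "safe" devices is an assumption based on standard Linux device numbers, not LXD's actual list.

```python
import os
import stat

# Assumed set of harmless character devices, keyed by (major, minor).
SAFE_CHAR_DEVICES = {
    (1, 3): "/dev/null",
    (1, 5): "/dev/zero",
    (1, 7): "/dev/full",
    (1, 8): "/dev/random",
    (1, 9): "/dev/urandom",
    (5, 0): "/dev/tty",
}


def mknod_allowed(mode: int, dev: int) -> bool:
    """Decide whether an intercepted mknod(path, mode, dev) may proceed.

    Only character devices from the safe set are accepted; block devices
    and any other character device are refused.
    """
    if not stat.S_ISCHR(mode):
        return False
    return (os.major(dev), os.minor(dev)) in SAFE_CHAR_DEVICES


# /dev/null (character device 1:3) is fine inside a container.
print(mknod_allowed(stat.S_IFCHR | 0o666, os.makedev(1, 3)))
# /dev/mem (character device 1:1) and a disk (block device 8:0) are not.
print(mknod_allowed(stat.S_IFCHR | 0o640, os.makedev(1, 1)))
print(mknod_allowed(stat.S_IFBLK | 0o660, os.makedev(8, 0)))
```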

Deferring to user space is a "nifty feature", he said, but there are some problems with it. For example, it requires that user space handle the system call itself, which means there are some tricky privilege issues that need to be carefully considered. If the system call should be made, it needs to be done in the context of the container user, with its privileges, not those of the container manager.

All of that also makes the feature a bit annoying to use, he said. It would be better if there were a way to tell the kernel to simply resume the system call. There is also a problem with flags passed to some new system calls, such as clone3(), because they are not passed directly as a parameter but are instead inside a structure whose address is passed. That means the in-kernel seccomp filtering cannot use the flag values, as it is restricted to the parameters passed in registers and cannot chase pointers. He sent an email to the ksummit-discuss mailing list about seccomp and hopes to discuss some of those annoyances and possible solutions to them at the Kernel Summit in September.
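To make the clone3() problem concrete, the sketch below models the fixed-size record a seccomp filter is handed (struct seccomp_data: system call number, architecture, instruction pointer, and six 64-bit arguments). The system call numbers are those of x86-64, and the pointer value is made up for illustration. For clone() the flags sit in the first argument, so a filter can test a bit; for clone3() the first argument is only an address.

```python
import struct

CLONE_NEWUSER = 0x10000000  # flag value from <linux/sched.h>
NR_CLONE = 56               # x86-64 system call number
NR_CLONE3 = 435

# struct seccomp_data { int nr; __u32 arch; __u64 instruction_pointer; __u64 args[6]; }
SECCOMP_DATA = struct.Struct("=iIQ6Q")


def seccomp_data(nr: int, args) -> bytes:
    """Build the record a filter would see (arch and IP zeroed for brevity)."""
    padded = list(args) + [0] * (6 - len(args))
    return SECCOMP_DATA.pack(nr, 0, 0, *padded)


def first_arg_has_newuser(data: bytes) -> bool:
    """All a filter can do: test bits in the raw value of args[0]."""
    fields = SECCOMP_DATA.unpack(data)
    args = fields[3:]
    return bool(args[0] & CLONE_NEWUSER)


# clone(): the flags are the first argument, so the check is meaningful.
via_clone = seccomp_data(NR_CLONE, [CLONE_NEWUSER])
print(first_arg_has_newuser(via_clone))

# clone3(): args[0] is a pointer to struct clone_args and args[1] its size.
# The flags live in user memory that the filter cannot read, so the same bit
# test only inspects whatever the (made-up) address happens to contain.
FAKE_CLONE_ARGS_ADDRESS = 0x7FFCE0001000
via_clone3 = seccomp_data(NR_CLONE3, [FAKE_CLONE_ARGS_ADDRESS, 64])
print(first_arg_has_newuser(via_clone3))
```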

Stacking major LSMs (SELinux, AppArmor, and Smack) is something the LXD project would like to see as well. Being able to run containers with their own LSM on a host with a different major LSM, such as an Android container that uses SELinux on an Ubuntu system (which uses AppArmor) or an Ubuntu container on Fedora (which also uses SELinux), would be useful.

The SafeSetID LSM has been merged for Linux 5.3. It restricts UID/GID transitions to only those allowed by a whitelist. It came from Chrome OS and will be quite useful for privileged containers.
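The whitelist idea can be modeled as a set of permitted (source UID, target UID) pairs. What follows is a simplified sketch of that decision, assuming a policy written as one "source:target" pair per line and assuming that UIDs which never appear as a source are left to the kernel's ordinary permission checks. It is a model for illustration, not the LSM's implementation, and the UIDs are made up.

```python
from typing import Set, Tuple

Rules = Set[Tuple[int, int]]


def parse_policy(text: str) -> Rules:
    """Parse lines of the form 'source_uid:target_uid'."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if line:
            source, target = line.split(":")
            rules.add((int(source), int(target)))
    return rules


def transition_allowed(rules: Rules, source: int, target: int) -> bool:
    """Would this model's whitelist permit source -> target?

    Staying at the same UID is always fine. A UID with no rules at all is
    not constrained by this check; a UID with rules may only move to the
    targets listed for it.
    """
    if source == target:
        return True
    constrained = any(rule_source == source for rule_source, _ in rules)
    if not constrained:
        return True
    return (source, target) in rules


# Hypothetical policy: UID 1000 may become 1001 or 1002 and nothing else.
rules = parse_policy("1000:1001\n1000:1002\n")
print(transition_allowed(rules, 1000, 1001))  # listed transition
print(transition_allowed(rules, 1000, 0))     # not listed, so refused
```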

The new mount API split the functionality of the mount() system call into a bunch of separate calls that will allow some nice features for container managers. For example, it will allow anonymous mounts, which are mounts that are not attached to any path in the filesystem but will still allow access to the files for the process holding the file descriptor for the mount. There may be a way to add the UID/GID shifting feature to the new API to eliminate the need for shiftfs.

Brauner also mentioned the new process ID (PID) file descriptor (pidfd) feature. Pidfds are file descriptors that refer to a process, so that signals can be sent to the right process without fear of hitting the wrong target if the PID gets reused. Pidfds also allow processes to get exit notifications for non-child processes. Pidfds are used by LXD; there may be more features coming for pidfds as well, he said.
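Both uses can be tried from Python 3.9 or later on a kernel with pidfd support (5.3 or newer for pidfd_open() and polling), using os.pidfd_open() and signal.pidfd_send_signal() from the standard library. The helper name and the timeout are assumptions for illustration, and this is not how LXD itself uses pidfds.

```python
import os
import select
import signal


def signal_and_wait(pid: int, sig: int = signal.SIGTERM, timeout: float = 5.0) -> bool:
    """Send sig to a process through a pidfd, then wait for it to exit.

    Once the pidfd is open it keeps referring to that one process, so the
    signal cannot land on an unrelated process that later reuses the PID.
    Returns True if the exit was observed within the timeout.
    """
    pidfd = os.pidfd_open(pid)
    try:
        signal.pidfd_send_signal(pidfd, sig)
        # A pidfd becomes readable when its process exits; this also works
        # for processes that are not children of the caller.
        poller = select.poll()
        poller.register(pidfd, select.POLLIN)
        return bool(poller.poll(int(timeout * 1000)))
    finally:
        os.close(pidfd)
```

A caller would pass the PID of a process it has just looked up, for example signal_and_wait(child.pid) for a subprocess.Popen object named child. The short window between learning a PID and opening the pidfd still exists, so the descriptor should be obtained as early as possible.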

In wrapping up, Graber said that other container managers can learn from what the LXD project has done. He thinks it is imperative that they stop using privileged containers and start using user namespaces, but they do not have to figure everything out on their own. He does not believe that containers can ever really contain unless they separate the privileges inside the container from those outside of it.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for funding to travel to San Diego for LSS-NA.]

