Filesystem mounts in user namespaces

User namespaces are a sandbox within which an otherwise unprivileged user can appear to be root. Given a kernel with user namespaces enabled, any user can create such a namespace and, while running within it, do anything that root is allowed to do. The intent is that system administrators can allow the creation of these namespaces by unprivileged users, secure in the knowledge that any actions carried out within them will, in the context of the wider system, be constrained by the user's global credentials. There are some limits to what can be done within user namespaces, though, with the mounting of filesystems being near the top of the list. A recent attempt to ease that limitation shows just how hard the problem is to solve.

Linux systems place a lot of trust in their underlying filesystems; the data stored there can, in the form of setuid files, file capabilities, and security labels, be used to empower processes with elevated privileges. When user namespaces were first added, finding a way to safely restrict that trust within a namespace was deemed to be an overly hard problem, so, for the most part, the mount() system call is denied to processes running within user namespaces, even if they are privileged in their namespaces.

There are, of course, exceptions. The restriction applies to new filesystem mounts. Operations that apply to already-mounted filesystems (bind mounts, in particular) are allowed. Even with new mounts, there is an exception for filesystems that, via the FS_USERNS_MOUNT flag, identify themselves as being safe for use within user namespaces. The list of such filesystems is short; it includes /proc, sysfs, ramfs, tmpfs, and not much more. Restricting mounts to these filesystems makes user namespaces safer to deploy, but it also restricts their functionality. In particular, containers running within their own user namespace are limited in how they can change their own filesystem layout.
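
The rules just described can be summarized in a small toy model. This is only a sketch of the policy as stated above, not kernel code: the FsType class, the may_mount() function, its parameters, and the bit value given to FS_USERNS_MOUNT are all made up for illustration, and ext4 appears only as an example of a filesystem that does not set the flag.

```python
from dataclasses import dataclass

# Illustrative bit value; this toy model does not depend on the kernel's definition.
FS_USERNS_MOUNT = 0x1


@dataclass(frozen=True)
class FsType:
    """Hypothetical stand-in for a filesystem type and its flags."""
    name: str
    fs_flags: int = 0


def may_mount(fs: FsType, in_root_userns: bool, ns_admin: bool,
              new_mount: bool = True) -> bool:
    """Decide whether a mount request is allowed in this simplified model."""
    if not ns_admin:
        return False   # no privilege in the caller's own namespace
    if in_root_userns:
        return True    # ordinary root: no extra restriction
    if not new_mount:
        return True    # e.g. a bind mount of an already-mounted filesystem
    # New mount inside a user namespace: only flagged filesystems qualify.
    return bool(fs.fs_flags & FS_USERNS_MOUNT)


tmpfs = FsType("tmpfs", FS_USERNS_MOUNT)
ext4 = FsType("ext4")  # not flagged as safe for user namespaces
```

In this model, a namespace-root user can mount tmpfs or bind-mount an existing ext4 tree, but a fresh ext4 mount is refused.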

It would be nice to offer more flexibility with regard to the filesystem layout within user namespaces if it were possible. In an attempt to make that happen, Seth Forshee recently posted a patch set laying the groundwork for a relaxation of the restrictions on filesystem mounts within user namespaces. It is only the first of a series of steps in that direction, but even this step raised a number of difficult questions.

The patch set starts by adding a new field (s_user_ns) to the super_block structure for each mounted filesystem to track the namespace from which it was mounted. It also adds a check to ensure that the mounting process has CAP_SYS_ADMIN inside its current namespace, but does not remove the check on whether the filesystem is marked for mounting within namespaces. An early patch also changes the way the "no devices" flag (set, among other ways, with the nodev option to mount) is handled, making it impossible (most of the time) to open device files on filesystems mounted outside of the root user namespace.
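
As a rough illustration of that bookkeeping, the sketch below records the mounting namespace in a toy superblock and consults it when a device file is opened. Only the s_user_ns name comes from the patch set; the classes, do_mount(), and may_open_device() are hypothetical, and the "most of the time" exceptions mentioned above are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True, eq=False)
class UserNs:
    """Hypothetical user namespace; compared by identity."""
    name: str


INIT_USER_NS = UserNs("init")  # stands in for the root user namespace


@dataclass
class SuperBlock:
    fs_name: str
    s_user_ns: UserNs      # namespace from which the filesystem was mounted
    nodev: bool = False    # the "no devices" flag


def do_mount(fs_name: str, mounter_ns: UserNs, ns_admin: bool,
             nodev: bool = False) -> SuperBlock:
    """Record the mounter's namespace; require privilege in that namespace."""
    if not ns_admin:
        raise PermissionError("CAP_SYS_ADMIN required in the mounting namespace")
    return SuperBlock(fs_name, s_user_ns=mounter_ns, nodev=nodev)


def may_open_device(sb: SuperBlock) -> bool:
    """Device files are usable only on root-namespace mounts without nodev."""
    return (not sb.nodev) and sb.s_user_ns is INIT_USER_NS
```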

The next step is to deal with the handling of setuid and file capabilities. A user could create an arbitrary filesystem and add a setuid-root executable file to it. That filesystem, when mounted within a user namespace, would enable running as root within that namespace. Thus far, that is not a problem; a user capable of mounting the filesystem within that namespace is already privileged, so respecting setuid and file capabilities adds no privilege-escalation risk. If such a file leaks out of the namespace, though, it could be used to escalate privileges in a namespace where the user is not otherwise privileged. To avoid that possibility, the setuid and file-capabilities code is made to check for a filesystem mounted in a foreign namespace and to ignore setuid and file capabilities in that case.
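
A minimal sketch of that check follows. It assumes that "foreign" means the running process's namespace is neither the namespace that mounted the filesystem nor one of its descendants; the patch set's exact test may differ. The UserNs class and both function names are invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True, eq=False)
class UserNs:
    """Hypothetical user namespace with a link to its parent."""
    name: str
    parent: Optional["UserNs"] = None


def in_userns(target: UserNs, ns: Optional[UserNs]) -> bool:
    """True if ns is target itself or one of target's descendants."""
    while ns is not None:
        if ns is target:
            return True
        ns = ns.parent
    return False


def honors_setuid(sb_user_ns: UserNs, current_ns: UserNs,
                  nosuid: bool = False) -> bool:
    """Should setuid bits and file capabilities on this filesystem count?

    sb_user_ns is the namespace that mounted the filesystem; current_ns is
    the namespace of the process executing the file.
    """
    return (not nosuid) and in_userns(sb_user_ns, current_ns)
```

With a container namespace created under the initial namespace, a setuid file on a container-mounted filesystem is honored inside the container but ignored if it is executed from the initial namespace or from an unrelated sibling.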

There was little disagreement over the changes as described thus far. That agreement did not hold into the next part of the patch set, though, which caused the SELinux and Smack security modules to ignore labels attached to files in user-namespace-mounted filesystems. Again, the concern that drove this change was the possibility that a maliciously labeled object could be passed out of the namespace, most likely by passing an open file descriptor to a cooperating process. Simply ignoring security labels makes that particular problem go away.

But, as Casey Schaufler pointed out, that fix comes at the cost of adding other problems. It essentially makes user namespaces incompatible with security modules, which, given the increasing use of security modules, is probably not a desirable outcome. A Smack-based system, in the absence of security labels, will probably not respond in a useful way; the most likely outcome is to make all files readable and none writable. With SELinux, instead, as noted by Stephen Smalley, the result would be to make all files inaccessible.

There was a fair amount of discussion about how severe the problem would be and whether security-module support needs to be present in this feature from the beginning. A solution that seemed to gather some support was to look at the security labels for the backing store of any given filesystem. Most filesystems mounted within a user namespace will be loopback mounts to a local file; that file will have its own security labels. Those labels can be propagated into the mounted filesystem and applied to every file found therein. That makes the access policy for files within the filesystem the same as the policy for the file containing the filesystem itself.
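
The propagation idea is simple enough to sketch in a few lines. The function names and the label strings used here are invented for illustration; nothing in this sketch corresponds to an actual SELinux or Smack interface.

```python
from typing import Dict, Optional


def effective_label(on_disk_label: Optional[str],
                    backing_file_label: str) -> str:
    """Label used for access decisions on one file inside the mounted image.

    The label stored in the image is under the user's control, so it is
    ignored; the label of the file holding the image is used instead.
    """
    return backing_file_label


def label_mount(backing_file_label: str,
                on_disk_labels: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Apply the backing file's label to every path found in the image."""
    return {path: effective_label(label, backing_file_label)
            for path, label in on_disk_labels.items()}
```

Whatever labels the image claims for its files, labeled or not, each file ends up with the label of the backing file.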

There was one other problem which, despite the fact that it is likely to be harder to solve, saw less discussion. Dave Chinner initially raised the issue of using a hostile filesystem to attack the kernel directly. Filesystems have to be able to trust the underlying data they are working with; if that data can be manipulated, there is no end of problems that can result. One possibility is denial-of-service attacks through the creation of hard-link loops in the filesystem metadata, but there are almost certainly more sinister opportunities available as well.

Fixing this problem, Dave said, is not a simple task:

The only way a filesystem would be able to trust what it reads from disk has not been tampered with in a system with untrusted mounts is if it has some kind of cryptographically secure signature in the metadata and the attacker is unable to access the key for that signature. No filesystem we have has that capability and AFAIA there are no plans for any filesystem to implement such tamper detection.

One might be tempted to point out that this problem already exists: in some system configurations, unprivileged users can plug in a USB drive containing a malicious filesystem and mount it now. The difference is that such attacks require physical presence, while an attack via a filesystem image stored in a file can be made from anywhere. That significantly increases the attack surface exposed by the kernel and, according to Dave, is asking for a great deal of trouble.

It is also tempting to think that this problem could be handled by extensive checking of the filesystem at mount time. Even without the delays that would be imposed by, essentially, running fsck every time the filesystem is mounted, though, this idea has problems. Finding every potential attack in a filesystem image will always be a challenging task, even if the filesystem itself is stable. But, in this case, an attacker will have control over the underlying backing store and, thus, will be able to change things at any time. A filesystem that is correct and safe at mount time may no longer be shortly thereafter.
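
The gap between checking and using can be shown with a toy image, here just a mapping from directory names to their child directories. The mount() and has_cycle() functions are hypothetical stand-ins; the point is only that a check that passes at mount time says nothing about a backing store that the attacker can still write to.

```python
from typing import Dict, List

Image = Dict[str, List[str]]  # directory name -> child directory names


def has_cycle(image: Image, root: str = "/") -> bool:
    """Depth-first search for a directory loop reachable from root."""
    on_path = set()   # directories on the current search path
    done = set()      # directories already fully explored

    def visit(node: str) -> bool:
        if node in on_path:
            return True
        if node in done:
            return False
        on_path.add(node)
        for child in image.get(node, []):
            if visit(child):
                return True
        on_path.remove(node)
        done.add(node)
        return False

    return visit(root)


def mount(image: Image) -> Image:
    """Validate once at mount time, then keep using the same backing store."""
    if has_cycle(image):
        raise ValueError("refusing to mount: directory loop")
    return image  # deliberately not a copy: later reads see the live image


image = {"/": ["a"], "a": ["b"], "b": []}
mounted = mount(image)      # passes the mount-time check
image["b"].append("a")      # the attacker edits the backing store afterward
loop_now = has_cycle(mounted)
```

After the edit, the mounted filesystem contains exactly the kind of loop that the mount-time check was meant to exclude.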

There were no proposals for solutions to the hostile-filesystem problem. In truth, it is not at all clear that the kernel can be made safe against attacks via user-manipulated filesystem images or that such a fix, if it were possible, would not bring an unacceptable performance penalty with it. But, in the absence of some sort of assurance that they can be made safe, unprivileged filesystem mounts are unlikely to gain acceptance; even if the feature gets into the kernel, distributions would be likely to disable it.

In the end, the only safe way to allow unprivileged mounts may be the filesystem in user space (FUSE) mechanism, which isolates the filesystem code into a separate user-space process. FUSE is not the ideal solution for those who would like to add native unprivileged filesystem mount support to user namespaces, but, unless the security concerns can be addressed, it may, in the end, be the only viable solution.

