Anatomy of a user namespaces vulnerability

This article brought to you by LWN subscribers Subscribers to LWN.net made this article — and everything that surrounds it — possible. If you appreciate our content, please buy a subscription and make the next set of articles possible.

An exploit posted on March 13 revealed a rather easily exploitable security vulnerability (CVE 2013-1858) in the implementation of user namespaces. That exploit enables an unprivileged user to escalate to full root privileges. Although a fix was quickly provided, it is nevertheless instructive to look in some detail at the vulnerability, both to better understand the nature of this kind of exploit and also to briefly consider how this vulnerability came to appear inside the user namespaces implementation. General background on user namespaces can be found in parts 5 and part 6 of our recent series of articles on namespaces.

Overview

The vulnerability was discovered by Sebastian Krahmer, who posted proof-of-concept code demonstrating the exploit on the oss-security mailing list. The exploit is based on the fact that Linux 3.8 allows the following combination of flags when calling clone() (and also unshare() and setns() ):

clone(... CLONE_NEWUSER | CLONE_FS, ...);

CLONE_NEWUSER says that the new child should be in a new user namespace, and with the completion of the user namespaces implementation in Linux 3.8, that flag can now be employed by unprivileged processes. Within the new namespace, the child has a full set of capabilities, although it has no capabilities in the parent namespace.

The CLONE_FS flag says that the caller of clone() and the resulting child should share certain filesystem-related attributes—root directory, current working directory, and file mode creation mask (umask). The attribute of particular interest here is the root directory, which a privileged process can change using the chroot() system call.

It is the mismatch between the scope of these two flags that creates the window for the exploit. On the one hand, CLONE_FS causes the parent and child process to share the root directory attribute. On the other hand, CLONE_NEWUSER puts the two processes into separate user namespaces, and gives the child full capabilities in the new user namespace. Those capabilities include CAP_SYSCHROOT , which gives a process the ability to call chroot() ; the sharing provided by CLONE_FS means that the child can change the root directory of a process in another user namespace.

In broad strokes, the exploit achieves escalation to root privileges by executing any set-user-ID-root program that is present on the system in a chroot environment which is engineered to execute attacker-controlled code. That code runs with user ID 0 and allows the exploit to fire up a shell with root privileges. The exploit as demonstrated is accomplished by subverting the dynamic linking mechanism, although other lines of attack based on the same foundation are also possible.

The vulnerability scenario

The first part of understanding the exploit requires some understanding of the operation of the dynamic linker. Most executables (including most set-user-ID root programs) on a Linux system employ shared libraries and dynamic linking. At run time, the dynamic linker loads the required shared libraries in preparation for running the program. The pathname of the dynamic linker is embedded in the executable file's ELF headers, and is listed among the other dependencies of a dynamically linked executable when we use the ldd command (here executed on an x86-64 system):

$ ldd /bin/ls | grep ld-linux /lib64/ld-linux-x86-64.so.2 (0x00000035b1800000)

There are a few important points to note about the dynamic linker. First, it is run before the application program. Second, it is run under whatever credentials would be accorded to the application program; thus, for example, if a set-user-ID-root program is being executed, the dynamic linker will run with an effective user ID of root.

Executable files are normally protected so that they can't be modified by users other than the file owner; this prevents, for example, unprivileged users from modifying the dynamic linker path embedded inside a set-user-ID-root binary. For similar reasons, an unprivileged user can't change the contents of the dynamic linker binary.

However, suppose for a moment that an unprivileged user could construct a chroot tree containing (via a hard link) the set-user-ID-root binary and an executable of the user's own choosing at /lib64/ld-linux-x86-64.so.2 . Running the set-user-ID-root binary would then cause control first to be passed to the user's own code, which would be running as root. The aim of the exploit is to bring about the situation shown in the following diagram, where pathnames are shown linked to various binary files:

The key point in the above diagram is that two pathnames link to the fusermount binary (a set-user-ID-root program used for mounting and unmounting FUSE filesystems). If a process outside the chroot environment executes the /bin/fusermount binary, then the real dynamic linker will be invoked to load the binary's shared libraries. On the other hand, if a process inside the chroot environment executes the other link to the binary ( /suid-root ), then the kernel will load the ELF interpreter pointed to by the link /lib64/ld-linux-x86-64.so.2 inside the chroot environment. That link points to code supplied by an attacker, and will be run with root privileges.

How does the Linux 3.8 user namespaces implementation help with this attack? First, an unprivileged user can create a new user namespace in which they gain full privileges, including the ability to create a chroot environment using chroot() . Second, the differing scope of CLONE_NEWUSER and CLONE_FS described above means that the privileged process inside a new user namespace can construct a chroot environment that applies to a process outside the user namespace. If that process can in turn then be made to execute a set-user-ID binary inside the chroot environment, then the attacker code will be run as root.

A three-phase attack

Although Sebastian's program is quite short, there are many details involved that make the exploit somewhat challenging to understand; furthermore, the program is written with the goal of accomplishing the exploit, rather than educating the reader on how the exploit is carried out. Therefore, we'll provide an equivalent program, userns_exploit.c , that performs the same attack—this program is structured in a more understandable way and is instrumented with output statements that enable the user to see what is going on. We won't walk though the code of the program, but it is well commented and should be easy to follow using the explanations in this article.

The attack code involves the creation of three processes, which we'll label "parent", "child", and "grandchild". The attack is conducted in three phases; in each phase, a separate instance of the attacker code is executed. This concept can at first be difficult to grasp when reading the code. It's easiest to think of the userns_exploit program as simply offering itself in three flavors, with the choice being determined by command-line arguments and the effective user ID of the process.

The following diagram shows the exploit in overview:

In the above diagram, the vertical dashed lines indicate points where a process is blocked waiting for another process to complete some action.

In the first phase of the exploit, the program starts by discovering its own pathname. This is done by reading the contents of the /proc/self/exe symbolic link. The program needs to know its own pathname for two reasons: so it can create a link to itself inside the chroot tree and so it can re-execute itself later.

The program then creates two processes, labeled "parent" and "child" in the above diagram. The parent's task is simple. It will loop, using the stat() system call to check whether the program pathname discovered in the previous step is owned by root and has the set-user-ID permission bit enabled. This causes the parent to wait until the other processes have finished their tasks.

In the meantime, the "child" populates the directory tree that will be used as the chroot environment. The goal is to create the set-up shown in the following diagram:

The difference from the first diagram is that we now see that it is the userns_exploit program that will be used as the fake dynamic loader inside the chroot environment. Furthermore, that binary is also linked outside the chroot environment, and the exploit design takes advantage of that fact.

Having created the chroot tree shown above, the child then employs clone(CLONE_NEWUSER|CLONE_FS) to create a new process—the grandchild. The grandchild has a full set of capabilities, which allows it to call chroot() to place itself into the chroot tree. Because the grandchild and the child share the root directory attribute, the child is now also placed in the chroot environment.

Its small task complete, the grandchild now terminates. At that point, the child, which has been waiting on the grandchild, now resumes. As its next step, the child executes the program at the path /suid-root . This is in fact a link to the fusermount binary. Because the child is in the initial user namespace and the fusermount binary is set-user-ID-root, the child gains root privileges.

However, before the fusermount binary is loaded, the kernel first loads its ELF interpreter, the file at the path /lib64/ld-linux-x86-64.so.2 . That, as it happens, is actually the userns_exploit program. Thus, the userns_exploit program is now executed for a second time (and the fusermount program is never executed).

The second phase of the exploit has now begun. This instance of the userns_exploit program recognizes that it has an effective user ID of 0. However, the only files it can access are those inside the chroot environment. But that is sufficient. The child can now change the ownership of the file /lib64/ld-linux-x86-64.so.2 and turn on the file's set-user-ID permission bit. That pathname is, of course, a link to the userns_exploit binary. At this point, the child's work is now complete, and it terminates.

All of this time, the parent process has been sitting in the background waiting for the userns_exploit binary to become a set-user-ID-root program. That, of course, is what the child has just accomplished. So, at this point, the parent now executes the userns_exploit program outside the chroot environment. On this execution, the program is supplied with a command-line argument.

The third and final phase of the exploit has now started. The userns_exploit program determines that it has an effective user ID of 0 and notes that it has a command-line argument. That latter fact distinguishes this case from the second execution of the userns_exploit and is a signal that this time the program is being executed outside the chroot environment. All that the program now needs to do is execute a shell; that shell will provide the user with full root privileges on the system.

Further requirements for a successful exploit

There are a few other steps that are necessary to successfully accomplish the exploit. The userns_exploit program must be statically linked. This is necessary so that, when executed as the dynamic linker inside the chroot environment, the userns_exploit program does not itself require a dynamic linker.

In addition, the value in the /proc/sys/fs/protected_hardlinks file must zero. The protected_hardlinks file was a feature that was added in Linux 3.6 specifically to prevent the types of exploit discussed in this article. If this file has the value one, then only the owner of a file can create hard links to it. On a vanilla kernel, protected_hardlinks unfortunately has the default value zero, although some distributions provide kernels that change this default.

In the process of exploring this vulnerability, your editor discovered that set-user-ID binaries built as hardened, position-independent executables (PIE) cannot be used for this particular attack. (Many of the set-user-ID-root binaries on his Fedora system were hardened in this manner.) While PIE hardening thwarts this particular line of attack, the chroot() technique described here can still be used to exploit a set-user-ID-root binary in other ways. For example, the binary can be placed in a suitably constructed chroot environment that contains the genuine dynamic linker but a compromised libc.

Finally, user namespaces must of course be enabled on the system where this exploit is to be tested, and the kernel version needs to be precisely 3.8. Earlier kernel versions did not allow unprivileged users to create user namespaces, and later kernels will fix this bug, as described below. The exploit is unlikely to be possible with distributor kernels: because the Linux 3.8 kernel does not support the use of user namespaces with various filesystems, including NFS and XFS, distributors are unlikely to enable user namespaces in the kernels that they ship.

The fix

Once the problem was reported, Eric Biederman considered two possible solutions. The more complex solution is to create an association from a process's fs_struct , the kernel data structure that records the process's root directory, to a user namespace, and use that association to set limitations around the use of chroot() in scenarios such as the one described in this article. The alternative is the simple and obviously safe solution: disallow the combination of CLONE_NEWUSER and CLONE_FS in the clone() system call, make CLONE_NEWUSER automatically imply CLONE_FS in the unshare() system call, and disallow the use of setns() to change a process's user namespace if the process is sharing CLONE_FS -related attributes with another process.

Subsequently, Eric concluded that the complex solution seemed to be unnecessary and would add a small overhead to every call to fork() . He thus opted for the simple solution: the Linux 3.9 kernel (and the 3.8.3 stable kernel) will disallow the combination of CLONE_NEWUSER and CLONE_FS .

User namespaces and security

As we noted in an earlier article, Eric Biederman has put a lot of work into trying to ensure that unprivileged can create user namespaces without causing security vulnerabilities. Nevertheless, a significant exploit was found soon after the release of the first kernel version that allowed unprivileged processes to create user namespaces. Another user namespace vulnerability that potentially allowed unprivileged users to load arbitrary kernel modules was also reported and fixed earlier this month. In addition, during the discussion of the CLONE_NEWUSER|CLONE_FS issue, Andy Lutomirski has hinted that there may be another user namespaces vulnerability to be fixed.

Why is it that several security vulnerabilities have sprung from the user namespaces implementation? The fundamental problem seems to be that user namespaces and their interactions with other parts of the kernel are rather complex—probably too complex for the few kernel developers with a close interest to consider all of the possible security implications. In addition, by making new functionality available to unprivileged users, user namespaces expand the attack surface of the kernel. Thus, it seems that as user namespaces come to be more widely deployed, other security bugs such as these are likely to be found. One hopes that they'll be found and fixed by the kernel developers and white hat security experts, rather than found and exploited by black hat attackers.

Updated on 22 February 2013 to clarify and correct some minor details of the "simple and safe" solution under the heading, "The fix".

