This article originally appeared on lwn.net.

Suppose you have a program running on your system that you don’t quite trust. Maybe it’s a program submitted by a student to an automated grading system. Or maybe it’s a QEMU device model running in a Xen control domain (“domain 0” or “dom0”), and you want to make sure that even if an attacker from a rogue virtual machine manages to take over the QEMU process, they can’t do any further harm. There are many ways you might want to restrict its ability to do mischief, but one thing in particular you probably want is to be able to reliably kill the process once you think it should be done. This turns out to be quite a bit trickier than you’d think.

Avoiding kill with fork

So here’s our puzzle. Suppose we have a process that we’ve run with its own individual user ID (UID), which we want to kill. But the code in the process is currently controlled by an attacker who doesn’t want it to be killed.

We obviously know the process ID (PID) of the initial process we forked, so we could just use the kill() system call:

kill(pid, 9);

So how can an attacker avoid this? It turns out to be pretty simple:

while(1) { if (fork()) _exit(0); }

This simple snippet of code will repeatedly call fork(). As you probably know, fork() returns twice: once in the existing parent process (returning the PID of the newly-created child), and once in a newly-created child process (returning 0). In the loop above, the parent will always call _exit(), and the child will call fork() again. The result is that the program races through the process ID space as fast as the kernel will let it.
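To see the PID move from generation to generation without leaving an unbounded loop running, here is a minimal sketch of a bounded variant. The helper name hop_generations and the pipe used to report the final PID are assumptions made for illustration; they are not part of the original snippet.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the fork-and-exit trick a fixed number of times.
 * Each generation forks, the parent half exits, and the child half carries
 * on. The last generation reports its PID through a pipe.
 * Returns 0 on success and -1 on error.
 */
int hop_generations(int generations, pid_t *first, pid_t *last)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (child == 0) {
        close(fds[0]);
        for (int i = 0; i < generations; i++) {
            pid_t p = fork();
            if (p < 0)
                _exit(1);       /* fork failed: give up */
            if (p > 0)
                _exit(0);       /* parent half: disappear */
            /* child half: continue under a new PID */
        }
        pid_t me = getpid();
        ssize_t n = write(fds[1], &me, sizeof(me));
        _exit(n == (ssize_t)sizeof(me) ? 0 : 1);
    }

    /* Original caller: wait for the final generation's report. */
    close(fds[1]);
    pid_t final_pid = 0;
    ssize_t n = read(fds[0], &final_pid, sizeof(final_pid));
    close(fds[0]);
    waitpid(child, NULL, 0);    /* reap the first generation */
    if (n != (ssize_t)sizeof(final_pid))
        return -1;

    *first = child;
    *last = final_pid;
    return 0;
}
```

The PID the caller got back from fork() is already stale by the time the last generation runs, which is exactly what makes a plain kill(pid, 9) miss. Unlike the one-line loop, this version stops after a fixed number of hops, so it shows the mechanism but not the evasion.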

I encourage you to run the above code snippet (preferably in a virtual machine), and see what it looks like. It’s not even very noticeable. Running top shows a system load of about 50% (in my virtual machine anyway), but there’s not obviously any particular process contributing to that load; everything is still responsive and functional. If you didn’t know about it, you might never notice it was there.

Now try killing it. You can run killall to try to kill the process by name, but it will frequently fail with "no process killed"; even when it succeeds, it often turns out that you’ve killed the parent process after the fork() but before the _exit(), so the rogue forking process is still going strong. Even determining whether you’ve managed to kill the process or not is a challenge.

The basic problem here is a race condition. What killall does is:

1. Read the list of processes, looking for one with the specified name

2. Call kill(pid, sig) on each one found

Between step 1 and each instance of step 2, the kernel tasklist lock is released (since it has to return from the system call), giving the rogue process a chance to fork. Indeed, it has many chances; since the first step takes a non-negligible amount of time, by the time you manage to find the rogue process, it’s likely already forked, and perhaps even exited.

It’s true, if we ran killall 1000 times, the rogue process would very likely end up dead; and if we ran ps 1000 times, and found no trace of the process, we might be pretty sure that it was gone. On the other hand, that assumes that the "race" is fair, and that the attacker hasn’t discovered some way of making sure that the race ends up going their way. It would be best if we didn’t rely on these sorts of probabilistic calculations to clean things up.

Better mousetraps?

One thing to do, of course, would be to try to prevent the process from executing fork() in the first place. This could be done on Linux using the seccomp() call; but it’s Linux-specific. (Xen, for example, wants to be able to support NetBSD and FreeBSD control domains, so it can’t rely on this for correctness.) Another would be to use the setrlimit() system call to set RLIMIT_NPROC to 0. This should, in theory, prevent the process from calling fork() (since by definition there would already be one process with its user ID running).
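A minimal sketch of the RLIMIT_NPROC approach might look like the following. The helper name limit_nproc is hypothetical, and the sketch assumes it is called in the untrusted process after the switch to its dedicated UID and before any untrusted code runs.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: lower both the soft and the hard RLIMIT_NPROC of the
 * calling process to `limit`. Lowering the hard limit means the process
 * cannot raise it again without privilege.
 * Returns 0 on success, -1 on error (errno set by setrlimit()).
 */
int limit_nproc(rlim_t limit)
{
    struct rlimit rl;

    rl.rlim_cur = limit;    /* soft limit */
    rl.rlim_max = limit;    /* hard limit */
    return setrlimit(RLIMIT_NPROC, &rl);
}
```

The order of operations matters: on Linux the limit is counted against the real user ID and is not enforced for processes holding CAP_SYS_ADMIN or CAP_SYS_RESOURCE, so setting it while still running as root does not by itself stop fork().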

But RLIMIT_NPROC has had its own set of issues in the past. Setting it to 0 would also break a lot of perfectly legitimate code. Surely there must be a way to kill a process in a way that it can’t evade, without relying on being able to take away fork(). Looking more closely at the kill() man page, it turns out that the pid argument can be interpreted in four possible ways:

pid > 0: PID of a single process to kill

pid < -1: the negative of the ID of a process group (pgid) to kill

pid == 0: Kill every process in my current process group

pid == -1: Kill every process that I’m allowed to kill

At first glance it seems like killing by pgid might do what we want. To run our untrusted process, set the pgid and the user ID; to kill it, we call kill(-pgid, 9).
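As a rough sketch of that idea, assuming a cooperative child: the helper names spawn_in_new_group and kill_group are hypothetical, and the UID switch and the exec of the untrusted program are left out to keep the example short.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: start a child in its own process group.
 * Returns the new pgid (equal to the child's PID), or -1 on error.
 */
pid_t spawn_in_new_group(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        setpgid(0, 0);          /* child: become leader of a new group */
        for (;;)
            pause();            /* stand-in for the untrusted program */
    }

    /* Set the group from the parent side too, so that we do not depend on
     * the child having been scheduled yet. */
    setpgid(child, child);
    if (getpgid(child) != child) {
        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
        return -1;
    }
    return child;
}

/* Hypothetical helper: send SIGKILL to every member of process group pgid. */
int kill_group(pid_t pgid)
{
    if (pgid <= 1) {            /* refuse 0, 1 and negative values */
        errno = EINVAL;
        return -1;
    }
    return kill(-pgid, SIGKILL);
}
```

This works as long as every descendant stays in the group it was given.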

Unfortunately, unlike the user ID, the pgid can be changed by unprivileged processes. So our attacker could simply run something like the following to avoid being killed in the same way:

while(1) { if (fork()) _exit(0); setpgid(0, 0); }

In this case, the child process changes its pgid to match its PID as soon as it forks, making kill(-pgid) as racy as kill(pid).

A better mousetrap: kill -1

What about the last one: "kill every process I’m allowed to kill"? Well, we obviously don’t want to run that as root unless we want to nuke the entire system; we want to limit "all processes I’m allowed to kill" to the particular user ID we’ve given to the rogue process.

In general, processes are allowed to kill other processes with their own UID; so what about something like the following?

setuid(uid); kill(-1, 9);

(Note that for simplicity error handling is omitted in these examples; but when playing with kill() you should certainly make sure that you did switch your UID.)

The kill() system call, when called with -1, will loop over the entire task list, attempting to send the signal to each process except the one making the system call. The tasklist lock is held for the entire loop, so the rogue process cannot complete a fork(); since the UIDs match, it will be killed.

Done, right? Not quite. If we simply call setuid(), then not only can we kill the rogue process, but the rogue process can also kill us:

while(1) { if (fork()) _exit(0); kill(-1, 9); setpgid(0, 0); }

If the rogue process manages to get its own kill(-1) in after we’ve called setuid() but before we’ve called kill() ourselves, we will be the ones to disappear. So to successfully kill the rogue process, we still need to win a race — something we’d rather not rely on.

A better mousetrap: exploiting asymmetry

If we want to reliably kill the other process without putting ourselves at risk of being killed, we must find an asymmetry that allows the “reaper” process to do so. If we look carefully at the kill() man page, we find:

For a process to have permission to send a signal, it must either be privileged (under Linux: have the CAP_KILL capability in the user namespace of the target process), or the real or effective user ID of the sending process must equal the real or saved set-user-ID of the target process.

So there is an asymmetry. Each process has an effective UID (euid), real UID (ruid), and saved UID (suid). For process A to kill process B, A’s ruid or euid must match one of B’s ruid or suid.

When we started our target process, we set all of its UIDs to a specific value (target_uid). Can we construct a <euid, ruid, suid> tuple for our "reaper" process to use that will allow it to kill the rogue process, and no other processes, but not be able to be killed by the rogue process?

It turns out that we can. If we create a new reaper_uid, and set the reaper process’s <euid, ruid, suid> to <target_uid, reaper_uid, X> (where X can be anything as long as it’s not target_uid), then:

The reaper process can kill the target process, since its effective UID is equal to the target process’s real UID

But the target process can’t kill the reaper, since its real and effective UIDs differ from the real and saved UIDs of the reaper process.
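The two points above can be checked against a small model of the permission rule quoted from the man page. This is a sketch of the unprivileged case only: the struct cred_ids type and the may_signal function are hypothetical names, and the UID values in the note below are arbitrary.

```c
#include <stdbool.h>
#include <sys/types.h>

/* The three user IDs carried by every process. */
struct cred_ids {
    uid_t ruid;     /* real UID */
    uid_t euid;     /* effective UID */
    uid_t suid;     /* saved set-user-ID */
};

/*
 * Model of the unprivileged rule: the sender's real or effective UID must
 * equal the target's real or saved UID. The target's effective UID and the
 * sender's saved UID play no part in the check.
 */
bool may_signal(const struct cred_ids *sender, const struct cred_ids *target)
{
    return sender->ruid == target->ruid ||
           sender->ruid == target->suid ||
           sender->euid == target->ruid ||
           sender->euid == target->suid;
}
```

With a target whose three UIDs are all 2000 and a reaper with ruid 3000, euid 2000 and suid 3000, may_signal(reaper, target) is true while may_signal(target, reaper) is false.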

So the following code will safely kill all processes of target_uid in a race-free way:

setresuid(reaper_uid, target_uid, reaper_uid); kill(-1, 9);

Note that this reaper_uid must have no other running processes when we call kill(), or they will be killed as well. In practice this means either setting aside a single reaper_uid (and using a lock to make sure only one reaper process runs at a time) or having a separate reaper_uid per target_uid.
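With the error handling put back in, a reaper might be sketched as follows. This is not the proof-of-concept code mentioned below; the function name reap_uid, the argument checks, and the decision to treat ESRCH from kill() as "nothing left to kill" are assumptions. It also assumes the caller has the privilege to change UIDs, and it forks so that the caller keeps that privilege.

```c
#define _GNU_SOURCE             /* glibc: declares setresuid()/getresuid() */
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: kill every process running as target_uid.
 * Returns 0 on success, -1 on error.
 */
int reap_uid(uid_t target_uid, uid_t reaper_uid)
{
    /* Refuse combinations that would make kill(-1) hit the wrong set. */
    if (target_uid == 0 || reaper_uid == 0 || target_uid == reaper_uid) {
        errno = EINVAL;
        return -1;
    }

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        uid_t r, e, s;

        /* Argument order is real, effective, saved. */
        if (setresuid(reaper_uid, target_uid, reaper_uid) < 0)
            _exit(1);

        /* Make sure the switch really happened before calling kill(-1). */
        if (getresuid(&r, &e, &s) < 0 ||
            r != reaper_uid || e != target_uid || s != reaper_uid)
            _exit(1);

        if (kill(-1, SIGKILL) < 0 && errno != ESRCH)
            _exit(2);
        _exit(0);
    }

    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        errno = EPERM;          /* the reaper child could not do its job */
        return -1;
    }
    return 0;
}
```

The check after setresuid() is the important part: if the UID switch silently failed while the child was still privileged, kill(-1, SIGKILL) would reach far more than the target.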

Proof-of-concept code for both the rogue process and the reaper process can be found in this GitHub repository.

No POSIX-compliant mousetraps?

The setresuid() system call is implemented by both Linux and FreeBSD. It is not currently implemented by NetBSD, but implementing it seems like a pretty straightforward exercise (and certainly a lot simpler than implementing seccomp()). NetBSD does implement RLIMIT_NPROC, which should also be helpful in preventing our process from executing fork().

On the other hand, neither setresuid() nor RLIMIT_NPROC is in the current POSIX specification. It seems impossible to get a process to have the required tuple using only the current POSIX interfaces (namely setuid() and setreuid()), without recourse to setresuid() or Linux’s CAP_SETUID; the assumption seems to be that euid must always be set to either ruid or suid. So there would seem to be no way within that specification to safely prevent a potentially rogue process from using fork() to evade kill().

Acknowledgments