vmsplice(): the making of a local root exploit

Please consider subscribing to LWN Subscriptions are the lifeblood of LWN.net. If you appreciate this content and would like to see more of it, your subscription will help to ensure that LWN continues to thrive. Please visit this page to join up and keep LWN on the net.

vmsplice()

As this is being written, distributors are working quickly to ship kernel updates fixing the local root vulnerabilities in thesystem call. Unlike a number of other recent vulnerabilities which have required special situations (such as the presence of specific hardware) to exploit, these vulnerabilities are trivially exploited and the code to do so is circulating on the net. Your editor found himself wondering how such a wide hole could find its way into the core kernel code, so he set himself the task of figuring out just what was going on - a task which took rather longer than he had expected.

The splice() system call, remember, is a mechanism for creating data flow plumbing within the kernel. It can be used to join two file descriptors; the kernel will then read data from one of those descriptors and write it to the other in the most efficient way possible. So one can write a trivial file copy program which opens the source and destination files, then splices the two together. The vmsplice() variant connects a file descriptor (which must be a pipe) to a region of user memory; it is in this system call that the problems came to be.

The first step in understanding this vulnerability is that, in fact, it is three separate bugs. When the word of this problem first came out, it was thought to only affect 2.6.23 and 2.6.24 kernels. Changes to the vmsplice() code had caused the omission of a couple of important permissions checks. In particular, if the application had requested that vmsplice() move the contents of a pipe into a range of memory, the kernel didn't check whether that application had the right to write to that memory. So the exploit could simply write a code snippet of its choice into a pipe, then ask the kernel to copy it into a piece of kernel memory. Think of it as a quick-and-easy rootkit installation mechanism.

If the application is, instead, splicing a memory range into a pipe, the kernel must, first, read in one or more iovec structures describing that memory range. The 2.6.23 vmsplice() changes omitted a check on whether the purported iovec structures were in readable memory. This looks more like an information disclosure vulnerability than anything else - though, as we will see, it can be hard to tell sometimes.

These two vulnerabilities (CVE-2008-0009 and CVE-2008-0010) were patched in the 2.6.23.15 and 2.6.24.1 kernel updates, released on February 8.

On February 10, Niki Denev pointed out that the kernel appeared to be still vulnerable after the fix. In fact, the vulnerability was the result of a different problem - and it is a much worse one, in that kernels all the way back to 2.6.17 are affected. At this point, a large proportion of running Linux systems are vulnerable. This one has been fixed in the 2.6.22.18, 2.6.23.16, and 2.6.24.2 kernels, also released on the 10th. At this point, with luck, all of these bugs have been firmly stomped - though, now, we need to see a lot of distributor updates.

The problem, once again, is in the memory-to-pipe implementation. The function get_iovec_page_array() is charged with finding a set of struct page pointers corresponding to the array of iovec structures passed in by the calling application. Those pointers are stored in this array:

struct page *pages[PIPE_BUFFERS];

Where PIPE_BUFFERS happens to be 16. In order to avoid overflowing this array, get_iovec_page_array() does the following check:

npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT; if (npages > PIPE_BUFFERS - buffers) npages = PIPE_BUFFERS - buffers;

Here, off is the offset into the first page of the memory to be transferred, len is the length passed in by the application, and buffers is the current index into the pages array.

Now, if we turn our attention to the exploit code for a moment, we see it setting up a number of memory areas with mmap() ; some of that setup is not necessary for the exploit to work, as it turns out. At the end, the code does this (edited slightly):

iov.iov_base = map_addr; iov.iov_len = ULONG_MAX; vmsplice(pi[1], &iov, 1, 0);

The map_addr address points to one of the areas created with mmap() which, crucially, is significantly more than PIPE_BUFFERS pages long. And the length is passed through as the largest possible unsigned long value.

Now let's go back to fs/splice.c , where the vmsplice() implementation lives. We note that, prior to the fix, the kernel did not check whether the memory area pointed to by the iovec structure was readable by the calling process. Once again, this looks like an information disclosure vulnerability - the process could cause any bit of kernel memory to be written to the pipe, from which it could be read. But the exploit code is, in fact, passing in a valid pointer - it's just the length which is clearly absurd.

Looking back at the code which calculates npages , we see something interesting:

npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT; if (npages > PIPE_BUFFERS - buffers) npages = PIPE_BUFFERS - buffers;

Since len will be ULONG_MAX when the exploit runs, the addition will cause an integer overflow - with the effect that npages is calculated to be zero. Which, one would think, would cause no pages to be examined at all. Except that there is an unfortunate interaction with another part of the kernel.

Once npages has been calculated, the next line of code looks like this:

error = get_user_pages(current, current->mm, (unsigned long) base, npages, 0, 0, &pages[buffers], NULL);

get_user_pages() is the core memory management function used to pin a set of user-space pages into memory and locate their struct page pointers. While the npages variable passed as an argument is an unsigned quantity, the prototype for get_user_pages() declares it as a simple int called len . And, to complete the evil, this function processes pages in a do {} while(); loop which ends thusly:

len--; } while (len && start < vma->vm_end);

So, if get_user_pages() is passed with a len argument of zero, it will pass through the mapping loop once, decrement len to a negative number, then continue faulting in pages until it hits an address which lacks a valid mapping. At that point it will stop and return. But, by then, it may have stored far more entries into the pages array than the caller had allocated space for.

The practical result in this case is that get_user_pages() faults in (and stores struct page pointers for) the entire region mapped by the exploit code. That region (by design) has more than PIPE_BUFFERS pages - in fact, it has three times that many, so 48 pointers get stored into a 16-pointer array. And this turns the failure to read-verify the source array into a buffer overflow vulnerability within the kernel. Once that is in place, it is a relatively straightforward exercise for any suitably 31337 hacker to cause the kernel to jump into the code of his or her choice. Game over. (Update: as a linux-kernel reader pointed out, the story is a little more complicated still at this point; this is an unusual sort of buffer overflow attack).

The fix which was applied simply checks the address range that the application is trying to splice into the pipe. Since a range of length ULONG_MAX is unlikely to be valid, the vulnerability is closed - as are any potential information disclosure problems.