Dan Rosenberg recently released a privilege escalation exploit for Linux, based on three different kernel vulnerabilities I reported recently. This post is about CVE-2010-4258, the most interesting of them and, as Dan writes, the reason he wrote the exploit in the first place. In it, I’m going to give a brief tour of the various kernel features that collided to make this bug possible, and explain how they combine to turn an otherwise-boring oops into privilege escalation.

When a user application passes a pointer to the kernel, and the kernel wants to read from or write to that pointer, the kernel needs to check that a buggy or malicious userspace app hasn’t passed an “evil” pointer.

Because the kernel and userspace run in the same address space, the most important check is simply that the pointer points into the “userspace” part of the address space. User applications are protected by page table permissions from writing into kernel memory, but the kernel isn’t, and so must explicitly check that any pointers given to it by a user don’t point into the kernel region.

The address space is laid out such that user applications get the bottom portion, and the kernel gets the top, so this check is a simple comparison against that boundary. The kernel function that performs this check is called access_ok, although there are various other functions that do the same check, implicitly or otherwise.
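To make the "simple comparison" concrete, here is a minimal userspace sketch of such a range check. It is not the kernel's implementation: the name model_access_ok and the boundary value (the classic 3G/1G split on 32-bit x86) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed boundary for illustration: the 3G/1G split on 32-bit x86. */
#define MODEL_USER_LIMIT UINT64_C(0xC0000000)

/* Returns true if [addr, addr + size) lies entirely below the boundary.
 * The comparison is arranged so that addr + size is never computed,
 * which avoids integer overflow on hostile inputs. */
static bool model_access_ok(uint64_t addr, uint64_t size)
{
    if (size > MODEL_USER_LIMIT)
        return false;
    return addr <= MODEL_USER_LIMIT - size;
}

/* A few illustrative cases. */
static void model_access_ok_examples(void)
{
    assert(model_access_ok(UINT64_C(0x08048000), 4096)); /* ordinary user address */
    assert(!model_access_ok(UINT64_C(0xC0100000), 4));   /* points into the kernel half */
    assert(!model_access_ok(UINT64_C(0xBFFFFFFE), 4));   /* straddles the boundary */
}
```

The point of the sketch is only that the check depends on nothing but the pointer, the length, and one boundary value.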

Occasionally, however, the kernel finds it useful to change the rules for what access_ok will allow. set_fs() is an internal Linux function that is used to override the definition of the user/kernel split, for the current process.

After a set_fs(KERNEL_DS), no checking is performed that user pointers point to userspace: access_ok will always return true. set_fs(KERNEL_DS) is mainly used to enable the kernel to wrap functions that expect user pointers, by passing them pointers into the kernel address space. A typical use reads something like this:

    old_fs = get_fs();
    set_fs(KERNEL_DS);
    vfs_readv(file, kernel_buffer, len, &pos);
    set_fs(old_fs);

vfs_readv expects a user-provided pointer, so without the set_fs(), the access_ok() inside vfs_readv() would fail on our kernel buffer. The set_fs() call temporarily disables that check.
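The save, override, restore pattern can be modelled in ordinary userspace C. This is a sketch under stated assumptions, not kernel code: every model_ name is hypothetical, the boundary value is assumed, and model_read merely stands in for a function like vfs_readv that validates its pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_USER_DS   UINT64_C(0xC0000000) /* assumed user/kernel boundary */
#define MODEL_KERNEL_DS UINT64_MAX           /* effectively "no limit" */

/* In the kernel this limit is per-thread state; one global is enough here. */
static uint64_t model_addr_limit = MODEL_USER_DS;

static uint64_t model_get_fs(void) { return model_addr_limit; }
static void model_set_fs(uint64_t fs) { model_addr_limit = fs; }

static bool model_access_ok(uint64_t addr, uint64_t size)
{
    if (size > model_addr_limit)
        return false;
    return addr <= model_addr_limit - size;
}

/* Stand-in for a function that expects a user pointer. */
static int model_read(uint64_t dst, uint64_t len)
{
    if (!model_access_ok(dst, len))
        return -EFAULT;
    return 0; /* a real implementation would copy data here */
}

/* Kernel-internal caller that wants to pass a kernel buffer. */
static int model_kernel_read(uint64_t kernel_dst, uint64_t len)
{
    uint64_t old_fs = model_get_fs();
    int ret;

    model_set_fs(MODEL_KERNEL_DS);   /* lift the limit */
    ret = model_read(kernel_dst, len);
    model_set_fs(old_fs);            /* restore it on the normal path */
    assert(model_get_fs() == old_fs);
    return ret;
}
```

In this model, model_read on a "kernel" address fails with -EFAULT, while model_kernel_read on the same address succeeds. The restore only happens if control actually returns to the wrapper.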

Kernel oopses

When the kernel oopses, perhaps because of a NULL pointer dereference in kernelspace, or because of a call to the BUG() macro to indicate an assertion failure, the kernel attempts to clean up, and then kills the current process by calling do_exit().

When the kernel does so, it’s still running in the same process context it was in before the oops occurred, including any set_fs() override, if applicable. That means do_exit will get called with access_ok disabled, which is not something anyone expected when they wrote the individual pieces of this system.

As it turns out, do_exit contains a write to a user-controlled address that expects access_ok to be working properly!

clear_child_tid is a feature where, on thread exit, the kernel can be made to write a zero into a specified address in that thread’s address space, in order to notify other threads of that exit.

This is implemented by simply storing a pointer to the to-be-zeroed address inside struct task_struct (which represents a single thread or process), and, on exit, mm_release, called from do_exit, does:

put_user(0, tsk->clear_child_tid);

This is normally safe, because put_user checks that its second argument falls into the “userspace” segment before doing a write. But, if we are running with get_fs() == KERNEL_DS, it will happily accept any address at all, even one pointing into kernel space.
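The stored address is entirely under the thread's control. On Linux, one way to set it is the set_tid_address system call, which records the pointer and returns the caller's thread ID. The sketch below assumes Linux with glibc's syscall() wrapper; the helper and variable names are made up for illustration, and it registers a harmless userspace address.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Word the kernel is asked to clear when this thread exits. It has static
 * storage so the address stays valid for the lifetime of the process. */
static int exit_notify_word = 1;

/* Registers addr as this thread's clear_child_tid pointer.
 * Returns the caller's thread ID, as documented for set_tid_address(2). */
static long register_clear_child_tid(int *addr)
{
    assert(addr != NULL);
    return syscall(SYS_set_tid_address, addr);
}
```

Calling register_clear_child_tid(&exit_notify_word) is all it takes. Nothing at registration time stops a process from passing a kernel address instead; the check is deferred to the put_user at exit.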

So, if we find any kernel BUG() or NULL dereference, or other page fault, that we can trigger after a set_fs(KERNEL_DS), we can trick the kernel into writing a zero to a user-controlled address in kernel memory!
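The whole interaction fits in a toy model. The sketch below is a userspace simulation, not kernel code: the 16-word "address space", the struct layout, and all model_ names are assumptions for illustration. Indices 0 to 7 play the role of user memory and 8 to 15 the role of kernel memory.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MEM_WORDS  16 /* whole toy address space */
#define MODEL_USER_WORDS 8  /* indices 0..7 are "user", 8..15 are "kernel" */

struct model_task {
    size_t   addr_limit;      /* first index that put_user refuses */
    size_t   clear_child_tid; /* chosen by the (possibly hostile) task */
    uint32_t mem[MODEL_MEM_WORDS];
};

static void model_task_init(struct model_task *t, size_t clear_child_tid)
{
    t->addr_limit = MODEL_USER_WORDS;
    t->clear_child_tid = clear_child_tid;
    for (size_t i = 0; i < MODEL_MEM_WORDS; i++)
        t->mem[i] = 0xdeadbeefu;
}

/* Writes val at idx only if idx is below the task's current limit. */
static int model_put_user(struct model_task *t, uint32_t val, size_t idx)
{
    assert(t->addr_limit <= MODEL_MEM_WORDS);
    if (idx >= t->addr_limit)
        return -EFAULT;
    t->mem[idx] = val;
    return 0;
}

/* Exit path: zero the registered word, trusting put_user to validate it. */
static void model_do_exit(struct model_task *t)
{
    (void)model_put_user(t, 0, t->clear_child_tid);
}

/* A handler that runs with the limit lifted. If it hits a bug, the oops
 * path calls do_exit directly and the limit is never restored. */
static void model_handler(struct model_task *t, bool hits_bug)
{
    size_t old_limit = t->addr_limit;

    t->addr_limit = MODEL_MEM_WORDS; /* analogue of set_fs(KERNEL_DS) */
    if (hits_bug) {
        model_do_exit(t);
        return;
    }
    t->addr_limit = old_limit;       /* analogue of set_fs(old_fs) */
}
```

With clear_child_tid set to index 12, a normal model_do_exit leaves mem[12] untouched, while model_handler(&t, true) zeroes it. Each piece is correct in isolation; the write only escapes when the oops happens inside the override window.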

An obvious question at this point is: How much of the kernel can an attacker cause to run with get_fs() == KERNEL_DS ?

There are a number of small special cases. For example, the binary sysctl compatibility code works by calling the normal /proc/ write handlers from kernelspace, under set_fs(). A handful of compat-mode (32 on 64) syscalls work similarly.

By far the biggest source I’ve found, however, is the splice() system call. The splice() system call is a relatively recent addition to Linux, and allows for zero-copy transfer of pages between a pipe and another file descriptor.

As of 2.6.31, an attempt to splice() to or from an fd that has no special support for zero-copy splice falls back on doing an ordinary read(), write(), or sendmsg() on the fd … from the kernel, using set_fs() in order to pass in kernel buffers.

What that means is that by using splice(), an attacker can call the bulk of the code in most obscure filesystems and socket types (which tend not to have explicit splice() support) with a segment override in place. Conveniently for an attacker, that is also exactly a description of where the bulk of the random security bugs tend to be.
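From userspace, reaching a socket's send path through splice() takes only a few calls. The sketch below assumes Linux and uses a harmless AF_UNIX socketpair as the target; the helper name is hypothetical. It shows only the shape of the call sequence and makes no claim about which in-kernel path a given kernel takes for this socket type.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for splice() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes msg into a pipe, splices it from the pipe into one end of a
 * socketpair, and reads it back from the other end into out.
 * Returns the number of bytes received, or -1 on failure. */
static ssize_t splice_into_socket(const char *msg, size_t len,
                                  char *out, size_t out_len)
{
    int pipefd[2], sock[2];
    ssize_t got = -1;

    assert(len <= out_len);
    if (pipe(pipefd) < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sock) < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    /* Fill the pipe, then ask the kernel to move the data to the socket. */
    if (write(pipefd[1], msg, len) == (ssize_t)len &&
        splice(pipefd[0], NULL, sock[0], NULL, len, 0) == (ssize_t)len)
        got = read(sock[1], out, out_len);

    close(pipefd[0]);
    close(pipefd[1]);
    close(sock[0]);
    close(sock[1]);
    return got;
}
```

The data sent to the socket comes from pipe pages held by the kernel rather than from a user buffer, which is why the send handler ends up being driven from kernel context.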

This is also exactly the technique Dan’s exploit uses. He uses CVE-2010-3849, an otherwise boring NULL pointer dereference I reported in the Econet network protocol. His exploit code does a splice() to an econet socket, causing the econet_sendmsg handler to get called under set_fs(KERNEL_DS). When it oopses, do_exit is called, and he gets a user-controlled write into kernel memory. Everything else is just details.