On May 16, 2011, Fenghua Yu submitted a series of patches to the upstream Linux kernel implementing support for a new Intel CPU feature: Supervisor Mode Execution Protection (SMEP). This feature is enabled by setting a bit in the cr4 register, after which the CPU will generate a fault whenever ring0 attempts to execute code from a page marked with the user bit.

First, some background on why this feature is useful. Like most mainstream operating systems, the vanilla Linux kernel does not leverage x86 segmentation, instead defining flat segment descriptors with limits encompassing the entire 4 GB address space. Additionally, each process has the kernel’s page table entries replicated, resulting in the kernel address space being mapped in the upper 1 GB of every user process. Both of these decisions were made for performance reasons: reloading segment selectors at every trap and kernel-to-user (or vice versa) copy operation introduces a non-negligible (but not necessarily unacceptable) performance hit, and having completely separate user and kernel address spaces would necessitate a TLB flush on every trap, which is even more expensive.

The result of this is that the kernel is free to incorrectly access data residing in userspace, as well as execute code in the user region. In addition to enabling the exploitation of many bugs that rely on the kernel incorrectly using user data, this allows kernel exploits to simply map a suitable payload in userspace and divert kernel execution to that payload.

The PaX project solves this problem in a general way with a feature called PAX_UDEREF. When this feature is enabled, PaX leverages segmentation to isolate user and kernel addresses, such that a fault will be generated when the kernel incorrectly accesses user data or code. Unfortunately, due to the performance hit associated with reloading segment registers and the fact that this touches mission-critical code, it’s unlikely that this solution would be accepted into the upstream Linux kernel.

Update: I’m told by the PaX team that recent benchmarks have shown there is almost no measurable performance impact for UDEREF on i386, as reloading segment registers has become much cheaper since the initial benchmarks of this feature (on the order of 16 cycles). However, it’s still unlikely that the upstream kernel would find the feature suitable, since using segmentation would be a significant departure from current kernel design principles.

Enter SMEP. Now, the mainline Linux kernel can take advantage of a subset of this protection at essentially no performance cost, as the functionality is presumably implemented in hardware in a way that’s similar to existing CPL checks. With SMEP enabled, it’s no longer possible to map exploit payloads in userland, as the CPU will trigger a fault if it attempts to execute those user pages in kernel mode. Note that this is still only a subset of what UDEREF protects against, as it does nothing to prevent the kernel from incorrectly accessing user *data* as opposed to code. But it’s certainly a start.
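To make the distinction between code and data concrete, here is a minimal model of the check in Python. It is only a sketch: the function name and arguments are my own invention, the model ignores everything else the MMU checks (present bit, NX, write protection), and it assumes the architectural bit positions (SMEP is bit 20 of cr4, and the user bit is bit 2 of a page table entry).

```python
CR4_SMEP = 1 << 20  # SMEP enable bit in cr4
PTE_USER = 1 << 2   # user/supervisor bit in an x86 page table entry


def smep_faults(cr4: int, cpl: int, pte_flags: int, is_fetch: bool) -> bool:
    """Simplified model: does SMEP alone cause this access to fault?

    Hypothetical helper for illustration; it does not model any other
    paging check the CPU performs.
    """
    smep_enabled = bool(cr4 & CR4_SMEP)
    supervisor = cpl < 3                    # rings 0-2 count as supervisor mode
    user_page = bool(pte_flags & PTE_USER)
    # SMEP only applies to instruction fetches, never to data accesses.
    return smep_enabled and supervisor and user_page and is_fetch


# Kernel jumping to a payload mapped in userland: faults.
print(smep_faults(CR4_SMEP, 0, PTE_USER, is_fetch=True))
# Kernel reading user data through a bad pointer: SMEP does not care.
print(smep_faults(CR4_SMEP, 0, PTE_USER, is_fetch=False))
```

The second call is the gap that UDEREF covers and SMEP does not.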

It may take a while for the hardware to catch up: it doesn’t seem that any existing CPUs actually implement SMEP, and we all know how long adoption of hardware NX has taken (and continues to take). However, once SMEP is widespread, what are kernel exploit writers going to do? Is this the end of Linux kernel exploits?
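If you want to check whether a given machine advertises the feature, a kernel carrying the SMEP patches should list a smep flag in /proc/cpuinfo. The following is a rough sketch of that check; the helper names are mine, and it assumes the usual x86 "flags" line format.

```python
def cpu_has_flag(cpuinfo_text: str, flag: str = "smep") -> bool:
    """Return True if the first 'flags' line in cpuinfo_text lists the flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return flag in value.split()
    return False


def host_has_smep(path: str = "/proc/cpuinfo") -> bool:
    """Check the running host; returns False if the file cannot be read."""
    try:
        with open(path) as f:
            return cpu_has_flag(f.read())
    except OSError:
        return False


if __name__ == "__main__":
    print("SMEP advertised:", host_has_smep())
```

Note that this only tells you the CPU reports the feature, not that the running kernel has actually set the bit in cr4.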

Of course not. While SMEP is definitely a very good security feature and is a step in the right direction, no single feature is going to “win security”. Let’s go into a few ways to bypass this protection (I’m sure there are more).

RWX Kernel Pages

The first problem is that the kernel’s page permissions aren’t yet in a completely sane state. By compiling a kernel with CONFIG_X86_PTDUMP (or using Kees Cook’s modularized version of this feature), we can take a look at the permissions of kernel pages via the /sys/kernel/debug/kernel_page_tables debugfs file. In particular, we’re interested in pages that are both writable and executable:

    # grep RW /sys/kernel/debug/kernel_page_tables | grep -v NX
    0xc009b000-0xc009f000 16K RW GLB x pte
    0xc00a0000-0xc0100000 384K RW GLB x pte
    0xc1400000-0xc1580000 1536K RW GLB x pte
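For auditing more than one machine, the same filter is easy to script. Below is a small sketch that applies the logic of the grep pipeline (has RW, lacks NX) to the text of the dump. The function names are hypothetical, and the parser assumes the whitespace-separated layout shown in the output above.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Convert a size such as '384K' into bytes."""
    return int(text[:-1]) * UNITS[text[-1]]


def find_wx_regions(dump: str):
    """Return (start, end, size_in_bytes) for every writable, executable range."""
    regions = []
    for line in dump.splitlines():
        fields = line.split()
        # Skip section headers and anything that is not an address range.
        if len(fields) < 3 or not fields[0].startswith("0x") or "-" not in fields[0]:
            continue
        start_text, end_text = fields[0].split("-", 1)
        flags = fields[2:]
        if "RW" in flags and "NX" not in flags:
            regions.append((int(start_text, 16), int(end_text, 16), parse_size(fields[1])))
    return regions


if __name__ == "__main__":
    # Needs root, debugfs mounted, and a kernel exposing the page table dump.
    with open("/sys/kernel/debug/kernel_page_tables") as f:
        for start, end, size in find_wx_regions(f.read()):
            print(f"{start:#010x}-{end:#010x} {size // 1024}K is writable and executable")
```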

The first two regions are especially useful, since they will appear at static addresses on many modern 32-bit kernels. The first region is reserved for the BIOS, and the second is the so-called “I/O hole” used for DMA. While it’s probably best to avoid scribbling all over the I/O hole, as it’s commonly used at runtime, there’s no reason that writing into the BIOS region would cause any stability issues after booting is complete.

So, if we have a kernel write primitive, all we have to do is write our payload into the BIOS region and divert execution there. If the target kernel leaks symbol locations via /proc/kallsyms or similar, then diverting execution is a simple matter of resolving the address of a suitable function pointer, overwriting it, and triggering it. Otherwise, it’s trivial to issue a sidt instruction to retrieve the address of the IDT and set up a trap handler pointing into the payload. SMEP will have nothing to complain about, since we never cause the kernel to attempt to execute from user pages.
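For reference, on 32-bit x86 the sidt instruction stores six bytes: a 16-bit limit followed by a 32-bit base address, and each gate in the table occupies eight bytes. The sketch below only decodes such a value once you have it; Python cannot issue sidt itself, so the raw bytes are assumed to come from elsewhere, and the helper names and sample address are made up for illustration.

```python
import struct

GATE_SIZE = 8  # bytes per gate descriptor in 32-bit protected mode


def decode_idtr(raw: bytes):
    """Split the 6 bytes stored by a 32-bit sidt into (base, limit)."""
    limit, base = struct.unpack("<HI", raw[:6])
    return base, limit


def gate_address(idt_base: int, vector: int) -> int:
    """Address of the gate descriptor for a given interrupt vector."""
    return idt_base + vector * GATE_SIZE


# Hypothetical sidt result: 256 gates (limit 0x7ff) at a made-up base address.
raw = struct.pack("<HI", 0x7FF, 0xC1234000)
base, limit = decode_idtr(raw)
print(hex(base), hex(limit), hex(gate_address(base, 0x80)))
```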

Stack Metadata

A second way to bypass this protection is to leverage the addr_limit variable, which resides in the thread_info structure at the base of each process’ kernel stack.

As described in Jon Oberheide’s and my presentation on Stackjacking, it’s possible to exploit the leakage of uninitialized stack data, a common bug, in order to infer the address of the base of a process’ kernel stack. I developed a library called libkstack to do so generically. Once this address is inferred, a kernel write vulnerability can simply write ULONG_MAX (0xffffffff) into the addr_limit variable, which is at a reliable offset from the kernel stack base. At this point, arbitrary kernel memory can be read from and written to, since all kernel copy functions will accept kernel pointers as user arguments. For example, you can do a write(pipefd, kernel_addr, len) to read the data from kernel_addr into a pipe, to be retrieved later. Once you have an arbitrary kernel read and write, the current process’ cred structure can be found and written into, escalating privileges to root. Again, this attack does not require executing any user code with kernel privileges, so SMEP cannot stop it.
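To see why a single overwritten word is enough, here is a simplified model of the two pieces involved. It is a sketch under stated assumptions, not kernel code: I assume an 8 KB kernel stack (THREAD_SIZE) and a default limit equal to the start of the kernel mapping on a 3 GB/1 GB split, and the access_ok below is a stand-in for the real range check rather than a copy of it.

```python
USER_LIMIT = 0xC0000000    # assumed default addr_limit on a 3 GB/1 GB 32-bit kernel
KERNEL_LIMIT = 0xFFFFFFFF  # ULONG_MAX on 32-bit
THREAD_SIZE = 8192         # assumed kernel stack size


def stack_base(leaked_stack_ptr: int) -> int:
    """Round a leaked kernel stack pointer down to the base of its stack."""
    return leaked_stack_ptr & ~(THREAD_SIZE - 1)


def access_ok(addr: int, size: int, addr_limit: int) -> bool:
    """Simplified stand-in for the check done by the kernel copy functions."""
    return size <= addr_limit and addr <= addr_limit - size


kernel_addr = 0xC1000000  # made-up kernel address

print(hex(stack_base(0xC1235F3C)))                  # base of the stack holding the leaked pointer
print(access_ok(kernel_addr, 16, USER_LIMIT))       # rejected with the default limit
print(access_ok(kernel_addr, 16, KERNEL_LIMIT))     # accepted once addr_limit is overwritten
```

Once the limit check passes, the copy functions treat the kernel pointer like any other user buffer, which is what makes the write() trick work.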

Update: it’s worth noting that grsecurity protects against this type of attack by removing the thread_info structure from the kernel stack.

Return-Oriented Programming

In the event that kernel symbols can be resolved on the target kernel (especially common on distro kernels) and the attacker has a stack overflow or another vulnerability that allows pivoting the stack pointer into an area of attacker-controlled data, kernel ROP is possible. Fortunately, the setup_smep function, which has code to both enable and disable the SMEP bit in the cr4 register, is marked __init, so it’s likely to have been cleaned up by the kernel after initialization and is not a good candidate for ROP. However, more complex ROP payloads are certainly possible, as I hope to demonstrate later this year. For now, I’ll leave this up to your imagination. 😉

What Needs to be Fixed

Some progress on removing useful sources of information leakage has been made with the kptr_restrict and dmesg_restrict sysctls. Continued work on plugging similar leaks should improve the usefulness of these features. However, it’s still trivial to resolve the locations of kernel code and data on distribution kernels, since they are shipped as binaries that are identical across all machines with the same kernel version. This is demonstrated perfectly by Jon Oberheide’s ksymhunter project.
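Checking whether a host has these sysctls turned on is straightforward. Here is a minimal sketch that reads both values from /proc/sys/kernel; the function names are my own, and a value of None simply means the knob could not be read (for example, because the kernel predates it).

```python
import os


def read_sysctl(name: str, root: str = "/proc/sys/kernel"):
    """Return the integer value of a sysctl, or None if it cannot be read."""
    try:
        with open(os.path.join(root, name)) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def leak_hardening_report(root: str = "/proc/sys/kernel") -> dict:
    """Collect the two information-leak sysctls mentioned in the text."""
    return {name: read_sysctl(name, root) for name in ("kptr_restrict", "dmesg_restrict")}


if __name__ == "__main__":
    for name, value in leak_hardening_report().items():
        state = "unavailable" if value is None else ("off" if value == 0 else f"on ({value})")
        print(f"{name}: {state}")
```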

The solution I’m currently working on is implementing randomization of the address at which the kernel is decompressed at boot. This way, even if an attacker can download a kernel image identical to the one running on the target host, he won’t know where kernel data and code reside in the running kernel, assuming an absence of information leakage. In order to be effective, this solution requires relocating the IDT; otherwise, it will reside at the location pointed to by the idt_table symbol, and an sidt instruction would allow an attacker to calculate the offsets of every other kernel symbol relative to the address of the IDT. This has its own challenges, but I’m making progress and hope to submit a working version in the coming weeks. This will also have the useful side effect of marking the IDT read-only, which will prevent it from being a generic target for kernel write vulnerabilities.
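The arithmetic behind the IDT problem is worth spelling out. If the IDT moves along with the rest of the image, one leaked address gives away the whole layout. The sketch below uses made-up 32-bit addresses and hypothetical helper names; the only real assumption is that every symbol is shifted by the same amount.

```python
MASK32 = 0xFFFFFFFF


def kernel_slide(leaked_idt_base: int, static_idt_table: int) -> int:
    """Offset between where the image was linked and where it was loaded."""
    return (leaked_idt_base - static_idt_table) & MASK32


def rebase(static_symbols: dict, slide: int) -> dict:
    """Apply the same slide to every symbol from the downloaded kernel image."""
    return {name: (addr + slide) & MASK32 for name, addr in static_symbols.items()}


# Made-up link-time addresses, as they might be read from an identical kernel image.
static_symbols = {"_text": 0xC1000000, "idt_table": 0xC1580000}

# Made-up value obtained from sidt on the randomized target.
slide = kernel_slide(0xC2780000, static_symbols["idt_table"])
runtime = rebase(static_symbols, slide)
print(hex(slide), {name: hex(addr) for name, addr in runtime.items()})
```

Relocating the IDT to an address that is unrelated to the image breaks the first step, since the sidt result no longer corresponds to idt_table plus the slide.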

Next, more work needs to be done on making sure page protections in the kernel are sane. Most importantly, RWX mappings should be removed and function pointer tables should be enforced read-only. Fortunately, efforts are underway in this area as well, with help from Kees Cook.

Hopefully, the combination of reduced information leakage (through plugging leaks and randomizing the kernel image), stronger page protections in the kernel, and SMEP will significantly raise the bar for exploiting the Linux kernel.