Extending the use of RO and NX

Pages of memory that are managed by the kernel are governed by access control flags that are somewhat analogous to the permissions which are applied to files. Those flags govern whether the page can be written to and whether its contents can be executed. Both attributes are useful to restrict what can happen to those pages in the presence of programming errors or security attacks. A pair of patches that were merged in the current merge window will further extend the usage of these flags for the x86 architecture.

The page access flags, unlike file permissions, are enforced by the memory management hardware. The flags of interest for these patches are "write" and "execute", both of which imply "read" access, so they are often specified as follows: RO+X (read-only and execute) or RW+NX (read-write and no-execute). By restricting the usage of these pages, the scope of security flaws can be reduced because, for example, a buffer overflow in an NX page will not be directly useful for code execution.
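As a rough illustration of what these flags look like to the hardware, the sketch below decodes and sets the two relevant bits of an x86 page-table entry in the 64-bit/PAE format, where bit 1 is the writable bit and bit 63 is the no-execute bit. The helper names (pte_writable(), pte_executable(), pte_set_perms(), pte_describe()) are invented for this example and are not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-table entry bits (64-bit / PAE entry format). */
#define PTE_PRESENT (UINT64_C(1) << 0)   /* page is mapped */
#define PTE_RW      (UINT64_C(1) << 1)   /* set: writable; clear: read-only */
#define PTE_NX      (UINT64_C(1) << 63)  /* set: instruction fetch forbidden */

/* Hypothetical helpers, named for this sketch only. */
static bool pte_writable(uint64_t pte)   { return (pte & PTE_RW) != 0; }
static bool pte_executable(uint64_t pte) { return (pte & PTE_NX) == 0; }

/* Return a copy of `pte` with its write/execute permissions replaced. */
static uint64_t pte_set_perms(uint64_t pte, bool writable, bool executable)
{
    assert(pte & PTE_PRESENT);      /* permissions only matter for mapped pages */
    pte &= ~(PTE_RW | PTE_NX);      /* start from RO+X */
    if (writable)
        pte |= PTE_RW;
    if (!executable)
        pte |= PTE_NX;
    return pte;
}

/* Describe a present entry using the article's shorthand. */
static const char *pte_describe(uint64_t pte)
{
    if (pte_writable(pte))
        return pte_executable(pte) ? "RW+X" : "RW+NX";
    return pte_executable(pte) ? "RO+X" : "RO+NX";
}
```

Kernel text would want the RO+X combination and data the RW+NX or RO+NX combinations; RW+X is the state these patches try to eliminate wherever possible.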

The memory that the kernel uses to hold its read-only data (i.e. the .rodata segment) could be marked read-only starting with 2.6.16 in early 2006, depending on the setting of CONFIG_DEBUG_RODATA. In 2.6.25, the kernel .rodata segment was additionally marked NX (i.e. no-execute), but only for the x86_64 architecture. A patch that was originally created for 2.6.30 (for both the 32- and 64-bit x86 architectures) expanded the use of NX to all kernel data pages, including the read-write sections for initialized data and BSS.

That patch was created by Siarhei Liakh and Xuxian Jiang but had fallen by the wayside after causing some boot crashes on one of Ingo Molnar's test systems. When Kees Cook brought up the idea of doing better page access protection of the kernel's memory, Molnar remembered that Matthieu Castet had "dusted off those patches and submitted two of them", back in August. After a few iterations, Molnar pulled them into the -tip tree, and Linus Torvalds pulled that for the mainline in the current 2.6.38 merge window.

The revised patch itself is fairly straightforward. If CONFIG_DEBUG_RODATA is set, various sections of the kernel (.text and .rodata) are page-aligned for both their start and end addresses. The NX bit is set for all pages from the end of the .text (i.e. code) section to the _end address that marks the end of the kernel's data section.
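The range calculation described above can be sketched in user-space C. This is only an illustration under stated assumptions: a 4KB page size, made-up addresses, and invented names (DEMO_PAGE_SIZE, align_up_to_page(), nx_range()); the patch itself works on the kernel's real section symbols and page-attribute code.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE UINT64_C(4096)    /* assumption: 4KB pages */

/* Round an address up to the next page boundary (no-op if already aligned). */
static uint64_t align_up_to_page(uint64_t addr)
{
    return (addr + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}

struct page_range {
    uint64_t start;     /* first page to mark NX */
    uint64_t npages;    /* number of pages to mark NX */
};

/*
 * Hypothetical helper: given the end of the code section and the end of
 * the kernel image, return the page range that should become NX.
 */
static struct page_range nx_range(uint64_t text_end, uint64_t kernel_end)
{
    struct page_range r;

    assert(text_end <= kernel_end);
    r.start = align_up_to_page(text_end);
    r.npages = (align_up_to_page(kernel_end) - r.start) / DEMO_PAGE_SIZE;
    return r;
}
```

With the made-up values nx_range(0xc1400000, 0xc1a12345), the range starts at 0xc1400000 and covers 0x613 pages. Because the sections are page-aligned at both ends, no page in that range is shared with code, which is what makes it safe to mark every one of them NX.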

There were two other pieces of the puzzle addressed in the patch, the first of which was presumably the cause of the boot crashes that Molnar had with the earlier patch. Some older systems that use PCI BIOS require that some pages in the 640K-1M region be executable. There are also some ISA mappings that require read-write access to that region. Rather than try to work all of that out, and potentially run afoul of buggy hardware, the patch just sets pages in that region to be RW+X on systems where PCI BIOS is used. The second change simply modifies free_init_pages() to turn on NX for any pages that are freed that way, so that those pages have to be explicitly allowed to store executable code when they are reused.

A related patch adds read-only and no-execute flags to the pages used by kernel modules. It came from the same developers, and seems to have been dropped from -tip along with the NX patch. As with the other patch, it was Castet who gave it the final push needed to get it into the mainline.

The patch splits the module_core and module_init regions into three parts: code, read-only data, and read-write data. Each of those parts is page-aligned and the page access permissions are set just before load_module() returns. The code parts are set to RO+X, while the data parts get NX and either RO or RW depending on the type of data. These changes are all governed by the setting of CONFIG_DEBUG_SET_MODULE_RONX.
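A minimal sketch of that three-way split follows, assuming 4KB pages and byte sizes chosen by the caller. The types and functions here (struct region_part, struct region_layout, layout_region()) are invented for illustration and do not mirror the kernel's own module structures.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE ((size_t)4096)    /* assumption: 4KB pages */

enum perm { PERM_RO_X, PERM_RO_NX, PERM_RW_NX };

struct region_part {
    size_t offset;      /* page-aligned offset from the start of the region */
    size_t size;        /* bytes actually used by this part */
    enum perm perm;     /* permissions applied to the part's pages */
};

struct region_layout {
    struct region_part text, rodata, data;
    size_t total;       /* page-aligned size of the whole region */
    size_t padding;     /* bytes lost to alignment */
};

static size_t round_up_to_page(size_t n)
{
    return (n + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}

/* Lay out code, read-only data, and read-write data on separate pages. */
static struct region_layout layout_region(size_t text_size, size_t ro_size,
                                          size_t rw_size)
{
    struct region_layout l;

    l.text   = (struct region_part){ 0, text_size, PERM_RO_X };
    l.rodata = (struct region_part){ round_up_to_page(text_size),
                                     ro_size, PERM_RO_NX };
    l.data   = (struct region_part){ round_up_to_page(l.rodata.offset + ro_size),
                                     rw_size, PERM_RW_NX };
    l.total  = round_up_to_page(l.data.offset + rw_size);

    assert(l.total >= text_size + ro_size + rw_size);
    l.padding = l.total - (text_size + ro_size + rw_size);
    return l;
}
```

For example, layout_region(5000, 100, 300) places the read-only data at offset 8192 and the read-write data at offset 12288, for a total of 16384 bytes. Since no page holds more than one kind of content, each page can be given exactly one of the three permission sets.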

Beyond setting the page access control flags at module load time, the kernel must also reset those flags to RW+NX when the module is unloaded. In addition, the module_init region is freed after initialization is completed and its pages need to be put back to RW+NX. There is one further wrinkle: Ftrace needs to be able to modify the code in modules to enable tracepoints, so the patch provides a means for all module text pages to be set RW while Ftrace is making those changes, and then to set them back to RO afterward.
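The "make writable, patch, make read-only again" pattern that Ftrace needs has a close user-space analogue in mprotect(). The sketch below is that analogue and not the kernel code: the function names are invented, and the execute permission is left out so that the example also runs on systems that refuse writable-and-executable mappings.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE             /* for MAP_ANONYMOUS on strict compilers */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one page, fill it with `fill`, then drop write access.
 * Returns NULL on failure; stores the mapping length in *len_out. */
static unsigned char *alloc_readonly_page(unsigned char fill, size_t *len_out)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;

    size_t len = (size_t)page;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    memset(p, fill, len);
    if (mprotect(p, len, PROT_READ) != 0) {
        munmap(p, len);
        return NULL;
    }
    *len_out = len;
    return p;
}

/* Temporarily make a read-only region writable, copy in `len` bytes at
 * `offset`, then restore read-only protection. Returns 0 on success. */
static int patch_readonly_region(void *region, size_t region_len,
                                 size_t offset, const void *bytes, size_t len)
{
    assert(region != NULL && bytes != NULL);
    if (offset > region_len || len > region_len - offset)
        return -1;                  /* patch would run past the region */

    if (mprotect(region, region_len, PROT_READ | PROT_WRITE) != 0)
        return -1;
    memcpy((unsigned char *)region + offset, bytes, len);
    return mprotect(region, region_len, PROT_READ);
}
```

The window during which the region is writable is kept as short as possible, which is the same reasoning behind setting the module text back to RO as soon as Ftrace has finished its changes.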

Marking the kernel module pages as RO and/or NX is important not only because it is consistent with how the rest of the kernel pages are handled, but also because it makes other kernel protection efforts actually work for modules. For example, there has been an effort to declare structures of function pointers as const, so that exploits cannot change the pointers for their own nefarious purposes, but that only works if the .rodata pages are actually marked RO.
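For readers unfamiliar with the pattern, a const structure of function pointers looks roughly like the following. All of the names here (struct demo_ops, demo_default_ops, call_open()) are made up for the example and are not taken from any kernel subsystem.

```c
#include <assert.h>
#include <stddef.h>

/* A table of operations, in the style of the kernel's "ops" structures. */
struct demo_ops {
    int (*open)(int id);
    int (*release)(int id);
};

static int demo_open(int id)    { return id + 1; }
static int demo_release(int id) { return id - 1; }

/*
 * Declared const, so the compiler rejects any direct assignment such as
 * "demo_default_ops.open = other_function;" and the toolchain will
 * typically place the object in a read-only data section.
 */
static const struct demo_ops demo_default_ops = {
    .open    = demo_open,
    .release = demo_release,
};

/* Callers only ever see a pointer to const. */
static int call_open(const struct demo_ops *ops, int id)
{
    assert(ops != NULL && ops->open != NULL);
    return ops->open(id);
}
```

The const qualifier only stops well-behaved code at compile time. An exploit with an arbitrary-write primitive ignores it, so the hardware has to enforce read-only access on the pages holding the table for the protection to mean anything.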

The main cost of these patches is a small amount of memory lost to padding when the various sections are page-aligned. Since that cost is probably not significant for any but the most resource-constrained embedded systems, it would make sense for CONFIG_DEBUG_RODATA and CONFIG_DEBUG_SET_MODULE_RONX to be turned on for most distributions, or to default to "on", though that is generally frowned upon by Torvalds and others.

It is unfortunate that these patches have been around for a while but never quite made the jump into the mainline. No person or group is currently shepherding core kernel security patches along, though Cook and Dan Rosenberg have recently been making an effort to push these kinds of changes. Cook's query helped resurrect both of these patches; they might have languished far longer without that interest.

It is also worth noting that most or all of the protections embodied in these patches have long been available in the grsecurity/PaX kernels. While no wholesale import of the features from those kernels is ever going to happen, piecemeal patches that implement "sane" (at least in Torvalds's eyes) features can be adopted. That should lead to better kernel security, which is something that is certainly worth shooting for.

