Post-init read-only memory

Did you know...? LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

At the 2015 Kernel Summit , the assembled developers discussed the idea of incorporating more security-hardening patches into the kernel. As part of that effort, it was agreed that taking another look at the out-of-tree grsecurity patches made sense. The first fruit from this work would appear to be the post-init read-only memory patch set from Kees Cook. This work has been received well, but it also highlights some of the difficulties involved with hardening a general-purpose kernel.

The key to a successful exploit is often convincing the kernel to write to an unintended location. See, for example, this recent exploit, which uses a driver bug to overwrite a portion of the vDSO area; that, in turn, enables an attacker to run arbitrary code in kernel mode. One way to defend against such attacks is to minimize, to the greatest extent possible, the memory that the kernel is allowed to write to. A number of techniques, from simply marking data read-only to supervisor-mode access prevention, can be deployed toward that end. There is one class of data, identified by the grsecurity developers, that current techniques overlook, however.

When the kernel boots, it sets up a vast array of data structures describing the hardware it runs on and much more. In many cases, those data structures will never be changed again but, since they are resident in writable memory, they can still be changed by an errant write operation. The post-init read-only memory patch set, as posted by Kees, allows these data structures to be marked with a special __read_only annotation. That will cause them to be placed into a separate ELF section (" .data..read_only "). Once the kernel has finished the initialization process, all data found in that section will be marked read-only, never to be changed again. At that point, exploits like the vDSO overwrite linked above will no longer work.

This change seems like an obvious win: unchanging data is marked read-only, blocking known exploits and, perhaps, minimizing the impact of simple bugs as well. As an added bonus, read-only data will be kept together, leading to better cache behavior. It would appear to be an obvious candidate for merging in the near future. That will probably come to pass, but, first, an important question has to be answered: what should happen when the hardware catches an attempt by the kernel to write (post initialization) memory that had been marked __read_only ?

When things go wrong

This question matters because there is a potential hazard whenever a data structure is marked __read_only : the developer involved may have overlooked the one case where, after a rare sequence of events on days with a waxing gibbous moon, that data structure must be changed. Or there may be a case where data structures are modified unnecessarily, perhaps storing data that is already there anyway. Such cases work in current kernels, but would break if the data being written were made read-only. Mathias Krause described one such experience, wherein the system would fail during the resume sequence. As he noted: "Debugging that kind of problem is sort of a PITA, you could imagine."

The ideal solution would be to have the compiler catch attempts to modify __read_only data outside of the initialization sequence, but that is not currently possible. Simply marking the relevant data structures const will not work; those data structures are written to during boot and, as PaX Team pointed out, making them const opens the door to all kinds of surprising, optimization-related behavior from the compiler. Where compilers are involved, surprising behavior is rarely a good thing. As an alternative, Mathias suggested the use of a special-purpose GCC module to detect inappropriate writes. There seems to be agreement that this is a good idea, but no such module exists and it will take time to create one. Holding this patch set until a checker module can be created seems undesirable.

But without such a checker, there will almost certainly be situations where the kernel tries to write to something marked __read_only , either because it was so marked in error or as the result of some other bug. There have been a number of ideas put forward on how such problems could be handled.

The most obvious thing to do is to simply oops the kernel, with the usual results for the process that was running and, perhaps, the machine as a whole. Andy Lutomirski supported this approach, saying: "We failed, we might be under attack, let's oops." The problem with this approach, of course, is that it takes the machine out of commission, possibly with an error that is less than fun to try to track down. Ingo Molnar also worried that the oops information would, in most desktop cases, never be seen by the user and, as a result, would never be reported to developers. That highlights an old problem with presenting such information on desktop systems, but that problem is unlikely to be fixed right now.

The alternative to oopsing the system would be to log the error and somehow try to continue. Ingo suggested simply skipping over the offending instruction and trying to continue, but that idea did not go far; as PaX Team pointed out, simply dropping an intended write operation could create no end of strange problems further down the line and may actually help exploit attempts. Linus suggested, instead, that the kernel could mark the relevant page writable and retry the instruction. That would, of course, remove the read-only protection from that page, but it would allow the system to continue to operate while generating diagnostic information for developers. One would probably not want things to work this way on a production system, but it could be an invaluable option for developers.

The final piece of the puzzle might be to have a kernel command-line operation to disable the read-only marking entirely. That would provide an option to users who run into a bug and need to be able to get their work done until a proper fix is available.

Kees has indicated that his current approach is to take the kill-the-machine approach by default. He has already implemented the command-line option, and said that Linus's "mark the page writable" suggestion would not be difficult to add. So the next version of the patch should have addressed most of the concerns expressed so far. Getting it merged may prove to be the easy part, though; the task of identifying and marking truly read-only data could be a long and error-prone affair, even when starting with the work that the grsecurity developers have already done. The good news is that this work should make the kernel more secure, provide a (perhaps imperceptible) performance improvement, and turn up a few bugs along the way.

