Previously: v5.1.

Linux kernel v5.2 was released last week! Here are some security-related things I found interesting:

page allocator freelist randomization

While the SLUB and SLAB allocator freelists have been randomized for a while now, the overarching page allocator itself wasn’t. This meant that anything doing allocation outside of the kmem_cache / kmalloc() would have deterministic placement in memory. This is bad both for security and for some cache management cases. Dan Williams implemented this randomization under CONFIG_SHUFFLE_PAGE_ALLOCATOR now, which provides additional uncertainty to memory layouts, though at a rather low granularity of 4MB (see SHUFFLE_ORDER ). Also note that this feature needs to be enabled at boot time with page_alloc.shuffle=1 unless you have direct-mapped memory-side-cache (you can check the state at /sys/module/page_alloc/parameters/shuffle ).

stack variable initialization with Clang

Alexander Potapenko added support via CONFIG_INIT_STACK_ALL for Clang’s -ftrivial-auto-var-init=pattern option that enables automatic initialization of stack variables. This provides even greater coverage than the prior GCC plugin for stack variable initialization, as Clang’s implementation also covers variables not passed by reference. (In theory, the kernel build should still warn about these instances, but even if they exist, Clang will initialize them.) Another notable difference between the GCC plugins and Clang’s implementation is that Clang initializes with a repeating 0xAA byte pattern, rather than zero. (Though this changes under certain situations, like for 32-bit pointers which are initialized with 0x000000AA.) As with the GCC plugin, the benefit is that the entire class of uninitialized stack variable flaws goes away.

Kernel Userspace Access Prevention on powerpc

Like SMAP on x86 and PAN on ARM, Michael Ellerman and Russell Currey have landed support for disallowing access to userspace without explicit markings in the kernel (KUAP) on Power9 and later PPC CPUs under CONFIG_PPC_RADIX_MMU=y (which is the default). This is the continuation of the execute protection (KUEP) in v4.10. Now if an attacker tries to trick the kernel into any kind of unexpected access from userspace (not just executing code), the kernel will fault.

Microarchitectural Data Sampling mitigations on x86

Another set of cache memory side-channel attacks came to light, and were consolidated together under the name Microarchitectural Data Sampling (MDS). MDS is weaker than other cache side-channels (less control over target address), but memory contents can still be exposed. Much like L1TF, when one’s threat model includes untrusted code running under Symmetric Multi Threading (SMT: more logical cores than physical cores), the only full mitigation is to disable hyperthreading (boot with “ nosmt “). For all the other variations of the MDS family, Andi Kleen (and others) implemented various flushing mechanisms to avoid cache leakage.

unprivileged userfaultfd sysctl knob

Both FUSE and userfaultfd provide attackers with a way to stall a kernel thread in the middle of memory accesses from userspace by initiating an access on an unmapped page. While FUSE is usually behind some kind of access controls, userfaultfd hadn’t been. To avoid various heap grooming and heap spraying techniques for exploiting Use-after-Free flaws, Peter Xu added the new “ vm.unprivileged_userfaultfd ” sysctl knob to disallow unprivileged access to the userfaultfd syscall.

temporary mm for text poking on x86

The kernel regularly performs self-modification with things like text_poke() (during stuff like alternatives, ftrace, etc). Before, this was done with fixed mappings (“fixmap”) where a specific fixed address at the high end of memory was used to map physical pages as needed. However, this resulted in some temporal risks: other CPUs could write to the fixmap, or there might be stale TLB entries on removal that other CPUs might still be able to write through to change the target contents. Instead, Nadav Amit has created a separate memory map for kernel text writes, as if the kernel is trying to make writes to userspace. This mapping ends up staying local to the current CPU, and the poking address is randomized, unlike the old fixmap.

ongoing: implicit fall-through removal

Gustavo A. R. Silva is nearly done with marking (and fixing) all the implicit fall-through cases in the kernel. Based on the pull request from Gustavo, it looks very much like v5.3 will see -Wimplicit-fallthrough added to the global build flags and then this class of bug should stay extinct in the kernel.

CLONE_PIDFD added

Christian Brauner added the new CLONE_PIDFD flag to the clone() system call, which complements the pidfd work in v5.1 so that programs can now gain a handle for a process ID right at fork() (actually clone() ) time, instead of needing to get the handle from /proc after process creation. With signals and forking now enabled, the next major piece (already in linux-next) will be adding P_PIDFD to the waitid() system call, and common process management can be done entirely with pidfd.

Other things

Alexander Popov pointed out some more v5.2 features that I missed in this blog post. I’m repeating them here, with some minor edits/clarifications. Thank you Alexander!

Edit: added CLONE_PIDFD notes, as reminded by Christian Brauner. :)

Edit: added Alexander Popov’s notes

That’s it for now; let me know if you think I should add anything here. We’re almost to -rc1 for v5.3!

© 2019, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

