Previously: v4.16.

Linux kernel v4.17 was released last week, and here are some of the security things I think are interesting:

Jailhouse hypervisor

Jan Kiszka landed Jailhouse hypervisor support, which uses static partitioning (i.e. no resource over-committing), where the root “cell” spawns new jails by shrinking its own CPU/memory/etc resources and hands them over to the new jail. There’s a nice write-up of the hypervisor on LWN from 2014.

Sparc ADI

Khalid Aziz landed the userspace support for Sparc Application Data Integrity (ADI or SSM: Silicon Secured Memory), which is the hardware memory coloring (tagging) feature in Sparc M7. I’d love to see this extended into the kernel itself, as it would kill linear overflows between allocations, since the base pointer being used is tagged to belong to only a certain allocation (sized to a multiple of cache lines). Any attempt to increment beyond, into memory with a different tag, raises an exception. Enrico Perla has some great write-ups on using ADI in allocators and a comparison of ADI to Intel’s MPX.

new kernel stacks cleared on fork

It was possible that old memory contents would live in a new process’s kernel stack. While normally not visible, “uninitialized” memory read flaws or read overflows could expose these contents (especially stuff “deeper” in the stack that may never get overwritten for the life of the process). To avoid this, I made sure that new stacks were always zeroed. Oddly, this “priming” of the cache appeared to actually improve performance, though it was mostly in the noise.

MAP_FIXED_NOREPLACE

As part of further defense in depth against attacks like Stack Clash, Michal Hocko created MAP_FIXED_NOREPLACE . The regular MAP_FIXED has a subtle behavior not normally noticed (but used by some, so it couldn’t just be fixed): it will replace any overlapping portion of a pre-existing mapping. This means the kernel would silently overlap the stack into mmap or text regions, since MAP_FIXED was being used to build a new process’s memory layout. Instead, MAP_FIXED_NOREPLACE has all the features of MAP_FIXED without the replacement behavior: it will fail if a pre-existing mapping overlaps with the newly requested one. The ELF loader has been switched to use MAP_FIXED_NOREPLACE , and it’s available to userspace too, for similar use-cases.

pin stack limit during exec

I used a big hammer and pinned the RLIMIT_STACK values during exec. There were multiple methods to change the limit (through at least setrlimit() and prlimit() ), and there were multiple places the limit got used to make decisions, so it seemed best to just pin the values for the life of the exec so no games could get played with them. Too much assumed the value wasn’t changing, so better to make that assumption actually true. Hopefully this is the last of the fixes for these bad interactions between stack limits and memory layouts during exec (which have all been defensive measures against flaws like Stack Clash).

Variable Length Array removals start

Following some discussion over Alexander Popov’s ongoing port of the stackleak GCC plugin, Linus declared that Variable Length Arrays (VLAs) should be eliminated from the kernel entirely. This is great because it kills several stack exhaustion attacks, including weird stuff like stepping over guard pages with giant stack allocations. However, with several hundred uses in the kernel, this wasn’t going to be an easy job. Thankfully, a whole bunch of people stepped up to help out: Gustavo A. R. Silva, Himanshu Jha, Joern Engel, Kyle Spiers, Laura Abbott, Lorenzo Bianconi, Nikolay Borisov, Salvatore Mesoraca, Stephen Kitt, Takashi Iwai, Tobin C. Harding, and Tycho Andersen. With Linus Torvalds and Martin Uecker, I also helped rewrite the max() macro to eliminate false positives seen by the -Wvla compiler option. Overall, about 1/3rd of the VLA instances were solved for v4.17, with many more coming for v4.18. I’m hoping we’ll have entirely eliminated VLAs by the time v4.19 ships.

Edit: I missed this next feature when I first published this post:

syscall register clearing, x86

One of the ways attackers can influence potential speculative execution flaws in the kernel is to leak information into the kernel via “unused” register contents. Most syscalls take only a few arguments, so all the other calling-convention-defined registers can be cleared instead of just left with whatever contents they had in userspace. As it turns out, clearing registers is very fast. Dominik Brodowski expanded a proof-of-concept from Linus Torvalds into a full register-clearing syscall wrapper on x86.

That’s it for now! Please let me know if you think I missed anything. Stay tuned for v4.18; the merge window is open. :)

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

