During a performance evaluation, an unfortunate interaction between the STACKLEAK plugin and the RAP plugin was noticed that led to unnecessary code bloat. This blog post highlights the steps that have been taken by our new team member, Mathias Krause, to resolve the source of the problem.

Background on STACKLEAK and RAP

First, some background. You can skip this section if you're familiar with the purpose of the STACKLEAK and RAP plugins.

STACKLEAK

STACKLEAK was introduced back in 2011 as a coarse-grained countermeasure to address stack-based infoleaks. Its main idea is to wipe the kernel stack on syscall exit (sometimes also on entry) to prevent leaking any sensitive information from previous syscalls via uninitialized stack variables.

As wiping the full kernel stack on every syscall is quite a performance killer — it's 16k on x86-64 these days — the plugin tries to track how much stack space is actually used. A function — pax_track_stack() — regularly stores the current stack pointer value into the task-specific variable lowest_stack that gets used on syscall exit as the start address to wipe from. The plugin injects calls to pax_track_stack() into the generated code for eligible functions, either ones with a large enough stack frame or users of alloca().

Technically, this instrumentation happens in two passes: the first pass injects a call to pax_track_stack() into the prologue of every function. The second one removes the call again for functions that fail the eligibility test, i.e. make no use of alloca() and have a small enough stack frame. The reason for doing it in two passes is that the final stack frame size is only known very late in the compilation process. However, at that stage, no additional calls to pax_track_stack() can be injected any more.
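The two-pass scheme can be summarized in pseudocode (a paraphrase of the behavior described above, not the plugin's actual source):

```
/* Pass 1: runs early, before final frame sizes are known. */
for_each_function(fn)
    insert_prologue_call(fn, "pax_track_stack");

/* Pass 2: runs late, once register allocation has fixed frame sizes. */
for_each_function(fn)
    if (!uses_alloca(fn) && frame_size(fn) < TRACK_FRAME_THRESHOLD)
        remove_call(fn, "pax_track_stack");
```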

RAP

The RAP plugin was released to the public in April 2016. It provides a security property commonly known as Control Flow Integrity (CFI), preventing exploit techniques that try to execute code out of intended order or even try to execute unintended instructions (return-to-libc, borrowed code chunks, ROP, etc.). To achieve this goal, it weaves identifiers into the code stream that mark legitimate call sites and return locations, which we call RAP hashes. These identifiers are 64-bit values that are generated by hashing the type information of the called function.

As RAP hash values rarely represent valid instructions, let alone instructions without side effects, these identifiers need to be skipped during execution of the code. For call sites, this is as easy as prepending the RAP hash in front of the function. The function symbol still refers to the first instruction, so the RAP hash check in the caller just needs to look 8 bytes in front of it to find the hash.

The return hash, on the other hand, needs to be spatially nearby the call instruction, as it's the return address that needs to be verified. Therefore the RAP hash needs to be part of the actual instruction stream that gets executed — without actually getting executed. Skipping the RAP return hash from getting executed is as simple as jumping over it. Here's an example:

ffffffff81000000 <_stext>:
      [...]
ffffffff81000014:      jmp    ffffffff81000023 <_stext+0x23>
ffffffff81000016:      movabs $0xffffffffba01b6dd,%rax
ffffffff81000020:      int3
ffffffff81000021:      int3
ffffffff81000022:      int3
ffffffff81000023:      callq  ffffffff81000220 <__startup_64>
ffffffff81000028:      [...]

The JMP instruction skips the embedded RAP return hash ( 0xffffffffba01b6dd ) and continues execution at the CALL instruction. The MOVABS following the JMP will never get executed. It's just to please disassemblers attempting to decode the RAP hash as instructions. The INT3 is just "padding" to ensure the RAP hash will always be at a fixed offset from the return address — -16 bytes in this case.

Control Flow Integrity is achieved by instrumenting all calls, indirect jumps and returns so they provide, and check, these hash values. Calls to pax_track_stack(), generated by the STACKLEAK plugin, will be instrumented as well. This is where it starts to get interesting.

The Issue

The STACKLEAK plugin adds calls to pax_track_stack() that the RAP plugin later instruments by embedding return hashes. However, the STACKLEAK plugin might remove the call in a later pass if the function is deemed to not make use of alloca() and has a stack frame size smaller than the threshold. In this case, however, the RAP return hash remains. This leads to code like the following:

ffffffff81215e10 <rap_sys_mmap_pgoff>:
ffffffff81215e10:      push   %r15
ffffffff81215e12:      mov    %r9,%r15
ffffffff81215e15:      push   %r14
ffffffff81215e17:      mov    %r8,%r14
ffffffff81215e1a:      push   %r13
ffffffff81215e1c:      mov    %rcx,%r13
ffffffff81215e1f:      push   %r12
ffffffff81215e21:      mov    %rdx,%r12
ffffffff81215e24:      push   %rbp
ffffffff81215e25:      mov    %rsi,%rbp
ffffffff81215e28:      push   %rbx
ffffffff81215e29:      mov    %rdi,%rbx
ffffffff81215e2c: ==>  jmp    ffffffff81215e40 <rap_sys_mmap_pgoff+0x30>
ffffffff81215e2e: ==>  movabs $0xffffffffdb9d6e07,%rax
ffffffff81215e38: ==>  int3
ffffffff81215e39: ==>  int3
ffffffff81215e3a: ==>  int3
ffffffff81215e3b: ==>  int3
ffffffff81215e3c: ==>  int3
ffffffff81215e3d: ==>  int3
ffffffff81215e3e: ==>  int3
ffffffff81215e3f: ==>  int3
ffffffff81215e40:      mov    %r15,%r9
ffffffff81215e43:      mov    %r14,%r8
ffffffff81215e46:      mov    %r13,%rcx
ffffffff81215e49:      mov    %r12,%rdx
ffffffff81215e4c:      mov    %rbp,%rsi
ffffffff81215e4f:      mov    %rbx,%rdi
ffffffff81215e52:      jmp    ffffffff81215e61 <rap_sys_mmap_pgoff+0x51>
ffffffff81215e54:      movabs $0xffffffffd6c086f5,%rax
ffffffff81215e5e:      int3
ffffffff81215e5f:      int3
ffffffff81215e60:      int3
ffffffff81215e61:      callq  ffffffff81215af0 <sys_mmap_pgoff>
ffffffff81215e66:      mov    0x30(%rsp),%rdx
ffffffff81215e6b:      cmpq   $0xffffffffd6c086f5,-0x10(%rdx)
ffffffff81215e73:      jne    ffffffff81215e88 <rap_sys_mmap_pgoff+0x78>
ffffffff81215e75:      pop    %rbx
ffffffff81215e76:      pop    %rbp
ffffffff81215e77:      pop    %r12
ffffffff81215e79:      pop    %r13
ffffffff81215e7b:      pop    %r14
ffffffff81215e7d:      pop    %r15
ffffffff81215e7f:      btsq   $0x3f,(%rsp)
ffffffff81215e85:      retq
ffffffff81215e86:      ud2
ffffffff81215e88:      ud1    (%rax),%edx
ffffffff81215e8b:      nopl   0x0(%rax,%rax,1)

Two things stand out:

1. The lines with arrows next to them above contain an "empty" RAP return hash code fragment. Empty, because there's no subsequent call that would check the RAP hash woven into the code.

2. The function prologue moves all function arguments (passed in registers RDI, RSI, RDX, RCX, R8 and R9 per the System V AMD64 ABI) to a new set of registers for no obvious reason — they get moved back immediately after the empty RAP return hash sequence.

Problem 1 arises because the STACKLEAK plugin removed its call to pax_track_stack() but not the RAP return hash, since it doesn't know it's there.

Problem 2 is, again, a remnant of the removed function call. The aforementioned registers are so-called caller-saved registers, meaning their values don't need to be preserved across function calls. Thus the compiler had to move them somewhere else for the following call to sys_mmap_pgoff() . In this case it did so by storing them into callee-saved registers, which are guaranteed to be preserved across function calls by the ABI. However, as sys_mmap_pgoff() expects the very same arguments, it had to move them all back to the original registers.

While Problem 1 is specific to the STACKLEAK and RAP interaction, Problem 2 is STACKLEAK-specific and can even be seen in the upstream version. Here's a quick additional example of a function, first without upstream STACKLEAK:

0000000000000900 <wake_page_function>:
 900:   callq  905 <wake_page_function+0x5>
        901: R_X86_64_PLT32 __fentry__-0x4
 905:   mov    (%rcx),%rax
 908:   cmp    %rax,-0x10(%rdi)
 90c:   je     911 <wake_page_function+0x11>
 90e:   xor    %eax,%eax
 910:   retq
 911:   movl   $0x1,0xc(%rcx)
 918:   movslq 0x8(%rcx),%r8
 91c:   cmp    %r8d,-0x8(%rdi)
 920:   jne    90e <wake_page_function+0xe>
 922:   bt     %r8,(%rax)
 926:   jb     92d <wake_page_function+0x2d>
 928:   jmpq   92d <wake_page_function+0x2d>
        929: R_X86_64_PLT32 autoremove_wake_function-0x4
 92d:   mov    $0xffffffff,%eax
 932:   retq

And with upstream STACKLEAK (again, with the track_stack call eliminated), which increases the function's size by 45% and its instruction count by 56%:

0000000000000a50 <wake_page_function>:
 a50:   callq  a55 <wake_page_function+0x5>
        a51: R_X86_64_PLT32 __fentry__-0x4
 a55:   push   %r13
 a57:   mov    %edx,%r13d
 a5a:   push   %r12
 a5c:   mov    %esi,%r12d
 a5f:   push   %rbp
 a60:   mov    %rdi,%rbp
 a63:   push   %rbx
 a64:   mov    %rcx,%rbx
 a67:   mov    (%rbx),%rax
 a6a:   cmp    %rax,-0x10(%rbp)
 a6e:   je     a79 <wake_page_function+0x29>
 a70:   xor    %eax,%eax
 a72:   pop    %rbx
 a73:   pop    %rbp
 a74:   pop    %r12
 a76:   pop    %r13
 a78:   retq
 a79:   movl   $0x1,0xc(%rbx)
 a80:   movslq 0x8(%rbx),%rdx
 a84:   cmp    %edx,-0x8(%rbp)
 a87:   jne    a70 <wake_page_function+0x20>
 a89:   bt     %rdx,(%rax)
 a8d:   jb     aa6 <wake_page_function+0x56>
 a8f:   mov    %rbx,%rcx
 a92:   mov    %r13d,%edx
 a95:   pop    %rbx
 a96:   mov    %r12d,%esi
 a99:   mov    %rbp,%rdi
 a9c:   pop    %rbp
 a9d:   pop    %r12
 a9f:   pop    %r13
 aa1:   jmpq   aa6 <wake_page_function+0x56>
        aa2: R_X86_64_PLT32 autoremove_wake_function-0x4
 aa6:   mov    $0xffffffff,%eax
 aab:   jmp    a72 <wake_page_function+0x22>
 aad:   nopl   (%rax)

All of this leads to unfortunate, unnecessary code bloat.

The Fix

Fixing the spurious register spilling is rather easy. We just need to tell the compiler that pax_track_stack() will follow a special calling convention that will preserve all register values, so the caller doesn't need to preserve any caller-saved registers itself. Unfortunately there's no gcc function attribute to do that on a per-function level. But we can change this on a per-compilation-unit level with the help of the -fcall-saved-* compiler switch.

Now that pax_track_stack() itself will take care of preserving all register values, we need to fix up the callers that don't know about the special calling convention. They would still preserve and restore caller-saved registers which would make the code bloat problem even worse, as now two entities would be preserving the register values instead of just one.

Luckily, there are no direct callers of pax_track_stack(). The only callers are the ones generated by the STACKLEAK plugin itself. So the lack of a gcc function attribute is no real issue, as we can control the call sites by modifying the plugin.

To achieve this goal (and solve the second problem at the same time) we need to hide the call from the compiler. If it's unaware that we're calling a function, it won't try to emit code that would preserve caller-saved registers. Basically, what we want to do is change the GIMPLE call into the equivalent inline assembly construct asm volatile ("call pax_track_stack"). However, we can do better by using pax_direct_call, a macro provided by the RAP plugin that takes care of embedding the RAP return hash for backward-edge checks. Moreover, this asm statement can now be completely removed in a later pass in case the STACKLEAK plugin decides no call to pax_track_stack() is needed — removing the embedded RAP return hash as well.

Going back to the initial example with RAP enabled, this is what gets generated with these fixes in place:

ffffffff811ecfe0 <rap_sys_mmap_pgoff>:
ffffffff811ecfe0:      jmp    ffffffff811ecfef <rap_sys_mmap_pgoff+0xf>
ffffffff811ecfe2:      movabs $0xffffffffd6c086f5,%rax
ffffffff811ecfec:      int3
ffffffff811ecfed:      int3
ffffffff811ecfee:      int3
ffffffff811ecfef:      callq  ffffffff811eccc0 <sys_mmap_pgoff>
ffffffff811ecff4:      mov    (%rsp),%rdx
ffffffff811ecff8:      cmpq   $0xffffffffd6c086f5,-0x10(%rdx)
ffffffff811ed000:      jne    ffffffff811ed00b <rap_sys_mmap_pgoff+0x2b>
ffffffff811ed002:      btsq   $0x3f,(%rsp)
ffffffff811ed008:      retq
ffffffff811ed009:      ud2
ffffffff811ed00b:      ud1    (%rax),%edx
ffffffff811ed00e:      xchg   %ax,%ax

The above is much shorter — no unneeded RAP return hash anymore and no unneeded register shuffling.

Real Case Impact

These optimizations are an improvement, but do they bring any tangible benefit when calls to pax_track_stack() are actually needed? Let's take a look at a function with an actual call to pax_track_stack():

With the above modifications, the code looks like this:

ffffffff81180120 <__bpf_prog_run384>:
ffffffff81180120:      sub    $0x1e0,%rsp
ffffffff81180127:      jmp    ffffffff81180136 <__bpf_prog_run384+0x16>
ffffffff81180129:      movabs $0xffffffffdb9d6e07,%rax
ffffffff81180133:      int3
ffffffff81180134:      int3
ffffffff81180135:      int3
ffffffff81180136:      callq  ffffffff810729a0 <pax_track_stack>
ffffffff8118013b:      lea    0x1e0(%rsp),%rax
ffffffff81180143:      mov    %rdi,0x8(%rsp)
ffffffff81180148:      lea    0x60(%rsp),%rdx
ffffffff8118014d:      mov    %rsp,%rdi
ffffffff81180150:      mov    %rax,0x50(%rsp)
ffffffff81180155:      jmp    ffffffff81180164 <__bpf_prog_run384+0x44>
ffffffff81180157:      movabs $0xffffffffc45cf82b,%rax
ffffffff81180161:      int3
ffffffff81180162:      int3
ffffffff81180163:      int3
ffffffff81180164:      callq  ffffffff8117e850 <___bpf_prog_run>
ffffffff81180169:      mov    0x1e0(%rsp),%rdx
ffffffff81180171:      cmpq   $0xffffffffba9431ed,-0x10(%rdx)
ffffffff81180179:      jne    ffffffff8118018b <__bpf_prog_run384+0x6b>
ffffffff8118017b:      add    $0x1e0,%rsp
ffffffff81180182:      btsq   $0x3f,(%rsp)
ffffffff81180188:      retq
ffffffff81180189:      ud2
ffffffff8118018b:      ud1    (%rax),%edx
ffffffff8118018e:      xchg   %ax,%ax

The old code looks as follows (additional instructions marked with arrows):

ffffffff811a1eb0 <__bpf_prog_run384>:
ffffffff811a1eb0: ==>  push   %rbp
ffffffff811a1eb1: ==>  mov    %rdi,%rbp
ffffffff811a1eb4: ==>  push   %rbx
ffffffff811a1eb5: ==>  mov    %rsi,%rbx
ffffffff811a1eb8:      sub    $0x1e0,%rsp
ffffffff811a1ebf:      jmp    ffffffff811a1ece <__bpf_prog_run384+0x1e>
ffffffff811a1ec1:      movabs $0xffffffffdb9d6e07,%rax
ffffffff811a1ecb:      int3
ffffffff811a1ecc:      int3
ffffffff811a1ecd:      int3
ffffffff811a1ece:      callq  ffffffff81269c40 <pax_track_stack>
ffffffff811a1ed3: ==>  mov    %rbx,%rsi
ffffffff811a1ed6:      lea    0x1e0(%rsp),%rax
ffffffff811a1ede:      lea    0x60(%rsp),%rdx
ffffffff811a1ee3:      mov    %rsp,%rdi
ffffffff811a1ee6:      mov    %rbp,0x8(%rsp)
ffffffff811a1eeb:      mov    %rax,0x50(%rsp)
ffffffff811a1ef0:      jmp    ffffffff811a1eff <__bpf_prog_run384+0x4f>
ffffffff811a1ef2:      movabs $0xffffffffc45cf82b,%rax
ffffffff811a1efc:      int3
ffffffff811a1efd:      int3
ffffffff811a1efe:      int3
ffffffff811a1eff:      callq  ffffffff811a0590 <___bpf_prog_run>
ffffffff811a1f04:      mov    0x1f0(%rsp),%rdx
ffffffff811a1f0c:      cmpq   $0xffffffffba9431ed,-0x10(%rdx)
ffffffff811a1f14:      jne    ffffffff811a1f28 <__bpf_prog_run384+0x78>
ffffffff811a1f16:      add    $0x1e0,%rsp
ffffffff811a1f1d: ==>  pop    %rbx
ffffffff811a1f1e: ==>  pop    %rbp
ffffffff811a1f1f:      btsq   $0x3f,(%rsp)
ffffffff811a1f25:      retq
ffffffff811a1f26:      ud2
ffffffff811a1f28:      ud1    (%rax),%edx
ffffffff811a1f2b:      nopl   0x0(%rax,%rax,1)

Not as dramatic a change as for rap_sys_mmap_pgoff() , but still less code.

The new version of __bpf_prog_run384() skips preserving RDI and RSI for the call to pax_track_stack() as those are no longer clobbered by that function. This saves us 7 instructions.

vmlinux size

The overall impact can be seen below by comparing the sizes of a defconfig kernel build:

$ size vmlinux-*
    text     data     bss      dec     hex filename
28418610  9145961 2775180 40339751 2678927 vmlinux-4.14-grsec
26120133  8367721 2759628 37247482 23859fa vmlinux-4.14-grsec+patch

2MB less kernel code (-8%) and ~760KB less data (-8.5%) — Great! — Wait! Less data? But this was all about reducing code size, right? Well, let's take a deeper look:

$ size -A vmlinux-*
vmlinux-4.14-grsec  :
section              size                  addr
.text            16130514  18446744071578845184
[...]
.orc_unwind_ip    3555016  18446744071603335328
.orc_unwind       5332524  18446744071606890344
.orc_lookup        252044  18446744071612222868
[...]
.init.begin       2015232  18446744071612481536
[...]
Total            40340163

vmlinux-4.14-grsec+patch  :
section              size                  addr
.text            15135186  18446744071578845184
[...]
.orc_unwind_ip    3033716  18446744071603335328
.orc_unwind       4550574  18446744071606369044
.orc_lookup        236492  18446744071610919620
[...]
.init.begin       1236992  18446744071611162624
[...]
Total            37247894

In fact, the actual text size reduction is only ~972KB. The remainder comes from the ORC unwinder tables which shrunk by ~1.2MB (-14.4%) as there are fewer instructions to take care of (the JMP and INT3 embedded into the RAP return hash).

The data size reduction, however, is not quite what it seems. Even though the .init.begin section is smaller, no real data was dropped. It's the enforced alignment in arch/x86/kernel/vmlinux.lds.S that accounts for the difference:

.init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
	BYTE(0)
#ifdef CONFIG_PAX_KERNEXEC
	. = ALIGN(HPAGE_SIZE);
#else
	. = ALIGN(PAGE_SIZE);
#endif
	__init_begin = .; /* paired with __init_end */
} :init.begin

$ nm -n vmlinux-4.14-grsec | grep -v __rap_hash_ | grep -C1 __init_begin
ffffffff83013080 D vsyscall_gtod_data
ffffffff83200000 T __init_begin
ffffffff83200000 A init_per_cpu__irq_stack_union
$ nm -n vmlinux-4.14-grsec+patch | grep -v __rap_hash_ | grep -C1 __init_begin
ffffffff82ed1080 D vsyscall_gtod_data
ffffffff83000000 T __init_begin
ffffffff83000000 A init_per_cpu__irq_stack_union

The vmlinux-4.14-grsec kernel was just "unlucky" to require lots of padding for the alignment, while the vmlinux-4.14-grsec+patch one did not.

Availability

These enhancements are available in all grsecurity stable patches as of 02/18/2020.