Subject: [PATCH] x86/doc: add PTI description
From: Dave Hansen
Date: Thu, 04 Jan 2018 12:54:58 -0800

This got kicked out of the PTI set as the implementation diverged

from its contents. I've updated it so it can hopefully rejoin the

set.



---



From: Dave Hansen <dave.hansen@linux.intel.com>



Add some details about how PTI works, what some of the downsides

are, and how to debug it when things go wrong.



Also document the kernel parameter: 'nopti'.



Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>

Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>

Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>

Cc: Richard Fellner <richard.fellner@student.tugraz.at>

Cc: Andy Lutomirski <luto@kernel.org>

Cc: Linus Torvalds <torvalds@linux-foundation.org>

Cc: Kees Cook <keescook@google.com>

Cc: Hugh Dickins <hughd@google.com>

Cc: x86@kernel.org

---



b/Documentation/admin-guide/kernel-parameters.txt | 11 +

b/Documentation/x86/pti.txt | 174 ++++++++++++++++++++++

2 files changed, 182 insertions(+), 3 deletions(-)



diff -puN Documentation/admin-guide/kernel-parameters.txt~kpti-doc Documentation/admin-guide/kernel-parameters.txt

--- a/Documentation/admin-guide/kernel-parameters.txt~kpti-doc 2018-01-03 17:04:23.255028797 -0800

+++ b/Documentation/admin-guide/kernel-parameters.txt 2018-01-03 17:07:06.058028391 -0800

@@ -2712,8 +2712,6 @@

steal time is computed, but won't influence scheduler

behaviour



- nopti [X86-64] Disable kernel page table isolation

-

nolapic [X86-32,APIC] Do not enable or use the local APIC.



nolapic_timer [X86-32,APIC] Do not use the local APIC timer.

@@ -3288,12 +3286,19 @@

pt. [PARIDE]

See Documentation/blockdev/paride.txt.



- pti= [X86_64]

+ pti= [X86_64] Control Page Table Isolation of user and

+ kernel address spaces. Disabling this feature

+ removes hardening, but improves performance of

+ system calls and interrupts.

+

Control user/kernel address space isolation:

on - enable

off - disable

auto - default setting



+ nopti [X86_64]

+ Equivalent to pti=off

+

pty.legacy_count=

[KNL] Number of legacy pty's. Overwrites compiled-in

default number.

diff -puN /dev/null Documentation/x86/pti.txt

--- /dev/null 2017-12-15 13:48:30.454245127 -0800

+++ b/Documentation/x86/pti.txt 2018-01-04 12:54:05.667850771 -0800

@@ -0,0 +1,174 @@

+Overview

+========

+

+Page Table Isolation (pti, previously known as KAISER[1]) is a

+countermeasure against attacks on kernel address information such

+as the "Meltdown" approach[2].

+

+To avoid leaking address information, we create a new, independent

+copy of the page tables which are used only when running userspace

+applications. When the kernel is entered via syscalls, interrupts or

+exceptions, page tables are switched to the full "kernel" copy. When

+the system switches back to user mode, the user copy is used again.

+

+The userspace page tables contain only a minimal amount of kernel

+data: only what is needed to enter/exit the kernel such as the

+entry/exit functions themselves and the interrupt descriptor table

+(IDT). There are a few strictly unnecessary things that get mapped

+such as the first C function when entering an interrupt (see comments

+in pti.c).

+

+This approach helps to ensure that side-channel attacks that leverage

+the paging structures do not function when PTI is enabled. It can be

+enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.

+Once enabled at compile-time, it can be disabled at boot with the

+'nopti' or 'pti=' kernel parameter (see kernel-parameters.txt).

+
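As an illustration of the boot parameters mentioned above, here is a minimal Python sketch that works out which PTI mode a kernel command line asks for. The function name pti_mode is hypothetical, and letting the last matching option win is an assumption of this sketch; the text only states that 'nopti' is equivalent to 'pti=off' and that 'auto' is the default. It reports what was requested, not what the kernel actually enabled.

```python
def pti_mode(cmdline):
    """Return the PTI mode ("on", "off" or "auto") requested by a command line.

    Assumption of this sketch: when several PTI options appear, the last
    one wins. Unknown values for pti= are ignored.
    """
    mode = "auto"  # documented default setting
    for word in cmdline.split():
        if word == "nopti":
            mode = "off"  # documented as equivalent to pti=off
        elif word.startswith("pti="):
            value = word[len("pti="):]
            if value in ("on", "off", "auto"):
                mode = value
    return mode
```

On a running Linux system the current command line can be read from /proc/cmdline and passed to pti_mode() as a string.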

+Page Table Management

+=====================

+

+When PTI is enabled, the kernel manages two sets of page

+tables. The first copy is very similar to what would be present

+for a kernel without PTI. This includes a complete mapping of

+userspace that the kernel can use for things like copy_to_user().

+

+The userspace copy is used when running userspace and mirrors the

+mapping of userspace present in the kernel copy. It maps only

+the kernel data needed to enter and exit the kernel. This data

+is entirely contained in the 'struct cpu_entry_area' structure

+which is placed in the fixmap and thus each CPU's copy of the

+area has a compile-time-fixed virtual address.

+

+For new userspace mappings, the kernel makes the entries in its

+page tables like normal. The only difference is when the kernel

+makes entries in the top (PGD) level. In addition to setting the

+entry in the main kernel PGD, a copy of the entry is made in the

+userspace page tables' PGD.

+

+This sharing at the PGD level also inherently shares all the lower

+layers of the page tables. This leaves a single, shared set of

+userspace page tables to manage. One PTE to lock, one set of

+accessed bits, dirty bits, etc.

+
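The PGD-level sharing described above can be sketched as a toy model in Python. This is not kernel code: the class AddressSpace and its set_pgd() method are hypothetical names chosen for illustration, lower-level tables are stood in for by plain dictionaries, and the split of a 512-entry PGD into a lower userspace half and an upper kernel half is the usual x86-64 layout rather than something stated in the text.

```python
ENTRIES_PER_PGD = 512                  # entries in an x86-64 top-level table
USER_ENTRIES = ENTRIES_PER_PGD // 2    # lower half covers userspace addresses


class AddressSpace:
    """Toy model of one process with a kernel PGD and a userspace PGD."""

    def __init__(self):
        self.kernel_pgd = [None] * ENTRIES_PER_PGD
        self.user_pgd = [None] * ENTRIES_PER_PGD

    def set_pgd(self, index, lower_table):
        """Install a top-level entry, mirroring userspace entries."""
        self.kernel_pgd[index] = lower_table
        if index < USER_ENTRIES:
            # Both PGDs point at the same lower-level table object, so
            # everything below the PGD is shared rather than duplicated.
            self.user_pgd[index] = lower_table
```

Because both PGDs reference the same lower-level object, a change made through one copy is visible through the other, while entries in the kernel half stay absent from the userspace copy.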

+Overhead

+========

+

+Protection against side-channel attacks is important. But,

+this protection comes at a cost:

+

+1. Increased Memory Use

+ a. Each process now needs an order-1 PGD instead of order-0.

+ (Consumes an additional 4k per process).

+ b. The 'cpu_entry_area' structure must be 2MB in size and 2MB

+ aligned so that it can be mapped by setting a single PMD

+ entry. This consumes nearly 2MB of RAM once the kernel

+ is decompressed, but no space in the kernel image itself.

+

+2. Runtime Cost

+ a. CR3 manipulation to switch between the page table copies

+ must be done at interrupt, syscall, and exception entry

+ and exit (it can be skipped when the kernel is interrupted,

+ though.) Moves to CR3 are on the order of a hundred

+ cycles, and are required at every entry and at every exit.

+ b. A "trampoline" must be used for SYSCALL entry. This

+ trampoline depends on a smaller set of resources than the

+ non-PTI SYSCALL entry code, so requires mapping fewer

+ things into the userspace page tables. The downside is

+ that stacks must be switched at entry time.

+ c. Global pages are disabled for all kernel structures not

+ mapped into both kernel and userspace page tables. This

+ feature of the MMU allows different processes to share TLB

+ entries mapping the kernel. Losing the feature means more

+ TLB misses after a context switch. The actual loss of

+ performance is very small, however, never exceeding 1%.

+ d. Process Context IDentifiers (PCID) is a CPU feature that

+ allows us to skip flushing the entire TLB when switching page

+ tables. This makes switching the page tables (at context

+ switch, or kernel entry/exit) cheaper. But, on systems with

+ PCID support, the context switch code must flush both the user

+ and kernel entries out of the TLB. The user PCID TLB flush is

+ deferred until the exit to userspace, minimizing the cost.

+ e. The userspace page tables must be populated for each new

+ process. Even without PTI, the shared kernel mappings

+ are created by copying top-level (PGD) entries into each

+ new process. But, with PTI, there are now *two* kernel

+ mappings: one in the kernel page tables that maps everything

+ and one for the entry/exit structures. At fork(), we need to

+ copy both.

+ f. In addition to the fork()-time copying, there must also

+ be an update to the userspace PGD any time a set_pgd() is done

+ on a PGD used to map userspace. This ensures that the kernel

+ and userspace copies always map the same userspace

+ memory.

+ g. On systems without PCID support, each CR3 write flushes

+ the entire TLB. That means that each syscall, interrupt

+ or exception flushes the TLB.

+
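To put rough numbers on two of the costs listed above, here is a small back-of-the-envelope calculation in Python. The helper names are hypothetical, the 4096-byte page size is the standard x86 base page, and the default of 100 cycles per CR3 write is only a stand-in for "on the order of a hundred cycles"; real costs vary by CPU.

```python
PAGE_SIZE = 4096  # bytes in one x86 base page


def pgd_bytes(order):
    """Size of an order-N allocation: 2**N contiguous pages."""
    return PAGE_SIZE << order


def extra_pgd_memory(processes):
    """Additional PGD memory when each process moves from order-0 to order-1."""
    return processes * (pgd_bytes(1) - pgd_bytes(0))


def cr3_cycles_per_round_trip(cycles_per_write=100):
    """CR3 cost of one kernel entry plus the matching exit (assumed figure)."""
    return 2 * cycles_per_write
```

For example, extra_pgd_memory(1000) gives the added PGD footprint for a thousand processes, which is small next to the roughly 2MB taken by the cpu_entry_area mapping.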

+Possible Future Work

+====================

+1. We can be more careful about not actually writing to CR3

+ unless its value is actually changed.

+2. Allow PTI to be enabled/disabled at runtime in addition to the

+ boot-time switching.

+

+Testing

+=======

+

+To test stability of PTI, the following test procedure is recommended,

+ideally doing all of these in parallel:

+

+1. Set CONFIG_DEBUG_ENTRY=y

+2. Run several copies of all of the tools/testing/selftests/x86/ tests

+ (excluding MPX and protection_keys) in a loop on multiple CPUs for

+ several minutes. These tests frequently uncover corner cases in the

+ kernel entry code. In general, old kernels might cause these tests

+ themselves to crash, but they should never crash the kernel.

+3. Run the 'perf' tool in a mode (top or record) that generates many

+ frequent performance monitoring non-maskable interrupts (see "NMI"

+ in /proc/interrupts). This exercises the NMI entry/exit code which

+ is known to trigger bugs in code paths that did not expect to be

+ interrupted, including nested NMIs. Using "-c" boosts the rate of

+ NMIs, and using two -c with separate counters encourages nested NMIs

+ and less deterministic behavior.

+

+ while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done

+

+4. Launch a KVM virtual machine.

+5. Run 32-bit binaries on systems supporting the SYSCALL instruction.

+ This has been a lightly-tested code path and needs extra scrutiny.

+
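The "run several copies in a loop, in parallel" part of the procedure above could be driven by a small harness like the following Python sketch. The function names loop_command and stress are hypothetical, and the harness knows nothing about PTI itself: it simply re-runs whatever commands it is given until a time budget expires and counts non-zero exit codes.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def loop_command(cmd, seconds):
    """Re-run cmd until the time budget is spent; return (runs, failures)."""
    deadline = time.monotonic() + seconds
    runs = failures = 0
    while time.monotonic() < deadline:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        runs += 1
        if result.returncode != 0:
            failures += 1
    return runs, failures


def stress(commands, copies, seconds):
    """Run `copies` parallel loops of every command for `seconds` each."""
    jobs = [cmd for cmd in commands for _ in range(copies)]
    # Threads are enough here: the real work happens in child processes.
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        return list(pool.map(lambda cmd: loop_command(cmd, seconds), jobs))
```

Each command is an argument list, so the perf invocation from step 3 would be passed as ["perf", "record", "-c", "10000", "-e", "instructions,cycles", "-a", "sleep", "10"], alongside the paths of whichever selftest binaries have been built. A non-zero failure count from a selftest is worth investigating; a kernel crash will of course take the harness down with it.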

+Debugging

+=========

+

+Bugs in PTI cause a few different signatures of crashes

+that are worth noting here.

+

+ * Failures of the selftests/x86 code. Usually a bug in one of the

+ more obscure corners of entry_64.S.

+ * Crashes in early boot, especially around CPU bringup. Bugs

+ in the trampoline code or mappings cause these.

+ * Crashes at the first interrupt. Caused by bugs in entry_64.S,

+ like screwing up a page table switch. Also caused by

+ incorrectly mapping the IRQ handler entry code.

+ * Crashes at the first NMI. The NMI code is separate from main

+ interrupt handlers and can have bugs that do not affect

+ normal interrupts. Also caused by incorrectly mapping NMI

+ code. NMIs that interrupt the entry code must be very

+ careful and can be the cause of crashes that show up when

+ running perf.

+ * Kernel crashes at the first exit to userspace. entry_64.S

+ bugs, or failing to map some of the exit code.

+ * Crashes at first interrupt that interrupts userspace. The paths

+ in entry_64.S that return to userspace are sometimes separate

+ from the ones that return to the kernel.

+ * Double faults: overflowing the kernel stack because of page

+ faults upon page faults. Caused by touching non-pti-mapped

+ data in the entry code, or forgetting to switch to kernel

+ CR3 before calling into C functions which are not pti-mapped.

+ * Userspace segfaults early in boot, sometimes manifesting

+ as mount(8) failing to mount the rootfs. These have

+ tended to be TLB invalidation issues. Usually invalidating

+ the wrong PCID, or otherwise missing an invalidation.

+

+1. https://gruss.cc/files/kaiser.pdf

+2. https://meltdownattack.com/meltdown.pdf

_

