Although these days I’m mostly known for application level networking and distributed systems, I spent the first part of my career working on operating systems and hypervisors. I maintain a deep fascination with the low level details of how modern processors and systems software work. When the recent Meltdown and Spectre vulnerabilities were announced, I dug into the available information and was eager to learn more.

The vulnerabilities are astounding; I would argue they are among the most important discoveries in computer science in the last 10–20 years. The mitigations are also difficult to understand and accurate information about them is hard to find. This is not surprising given their critical nature. Mitigating the vulnerabilities has required months of secretive work by all of the major CPU, operating system, and cloud vendors. The fact that the issues were kept under wraps for 6 months when literally hundreds of people were likely working on them is amazing.

Although a lot has been written about Meltdown and Spectre since their announcement, I have not seen a good mid-level introduction to the vulnerabilities and mitigations. In this post I’m going to attempt to correct that by providing a gentle introduction to the hardware and software background required to understand the vulnerabilities, a discussion of the vulnerabilities themselves, as well as a discussion of the current mitigations.

Important note: Because I have not worked directly on the mitigations, and do not work at Intel, Microsoft, Google, Amazon, Red Hat, etc., some of the details that I am going to provide may not be entirely accurate. I have pieced together this post based on my knowledge of how these systems work, publicly available documentation, and patches/discussion posted to LKML and xen-devel. I would love to be corrected if any of this post is inaccurate, though I doubt that will happen any time soon given how much of this subject is still covered by NDA.

Background

In this section I will provide some background required to understand the vulnerabilities. The section glosses over a large amount of detail and is aimed at readers with a limited understanding of computer hardware and systems software.

Virtual memory

Virtual memory is a technique used by all operating systems since the 1970s. It provides a layer of abstraction between the memory address layout that most software sees and the physical devices backing that memory (RAM, disks, etc.). At a high level, it allows applications to utilize more memory than the machine actually has; this provides a powerful abstraction that makes many programming tasks easier.

Figure 1: Virtual memory

Figure 1 shows a simplistic computer with 400 bytes of memory laid out in “pages” of 100 bytes (real computers use powers of two, typically 4096). The computer has two processes, each with 200 bytes of memory across 2 pages. The processes might be running the same code using fixed addresses in the 0–199 byte range; however, they are backed by discrete physical memory such that they don’t influence each other. Although modern operating systems and computers use virtual memory in a substantially more complicated way than what is presented in this example, the basic premise presented above holds in all cases. Operating systems are abstracting the addresses that the application sees from the physical resources that back them.

Translating virtual to physical addresses is such a common operation in modern computers that if the OS had to be involved in all cases the computer would be incredibly slow. Modern CPU hardware provides a device called a Translation Lookaside Buffer (TLB) that caches recently used mappings. This allows CPUs to perform address translation directly in hardware the majority of the time.

Figure 2: Virtual memory translation

Figure 2 shows the address translation flow:

A program fetches a virtual address.

The CPU attempts to translate it using the TLB. If the address is found, the translation is used.

If the address is not found, the CPU consults a set of “page tables” to determine the mapping. Page tables are a set of physical memory pages provided by the operating system in a location where the hardware can find them (for example, on x86 hardware the CR3 register points to them). Page tables map virtual addresses to physical addresses, and also contain metadata such as permissions.

If the page table contains a mapping it is returned, cached in the TLB, and used for lookup.

If the page table does not contain a mapping, a “page fault” is raised to the OS. A page fault is a special kind of interrupt that allows the OS to take control and determine what to do when there is a missing or invalid mapping. For example, the OS might terminate the program. It might also allocate some physical memory and map it into the process.

If a page fault handler continues execution, the new mapping will be used by the TLB.
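The flow above can be sketched as a toy model. This is only an illustration under stated assumptions: the 100-byte pages and 400 bytes of physical memory come from Figure 1, while the names `Mmu`, `Translation`, `Map`, and `Translate` are hypothetical and do not correspond to any real hardware or OS interface.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy sizes taken from Figure 1: 100-byte pages, 400 bytes of physical memory.
constexpr uint32_t kPageSize = 100;
constexpr uint32_t kPhysicalPages = 4;

struct Translation {
  bool fault;      // true means "raise a page fault to the OS"
  uint32_t paddr;  // valid only when fault is false
};

struct Mmu {
  std::unordered_map<uint32_t, uint32_t> page_table;  // virtual page -> physical page, owned by the OS
  std::unordered_map<uint32_t, uint32_t> tlb;         // hardware cache of recently used mappings
  int tlb_hits = 0;
  int page_table_walks = 0;

  // What the OS does when it maps a page into the process.
  void Map(uint32_t vpage, uint32_t ppage) {
    assert(ppage < kPhysicalPages);
    page_table[vpage] = ppage;
  }

  Translation Translate(uint32_t vaddr) {
    const uint32_t vpage = vaddr / kPageSize;
    const uint32_t offset = vaddr % kPageSize;

    auto hit = tlb.find(vpage);
    if (hit != tlb.end()) {  // TLB hit: no page table access needed
      ++tlb_hits;
      return {false, hit->second * kPageSize + offset};
    }

    ++page_table_walks;  // TLB miss: consult the page tables
    auto entry = page_table.find(vpage);
    if (entry == page_table.end()) {
      return {true, 0};  // no mapping: page fault
    }
    tlb[vpage] = entry->second;  // cache the mapping for next time
    return {false, entry->second * kPageSize + offset};
  }
};
```

With virtual page 1 mapped to physical page 3, translating address 150 walks the page table once and yields 350; a second lookup on the same page is served from the TLB without another walk.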

Figure 3: User/kernel virtual memory mappings

Figure 3 shows a slightly more realistic view of what virtual memory looks like in a modern computer (pre-Meltdown — more on this below). In this setup we have the following features:

Kernel memory is shown in red. It is contained in physical address range 0–99. Kernel memory is special memory that only the operating system should be able to access. User programs should not be able to access it.

User memory is shown in gray.

Unallocated physical memory is shown in blue.

In this example, we start seeing some of the useful features of virtual memory. Primarily:

User memory in each process is in the virtual range 0–99, but backed by different physical memory.

Kernel memory in each process is in the virtual range 100–199, but backed by the same physical memory.

As I briefly mentioned in the previous section, each page has associated permission bits. Even though kernel memory is mapped into each user process, when the process is running in user mode it cannot access the kernel memory. If a process attempts to do so, it will trigger a page fault at which point the operating system will terminate it. However, when the process is running in kernel mode (for example during a system call), the processor will allow the access.
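The permission check described above can be sketched in a few lines. This is a simplified model, assuming a single "user accessible" bit per page; real page table entries carry more flags, and the names `PageTableEntry`, `Mode`, and `AccessAllowed` are made up for illustration.

```cpp
#include <cstdint>

enum class Mode { kUser, kKernel };

struct PageTableEntry {
  uint32_t physical_page;
  bool present;          // is there a mapping at all?
  bool user_accessible;  // may user mode touch this page?
};

// Returns true if the access is allowed, false if it should raise a page fault.
bool AccessAllowed(const PageTableEntry& pte, Mode mode) {
  if (!pte.present) return false;
  if (mode == Mode::kUser && !pte.user_accessible) return false;
  return true;
}
```

A kernel page is mapped in both modes, but only `Mode::kKernel` passes the check; this is the architectural guarantee that the rest of the post examines.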

At this point I will note that this type of dual mapping (each process having the kernel mapped into it directly) has been standard practice in operating system design for over thirty years for performance reasons (system calls are very common and it would take a long time to remap the kernel or user space on every transition).

CPU cache topology

Figure 4: CPU thread, core, package, and cache topology.

The next piece of background information required to understand the vulnerabilities is the CPU and cache topology of modern processors. Figure 4 shows a generic topology that is common to most modern CPUs. It is composed of the following components:

The basic unit of execution is the “CPU thread” or “hardware thread” or “hyper-thread.” Each CPU thread contains a set of registers and the ability to execute a stream of machine code, much like a software thread.

CPU threads are contained within a “CPU core.” Most modern CPUs contain two threads per core.

Modern CPUs generally contain multiple levels of cache memory. The cache levels closer to the CPU thread are smaller, faster, and more expensive. The further away from the CPU and closer to main memory the cache is, the larger, slower, and less expensive it is.

Typical modern CPU design uses an L1/L2 cache per core. This means that each CPU thread on the core makes use of the same caches.

Multiple CPU cores are contained in a “CPU package.” Modern CPUs might contain upwards of 30 cores (60 threads) or more per package.

All of the CPU cores in the package typically share an L3 cache.

CPU packages fit into “sockets.” Most consumer computers are single socket while many datacenter servers have multiple sockets.

Speculative execution

Figure 5: Modern CPU execution engine (Source: Google images)

The final piece of background information required to understand the vulnerabilities is a modern CPU technique known as “speculative execution.” Figure 5 shows a generic diagram of the execution engine inside a modern CPU.

The primary takeaway is that modern CPUs are incredibly complicated and do not simply execute machine instructions in order. Each CPU thread has a complicated pipelining engine that is capable of executing instructions out of order. The reason for this has to do with caching. As I discussed in the previous section, each CPU makes use of multiple levels of caching. Each cache miss adds a substantial amount of delay time to program execution. In order to mitigate this, processors are capable of executing ahead and out of order while waiting for memory loads. This is known as speculative execution. The following code snippet demonstrates this.

if (x < array1_size) {

y = array2[array1[x] * 256];

}

In the previous snippet, imagine that array1_size is not available in cache, but the address of array1 is. The CPU might guess (speculate) that x is less than array1_size and go ahead and perform the calculations inside the if statement. Once array1_size is read from memory, the CPU can determine if it guessed correctly. If it did, it can continue having saved a bunch of time. If it didn’t, it can throw away the speculative calculations and start over. This is no worse than if it had waited in the first place.

Another type of speculative execution is known as indirect branch prediction. This is extremely common in modern programs due to virtual dispatch.

class Base {

public:

virtual void Foo() = 0;

};

class Derived : public Base {

public:

void Foo() override { … }

};

Base* obj = new Derived;

obj->Foo();

(The source of the previous snippet is this post)

The way the previous snippet is implemented in machine code is to load the “v-table” or “virtual dispatch table” from the memory location that obj points to and then call it. Because this operation is so common, modern CPUs have various internal caches and will often guess (speculate) where the indirect branch will go and continue execution at that point. Again, if the CPU guesses correctly, it can continue having saved a bunch of time. If it didn’t, it can throw away the speculative calculations and start over.
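To make the indirect branch visible, here is a hand-rolled approximation of what a compiler typically emits for a virtual call. It is a sketch, not the exact layout any particular compiler uses; `Object`, `VTable`, `DerivedFoo`, and `CallFoo` are hypothetical names.

```cpp
#include <cassert>

struct Object;
using FooFn = int (*)(Object*);

// One function pointer per virtual method.
struct VTable {
  FooFn foo;
};

// Every object starts with a pointer to its class's table.
struct Object {
  const VTable* vtable;
};

int DerivedFoo(Object*) { return 42; }
const VTable kDerivedVTable = {&DerivedFoo};

int CallFoo(Object* obj) {
  assert(obj != nullptr && obj->vtable != nullptr);
  const VTable* table = obj->vtable;  // load the table pointer from the object
  FooFn target = table->foo;          // load the function address from the table
  return target(obj);                 // indirect branch: target is only known at run time
}
```

The final call is the instruction the branch predictor has to guess at: until both loads complete, the CPU does not know where execution continues.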

Meltdown vulnerability

Having now covered all of the background information, we can dive into the vulnerabilities.

Rogue data cache load

The first vulnerability, known as Meltdown, is surprisingly simple to explain and almost trivial to exploit. The exploit code roughly looks like the following:

1. uint8_t* probe_array = new uint8_t[256 * 4096];

2. // ... Make sure probe_array is not cached

3. uint8_t kernel_memory = *(uint8_t*)(kernel_address);

4. uint64_t final_kernel_memory = kernel_memory * 4096;

5. uint8_t dummy = probe_array[final_kernel_memory];

6. // ... catch page fault

7. // ... determine which of 256 slots in probe_array is cached

Let’s take each step above, describe what it does, and how it leads to being able to read the memory of the entire computer from a user program.

1. In the first line, a “probe array” is allocated. This is memory in our process which is used as a side channel to retrieve data from the kernel. How this is done will become apparent soon.

2. Following the allocation, the attacker makes sure that none of the memory in the probe array is cached. There are various ways of accomplishing this, the simplest of which uses CPU-specific instructions to clear a memory location from cache.

3. The attacker then proceeds to read a byte from the kernel’s address space. Remember from our previous discussion about virtual memory and page tables that all modern kernels typically map the entire kernel virtual address space into the user process. Operating systems rely on the fact that each page table entry has permission settings, and that user mode programs are not allowed to access kernel memory. Any such access will result in a page fault. That is indeed what will eventually happen at step 3. However, modern processors also perform speculative execution and will execute ahead of the faulting instruction. Thus, steps 3–5 may execute in the CPU’s pipeline before the fault is raised.

4. In this step, the byte of kernel memory (which ranges from 0–255) is multiplied by the page size of the system, which is typically 4096.

5. In this step, the multiplied byte of kernel memory is then used to read from the probe array into a dummy value. The multiplication of the byte by 4096 is to prevent a CPU feature called the “prefetcher” from reading more data than we want into the cache.

6. By this step, the CPU has realized its mistake and rolled back to step 3. However, the results of the speculated instructions are still visible in cache. The attacker uses operating system functionality to trap the faulting instruction and continue execution (e.g., handling SIGSEGV).

7. In step 7, the attacker iterates through and sees how long it takes to read each of the 256 possible bytes in the probe array that could have been indexed by the kernel memory. The CPU will have loaded one of the locations into cache and this location will load substantially faster than all the other locations (which need to be read from main memory). The index of this location is the value of the byte in kernel memory.
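The recovery logic in step 7 can be illustrated with a deliberately artificial simulation. Nothing here touches real caches or kernel memory: `ToyCache` is a hypothetical stand-in that simply remembers which offsets were touched, and the latency numbers are arbitrary values chosen for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

constexpr uint64_t kPageSize = 4096;
constexpr int kSlots = 256;
constexpr int kFastLatency = 40;   // arbitrary "cached" cost for this toy model
constexpr int kSlowLatency = 300;  // arbitrary "main memory" cost for this toy model

// Toy stand-in for the CPU cache over probe_array.
struct ToyCache {
  std::set<uint64_t> cached_offsets;

  void Flush() { cached_offsets.clear(); }  // step 2: nothing is cached

  void Touch(uint64_t offset) {
    assert(offset < kSlots * kPageSize);  // must fall inside probe_array
    cached_offsets.insert(offset);
  }

  int LoadLatency(uint64_t offset) const {
    return cached_offsets.count(offset) ? kFastLatency : kSlowLatency;
  }
};

// Steps 3-5 as executed speculatively: the secret byte never reaches a
// register the attacker can read, but it selects which slot gets cached.
void SpeculativeWindow(ToyCache& cache, uint8_t secret_byte) {
  cache.Touch(static_cast<uint64_t>(secret_byte) * kPageSize);
}

// Step 7: time all 256 slots and return the index of the fast one, or -1.
int RecoverByte(const ToyCache& cache) {
  for (int i = 0; i < kSlots; ++i) {
    if (cache.LoadLatency(static_cast<uint64_t>(i) * kPageSize) < kSlowLatency) {
      return i;
    }
  }
  return -1;
}
```

The point of the model is that the attacker never reads the secret directly. Only the timing difference between slots carries the information.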

Using the above technique, and the fact that it is standard practice for modern operating systems to map all of physical memory into the kernel virtual address space, an attacker can read the computer’s entire physical memory.

Now, you might be wondering: “You said that page tables have permission bits. How can it be that user mode code was able to speculatively access kernel memory?” The reason is that this is a bug in Intel processors. In my opinion, there is no good reason, performance or otherwise, for this to be possible. Recall that all virtual memory access must occur through the TLB. It is easily possible during speculative execution to check that a cached mapping has permissions compatible with the current running privilege level. Intel hardware simply does not do this. Other processor vendors do perform a permission check and block speculative execution. Thus, as far as we know, Meltdown is an Intel-only vulnerability.

Edit: It appears that at least one ARM processor is also susceptible to Meltdown as indicated here and here.

Meltdown mitigations

Meltdown is easy to understand, trivial to exploit, and fortunately also has a relatively straightforward mitigation (at least conceptually — kernel developers might not agree that it is straightforward to implement).

Kernel page table isolation (KPTI)

Recall that in the section on virtual memory I described that all modern operating systems use a technique in which kernel memory is mapped into every user mode process virtual memory address space. This is for both performance and simplicity reasons. It means that when a program makes a system call, the kernel is ready to be used without any further work. The fix for Meltdown is to no longer perform this dual mapping.

Figure 6: Kernel page table isolation

Figure 6 shows a technique called Kernel Page Table Isolation (KPTI). This basically boils down to not mapping kernel memory into a program when it is running in user space. If there is no mapping present, the speculative read of kernel memory is no longer possible: there is no translation for the CPU to follow, so the access simply faults.
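A minimal sketch of the idea, assuming the toy layout from Figure 3 (user memory at virtual page 0, kernel memory at virtual page 1 backed by physical page 0). The `Process` type and `MakeProcess` helper are hypothetical; real implementations also keep a small amount of kernel entry code mapped in the user view, which this model ignores.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

enum class View { kUser, kKernel };

struct Process {
  std::map<uint32_t, uint32_t> user_view;    // page table active in user mode
  std::map<uint32_t, uint32_t> kernel_view;  // page table active in kernel mode
  View active = View::kUser;

  bool IsMapped(uint32_t vpage) const {
    const auto& table = (active == View::kUser) ? user_view : kernel_view;
    return table.count(vpage) != 0;
  }
};

Process MakeProcess(uint32_t user_phys_page, bool kpti) {
  assert(user_phys_page != 0);  // physical page 0 belongs to the kernel in this toy layout
  Process p;
  p.user_view[0] = user_phys_page;
  p.kernel_view[0] = user_phys_page;
  p.kernel_view[1] = 0;
  if (!kpti) {
    // Pre-KPTI: the kernel is also present in the user view and is
    // protected only by permission bits.
    p.user_view[1] = 0;
  }
  return p;
}
```

With `kpti` set, a lookup of virtual page 1 in user mode finds no mapping at all, so there is nothing for speculation to read. Switching `active` to `View::kKernel` models the page table switch on kernel entry.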

In addition to making the operating system’s virtual memory manager (VMM) more complicated, without hardware assistance this technique will also considerably slow down workloads that make a large number of user mode to kernel mode transitions, due to the fact that the page tables have to be modified on each transition and the TLB needs to be flushed (given that the TLB may hold on to stale mappings).

Newer x86 CPUs have a feature known as ASID (address space ID) or PCID (process-context identifier) that can be used to make this task substantially cheaper (ARM and other architectures have had this feature for years). PCID allows an ID to be associated with a TLB entry and then to only flush TLB entries with that ID. The use of PCID makes KPTI cheaper, but still not free.
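The difference between a full flush and a tagged flush can be sketched with a toy TLB. This is an illustration only; `TaggedTlb` and its methods are hypothetical, and the 12-bit limit reflects the width of the PCID field on x86 as I understand it.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

struct TaggedTlb {
  // Key: (PCID, virtual page). Value: physical page.
  std::map<std::pair<uint16_t, uint32_t>, uint32_t> entries;

  void Insert(uint16_t pcid, uint32_t vpage, uint32_t ppage) {
    assert(pcid < 4096);  // PCIDs are 12 bits wide on x86
    entries[std::make_pair(pcid, vpage)] = ppage;
  }

  bool Lookup(uint16_t pcid, uint32_t vpage) const {
    return entries.count(std::make_pair(pcid, vpage)) != 0;
  }

  // Without tagging, every address space switch throws everything away.
  void FlushAll() { entries.clear(); }

  // With tagging, only the entries of one address space are dropped.
  void FlushPcid(uint16_t pcid) {
    for (auto it = entries.begin(); it != entries.end();) {
      if (it->first.first == pcid) {
        it = entries.erase(it);
      } else {
        ++it;
      }
    }
  }
};
```

Because lookups are keyed by ID, entries belonging to the user view and the kernel view can coexist, and a transition between them does not have to discard mappings that are still valid.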

In summary, Meltdown is an extremely serious and easy to exploit vulnerability. Fortunately it has a relatively straightforward mitigation that has already been deployed by all major OS vendors, the caveat being that certain workloads will run slower until future hardware is explicitly designed for the address space separation described.

Spectre vulnerability

Spectre shares some properties of Meltdown and is composed of two variants. Unlike Meltdown, Spectre is substantially harder to exploit, but affects almost all modern processors produced in the last twenty years. Essentially, Spectre is an attack against modern CPU and operating system design rather than a specific security vulnerability.

Bounds check bypass (Spectre variant 1)

The first Spectre variant is known as “bounds check bypass.” This is demonstrated in the following code snippet (which is the same code snippet I used to introduce speculative execution above).

if (x < array1_size) {

y = array2[array1[x] * 256];

}

In the previous example, assume the following sequence of events:

The attacker controls x.

array1_size is not cached.

array1 is cached.

The CPU guesses that x is less than array1_size. (CPUs employ various proprietary algorithms and heuristics to determine whether to speculate, which is why attack details for Spectre vary between processor vendors and models.)

The CPU executes the body of the if statement while it is waiting for array1_size to load, affecting the cache in a similar manner to Meltdown.

The attacker can then determine the actual value of array1[x] via one of various methods. (See the research paper for more details of cache inference attacks.)

Spectre is considerably more difficult to exploit than Meltdown because this vulnerability does not depend on privilege escalation. The attacker must convince the kernel to run code and speculate incorrectly. Typically the attacker must poison the speculation engine and fool it into guessing incorrectly. That said, researchers have shown several proof-of-concept exploits.

I want to reiterate what a truly incredible finding this exploit is. I do not personally consider this a CPU design flaw like Meltdown per se. I consider this a fundamental revelation about how modern hardware and software work together. The fact that CPU caches can be used indirectly to learn about access patterns has been known for some time. The fact that CPU caches can be used as a side-channel to dump computer memory is astounding, both conceptually and in its implications.

Branch target injection (Spectre variant 2)

Recall that indirect branching is very common in modern programs. Variant 2 of Spectre poisons the indirect branch predictor so that the CPU speculatively executes at a memory location that it never would have otherwise executed. If executing those instructions can leave state behind in the cache that can be detected using cache inference attacks, the attacker can then dump all of kernel memory. Like Spectre variant 1, Spectre variant 2 is much harder to exploit than Meltdown; however, researchers have demonstrated working proof-of-concept exploits of variant 2.

Spectre mitigations

The Spectre mitigations are substantially more interesting than the Meltdown mitigation. In fact, the academic Spectre paper writes that there are currently no known mitigations. It seems that behind the scenes and in parallel to the academic work, Intel (and probably other CPU vendors) and the major OS and cloud vendors have been working furiously for months to develop mitigations. In this section I will cover the various mitigations that have been developed and deployed. This is the section I am most hazy on as it is incredibly difficult to get accurate information so I am piecing things together from various sources.

Static analysis and fencing (variant 1 mitigation)

The only known variant 1 (bounds check bypass) mitigation is static analysis of code to determine code sequences that might be attacker controlled to interfere with speculation. Vulnerable code sequences can have a serializing instruction such as lfence inserted which halts speculative execution until all instructions up to the fence have been executed. Care must be taken when inserting fence instructions as too many can have severe performance impacts.
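As a rough sketch of what a fenced bounds check looks like, the earlier snippet can be wrapped in a function with a barrier placed between the check and the dependent loads. The `SPECULATION_BARRIER` macro and `ReadChecked` function are names invented for this example; on x86-64 the macro expands to the `_mm_lfence()` intrinsic, and on other targets it is a no-op placeholder, which is an assumption of this sketch rather than a real mitigation there.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SPECULATION_BARRIER() _mm_lfence()
#else
#define SPECULATION_BARRIER() ((void)0)  // placeholder: no barrier on other targets in this sketch
#endif

constexpr std::size_t kArray1Size = 16;
uint8_t array1[kArray1Size];
uint8_t array2[256 * 256];

uint8_t ReadChecked(std::size_t x) {
  if (x < kArray1Size) {
    // Loads below do not start until the bounds check above has resolved,
    // so a mispredicted branch cannot pull array1[x] into the cache.
    SPECULATION_BARRIER();
    return array2[array1[x] * 256];
  }
  return 0;
}
```

The function behaves exactly like the unfenced version architecturally; the only difference is that the CPU may not run ahead of the comparison, which is also where the performance cost comes from.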

Retpoline (variant 2 mitigation)

The first Spectre variant 2 (branch target injection) mitigation was developed by Google and is known as “retpoline.” It’s unclear to me whether it was developed in isolation by Google or by Google in collaboration with Intel. I would speculate that it was experimentally developed by Google and then verified by Intel hardware engineers, but I’m not sure. Details on the “retpoline” approach can be found in Google’s paper on the topic. I will summarize them here (I’m glossing over some details including underflow that are covered in the paper).

Retpoline relies on the fact that calling and returning from functions and the associated stack manipulations are so common in computer programs that CPUs are heavily optimized for performing them. (If you are not familiar with how the stack works in relation to calling and returning from functions this post is a good primer.) In a nutshell, when a “call” is performed, the return address is pushed onto the stack. “ret” pops the return address off and continues execution. Speculative execution hardware will remember the pushed return address and speculatively continue execution at that point.

The retpoline construction replaces an indirect jump to the memory location stored in register r11:

jmp *%r11

with:

call set_up_target; (1)

capture_spec: (4)

pause;

jmp capture_spec;

set_up_target:

mov %r11, (%rsp); (2)

ret; (3)

Let’s see what the previous assembly code does one step at a time and how it mitigates branch target injection.

(1) The code calls a memory location that is known at compile time, so the call uses a hard-coded offset and is not indirect. This places the return address of capture_spec on the stack.

(2) The return address from the call is overwritten with the actual jump target.

(3) A return is performed on the real target.

(4) When the CPU speculatively executes, it will return into an infinite loop! Remember that the CPU will speculate ahead until memory loads are complete. In this case, the speculation has been manipulated to be captured into an infinite loop that has no side effects that are observable to an attacker. When the CPU eventually executes the real return it will abort the speculative execution which had no effect.
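The gap between where the CPU predicts a return will go and where it actually goes can be shown with a small toy model. This is a conceptual sketch only: `ToyCpu` and its methods are hypothetical, addresses are represented as label strings, and the "return predictor" is a simplification of the hardware that remembers pushed return addresses.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct ToyCpu {
  std::vector<std::string> stack;             // architectural stack, in memory
  std::vector<std::string> return_predictor;  // CPU-internal memory of recent calls

  struct RetOutcome {
    std::string speculative;    // where the CPU guesses the return goes
    std::string architectural;  // where the return really goes
  };

  // (1) call: push the return address; the predictor remembers it too.
  void Call(const std::string& return_address) {
    stack.push_back(return_address);
    return_predictor.push_back(return_address);
  }

  // (2) mov %r11, (%rsp): only the in-memory copy changes.
  void OverwriteReturnAddress(const std::string& target) {
    assert(!stack.empty());
    stack.back() = target;
  }

  // (3) ret: prediction comes from the predictor, the truth from the stack.
  RetOutcome Ret() {
    assert(!stack.empty() && !return_predictor.empty());
    RetOutcome out{return_predictor.back(), stack.back()};
    return_predictor.pop_back();
    stack.pop_back();
    return out;
  }
};

ToyCpu::RetOutcome RunRetpoline(const std::string& r11_target) {
  ToyCpu cpu;
  cpu.Call("capture_spec");
  cpu.OverwriteReturnAddress(r11_target);
  return cpu.Ret();
}
```

In this model the speculative path always lands on capture_spec, regardless of what an attacker has done to the indirect branch predictor, while the architectural path reaches the real target.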

In my opinion, this is a truly ingenious mitigation. Kudos to the engineers that developed it. The downside to this mitigation is that it requires all software to be recompiled such that indirect branches are converted to retpoline branches. For cloud services such as Google that own the entire stack, recompilation is not a big deal. For others, it may be a very big deal or impossible.

IBRS, STIBP, and IBPB (variant 2 mitigation)

It appears that concurrently with retpoline development, Intel (and AMD to some extent) have been working furiously on hardware changes to mitigate branch target injection attacks. The three new hardware features being shipped as CPU microcode updates are:

Indirect Branch Restricted Speculation (IBRS)

Single Thread Indirect Branch Predictors (STIBP)

Indirect Branch Predictor Barrier (IBPB)

Limited information on the new microcode features is available from Intel here. I have been able to roughly piece together what these new features do by reading the above documentation and looking at Linux kernel and Xen hypervisor patches. From my analysis, each feature is potentially used as follows:

IBRS both flushes the branch prediction cache between privilege levels (user to kernel) and disables branch prediction on the sibling CPU thread. Recall that each CPU core typically has two CPU threads. It appears that on modern CPUs the branch prediction hardware is shared between the threads. This means that not only can user mode code poison the branch predictor prior to entering kernel code, code running on the sibling CPU thread can also poison it. Enabling IBRS while in kernel mode essentially prevents any previous execution in user mode and any execution on the sibling CPU thread from affecting branch prediction.

STIBP appears to be a subset of IBRS that just disables branch prediction on the sibling CPU thread. As far as I can tell, the main use case for this feature is to prevent a sibling CPU thread from poisoning the branch predictor when running two different user mode processes (or virtual machines) on the same CPU core at the same time. It’s honestly not completely clear to me right now when STIBP should be used.

IBPB appears to flush the branch prediction cache for code running at the same privilege level. This can be used when switching between two user mode programs or two virtual machines to ensure that the previous code does not interfere with the code that is about to run (though without STIBP I believe that code running on the sibling CPU thread could still poison the branch predictor).
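For readers following the kernel patches, the sketch below shows how the three features surface to software. The bit positions and MSR numbers are taken from Intel's public documentation as I understand it and should be verified against that documentation before being relied on; `SpecFeatures`, `DecodeCpuidEdx`, and `SpecCtrlValue` are hypothetical helper names, and nothing here actually executes CPUID or writes an MSR.

```cpp
#include <cstdint>

// Feature enumeration: CPUID leaf 7, subleaf 0, register EDX.
constexpr uint32_t kCpuidIbrsIbpbBit = 1u << 26;  // IBRS and IBPB supported
constexpr uint32_t kCpuidStibpBit = 1u << 27;     // STIBP supported

// Control interface: model-specific registers.
constexpr uint32_t kMsrSpecCtrl = 0x48;  // IA32_SPEC_CTRL
constexpr uint64_t kSpecCtrlIbrs = 1u << 0;
constexpr uint64_t kSpecCtrlStibp = 1u << 1;
constexpr uint32_t kMsrPredCmd = 0x49;  // IA32_PRED_CMD
constexpr uint64_t kPredCmdIbpb = 1u << 0;  // writing this bit issues the barrier

struct SpecFeatures {
  bool ibrs_ibpb;
  bool stibp;
};

// Decode the EDX value a caller obtained from CPUID leaf 7.
SpecFeatures DecodeCpuidEdx(uint32_t edx) {
  return {(edx & kCpuidIbrsIbpbBit) != 0, (edx & kCpuidStibpBit) != 0};
}

// Compute the value privileged code would write to IA32_SPEC_CTRL.
uint64_t SpecCtrlValue(bool ibrs, bool stibp) {
  return (ibrs ? kSpecCtrlIbrs : 0) | (stibp ? kSpecCtrlStibp : 0);
}
```

IBRS and STIBP are modes that stay in effect while their bit is set, whereas IBPB is a one-shot command, which matches the usage pattern described above: the former are toggled around privilege transitions and the latter is issued on a context switch.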

As of this writing, the main mitigations that I see being implemented for the branch target injection vulnerability appear to be both retpoline and IBRS. Presumably this is the fastest way to protect the kernel from user mode programs or the hypervisor from virtual machine guests. In the future I would expect both STIBP and IBPB to be deployed depending on the paranoia level of different user mode programs interfering with each other.

The cost of IBRS also appears to vary extremely widely between CPU architectures with newer Intel Skylake processors being relatively cheap compared to older processors. At Lyft, we saw an approximately 20% slowdown on certain system call heavy workloads on AWS C4 instances when the mitigations were rolled out. I would speculate that Amazon rolled out IBRS and potentially also retpoline, but I’m not sure. It appears that Google may have only rolled out retpoline in their cloud.

Over time, I would expect processors to eventually move to an IBRS “always on” model where the hardware just defaults to clean branch predictor separation between CPU threads and correctly flushes state on privilege level changes. The only reason this would not be done today is the apparent performance cost of retrofitting this functionality onto already released microarchitectures via microcode updates.

Conclusion

It is very rare that a research result fundamentally changes how computers are built and run. Meltdown and Spectre have done just that. These findings will alter hardware and software design substantially over the next 7–10 years (the next CPU hardware cycle) as designers take into account the new reality of the possibilities of data leakage via cache side-channels.

In the meantime, the Meltdown and Spectre findings and associated mitigations will have substantial implications for computer users for years to come. In the near-term, the mitigations will have a performance impact that may be substantial depending on the workload and specific hardware. This may necessitate operational changes for some infrastructures (for example, at Lyft we are aggressively moving some workloads to AWS C5 instances due to the fact that IBRS appears to run substantially faster on Skylake processors and the new Nitro hypervisor delivers interrupts directly to guests using SR-IOV and APICv, removing many virtual machine exits for IO heavy workloads). Desktop computer users are not immune either, due to proof-of-concept browser attacks using JavaScript that OS and browser vendors are working to mitigate. Additionally, due to the complexity of the vulnerabilities, it is almost certain that security researchers will find new exploits not covered by the current mitigations that will need to be patched.

Although I love working at Lyft and feel that the work we are doing in the microservice systems infrastructure space is some of the most impactful work being done in the industry right now, events like this do make me miss working on operating systems and hypervisors. I’m extremely jealous of the heroic work that was done over the last six months by a huge number of people in researching and mitigating the vulnerabilities. I would have loved to have been a part of it!

Further reading