Recent news reports speak of a new security vulnerability known as "TLBleed", a novel form of timing side-channel attack exploiting the tightly-coupled nature of the shared resources found in some high-performance microprocessors that implement Simultaneous MultiThreading (SMT). As reported, the attack exploits Intel’s implementation of SMT, commonly known as "Hyper-Threading", to observe a sibling hyperthread running code vulnerable to timing analysis, through that thread’s activity in a common data-side Translation Lookaside Buffer (TLB) shared by peer threads.

This blog post will deconstruct this scenario and talk a little about how we are collaborating with industry peers in order to mitigate the impact on our customers and their sensitive data.

Timing side-channel attacks aren’t new, but the mechanism used in TLBleed (exploiting the TLB implementation) is novel. We’ll look at TLBs as a side channel later, but first, it’s worth understanding what is meant by a timing attack or timing side-channel. As a refresher, side-channels are unintended mechanisms through which data can be leaked about a system’s operation. Many modern exploits leverage side-channel attacks. In fact, in recent times we’ve seen computer cache side-channels (such as in Meltdown and Spectre) become popular in both academic research and published vulnerabilities. You may recall our earlier blogs and materials describing some of them. Other common side-channels include "differential power analysis", through which power utilization of a machine can be precisely monitored.

The latter is quite interesting as an everyday example. In differential power analysis, an attacker exploits the possibility that individual operations - such as cryptographic key management and generation - performed by a machine cause different amounts of power to be used. Careful observation can thus be made, and the precise internal state of the machine (and its secrets) reconstructed. Designers of bank chip cards and other sensitive cryptographic devices (like VPN login tokens) pay careful attention to this known side-channel and specifically design their hardware to resist differential power analysis attacks. Indeed, whole industries of consultants have been providing analytical services to companies making bank chip cards for many years.

Timing attacks are similar to differential power analysis in that they have traditionally been used to attack software and hardware cryptographic functions. But timing attacks exploit the ability to precisely observe time rather than power usage. In a computer system vulnerable to a timing side-channel attack, individual operations take a differing amount of time to perform depending upon the data that is being processed or manipulated, such as the generation of a cryptographic key. For example, in the AES encryption standard, one of the steps (MixColumns) involves a multiplication step. An implementation vulnerable to a timing side-channel might use multiply operations that take different amounts of time based upon the value being multiplied.
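As a rough illustration of a data-dependent multiply, here is a minimal Python sketch of the "multiply by 2" helper that MixColumns implementations commonly build on. The function names `xtime_branchy` and `xtime_branchless` are our own, and Python itself gives no timing guarantees, so this only shows the shape of the problem: the first version does extra work when the top bit of the input is set, while the second performs the same operations for every input.

```python
def xtime_branchy(a):
    """Multiply a byte by 2 in GF(2^8); the branch makes the work depend on the data."""
    a <<= 1
    if a & 0x100:           # top bit of the input was set: reduce by the AES polynomial
        a ^= 0x11B
    return a


def xtime_branchless(a):
    """Same result, but the same sequence of operations runs for every input."""
    mask = -(a >> 7)        # 0 if the top bit is clear, -1 (all ones) if it is set
    return ((a << 1) ^ (0x1B & mask)) & 0xFF
```

Both functions return the same values for all 256 byte inputs; only the control flow differs.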

The operations under measurement in a timing side-channel attack can be quite coarse, such as the time taken by a software library helper function that varies depending upon the data given to it as inputs. Alternatively, an attacker might be able to measure the time taken by individual microprocessor instructions that form the program under attack. In the latter case, they exploit implementations of simple operations that have been optimized to "shortcut" under some circumstances, such as when a microprocessor knows that a multiply operation uses a small value that needs fewer individual multiplication sub-stages. As a result of these well-known attacks, it is common for encryption libraries to be "hardened" such that they perform critical calculations in "constant time", independent of value. Further, some microprocessors provide constant time variants of their instructions that sacrifice optimizations for secure operation.
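To make the "coarse" case concrete, here is a small sketch of a helper function whose running time depends on its input, next to a hardened variant. The names `leaky_equal` and `constant_time_equal` are hypothetical, and we count loop steps as a stand-in for elapsed time so the effect is visible without a precise clock. In real Python code, the standard library's `hmac.compare_digest` is the usual choice for comparing secrets.

```python
import hmac


def leaky_equal(secret, guess):
    """Early-exit comparison; returns (result, number of bytes examined)."""
    steps = 0
    if len(secret) != len(guess):
        return False, steps
    for s, g in zip(secret, guess):
        steps += 1
        if s != g:
            return False, steps   # stops at the first mismatch: work depends on the data
    return True, steps


def constant_time_equal(secret, guess):
    """Examines every byte, wherever the first mismatch occurs."""
    steps = 0
    if len(secret) != len(guess):
        return False, steps
    diff = 0
    for s, g in zip(secret, guess):
        steps += 1
        diff |= s ^ g             # accumulate differences without branching on them
    return diff == 0, steps


# The standard library provides a hardened comparison for real use.
assert hmac.compare_digest(b"secret", b"secret")
```

With `leaky_equal`, a guess that gets more leading bytes right takes more steps, which is exactly the signal a timing attacker looks for; `constant_time_equal` always takes the same number of steps for inputs of equal length.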

A growing body of academic research already exists into timing side-channel attacks, and it is common for communities of both academics and cryptographic library developers to collaborate on the development of software hardened against timing side-channel analysis. It is with this in mind that many open source (as well as proprietary) cryptographic software libraries worth their salt already consider timing side-channels during their design and development. Again, timing attacks are not a new concept and their exploitation will continue long after the industry collectively addresses any one mechanism through which timing can be observed.

With the context of timing side-channels set, let’s turn our attention to TLBleed. The exact details are yet to be disclosed publicly, and we will not do so here, but we will describe the attack to the extent already disclosed by the researchers.

Modern microprocessors support a level of abstraction known as "virtual" memory. In this model, applications (as well as the operating system kernel, such as Linux) each have their own view of system memory as if they owned all of it. The individual memory addresses, known as Virtual Addresses (VAs), used by the application appear to be linear and span a very large range. Under the hood, the operating system works in tandem with the hardware to map these virtual addresses into Physical Addresses (PAs) that are contained within the (much smaller) physical RAM chips of the machine. This is done using a concept known as "paging" in which the operating system maintains "page tables" for each application (and the kernel) that translate virtual addresses into physical addresses. The name "page" comes from the fact that the individual translations happen at a fixed granularity - typically 4KB - known as the "page size".
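The arithmetic behind a single translation is simple, and can be sketched as follows. This is a toy model assuming 4KB pages and a flat Python dictionary standing in for the page table; the names `split_virtual_address` and `translate` are ours, not those of any real kernel interface.

```python
PAGE_SIZE = 4096    # 4KB pages, as described above
PAGE_SHIFT = 12     # log2(PAGE_SIZE)


def split_virtual_address(va):
    """Return (virtual page number, offset within the page)."""
    return va >> PAGE_SHIFT, va & (PAGE_SIZE - 1)


def translate(va, page_table):
    """Translate a virtual address using a toy {page number: frame number} table."""
    vpn, offset = split_virtual_address(va)
    pfn = page_table[vpn]               # a KeyError here models a page fault
    return (pfn << PAGE_SHIFT) | offset
```

Only the page number is translated; the low 12 bits (the offset within the page) pass through unchanged.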

The page size stems from the fact that the tables describing memory address translations themselves take up space. A compromise must be reached between a desire for maximum flexibility, coupled with maintaining manageability and limits upon the growth of page tables. As a result, not only is the concept of pages used to group memory ranges into manageable sized blocks, but the tables used to manage paging are multi-level, and based upon the virtual address. Ranges of virtual memory that are not in use (the vast majority of virtual memory for most applications) don’t really take up any overhead, while those that are in use by an application utilize sub-level page tables. The page tables are stored in physical memory and microprocessors access them directly, without using the virtual memory abstraction.
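A small sketch shows why multi-level tables keep unused ranges cheap. We assume the common x86-64 layout of four levels with 9 index bits each on top of a 12-bit page offset, and model each table as a Python dictionary that is only created when something beneath it is mapped. The helper names are hypothetical.

```python
LEVELS = 4            # assumption: four-level tables, as on x86-64 with 4KB pages
BITS_PER_LEVEL = 9    # each table is indexed by 9 bits of the virtual address
PAGE_SHIFT = 12


def level_indices(va):
    """Split a virtual address into one table index per level, top level first."""
    vpn = va >> PAGE_SHIFT
    mask = (1 << BITS_PER_LEVEL) - 1
    return [(vpn >> (BITS_PER_LEVEL * level)) & mask
            for level in reversed(range(LEVELS))]


def map_page(root, va, pfn):
    """Install a translation, creating sub-level tables only where needed."""
    *upper, last = level_indices(va)
    table = root
    for index in upper:
        table = table.setdefault(index, {})
    table[last] = pfn


def count_tables(table, depth=1):
    """Count how many tables exist, a rough proxy for page table overhead."""
    if depth == LEVELS:
        return 1
    return 1 + sum(count_tables(child, depth + 1) for child in table.values())
```

Mapping one page creates four tables (one per level); mapping a neighbouring page reuses all of them, and the untouched remainder of the address space costs nothing.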

As a result of using page tables that are themselves stored in memory that is external to the microprocessor chip, looking up a translation from a virtual address to its underlying physical address can require many individual (slow, compared to the speed of the microprocessor) reads from external RAM. This is known as a "page table walk", and on commodity machines, this can be 5 or more individual memory reads for a single translation. Needless to say, it would be incredibly inefficient if the processor had to perform a hardware page table walk for every single virtual memory access. So, it doesn’t. Instead, modern processors implement a feature known as a TLB or "Translation Lookaside Buffer". TLBs cache translations from virtual to physical addresses for the currently running application. While they behave as caches in the sense of storing recently used translations for memory addresses, this has nothing to do with CPU data caches that store the actual content of those memory locations; those are entirely separate.
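The benefit is easy to see in a toy model. The sketch below assumes each page table walk costs four memory reads and that the TLB holds a handful of entries with least-recently-used replacement; both numbers and the `ToyTLB` class are illustrative assumptions, not a description of any real processor.

```python
from collections import OrderedDict

WALK_READS = 4   # assumption: one memory read per level of a four-level table


class ToyTLB:
    def __init__(self, page_table, capacity=4):
        self.page_table = page_table    # flat {vpn: pfn} dict standing in for the real tables
        self.capacity = capacity
        self.entries = OrderedDict()    # cached translations, oldest first
        self.memory_reads = 0

    def translate(self, va):
        vpn, offset = va >> 12, va & 0xFFF
        if vpn in self.entries:
            self.entries.move_to_end(vpn)            # hit: no page table walk needed
        else:
            self.memory_reads += WALK_READS          # miss: pay for a slow walk
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)     # evict the least recently used entry
            self.entries[vpn] = self.page_table[vpn]
        return (self.entries[vpn] << 12) | offset
```

A thousand accesses within one page cost a single walk in this model, rather than a thousand.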

TLB implementation can be quite complex. It’s part of the "fast path" logic in modern CPUs and heavily optimized. Normally, entries are automatically allocated as the processor uses virtual memory, and evictions (replacement of entries) also happen automatically whenever the TLB is full and a new translation is needed. In that case, various hardware-level algorithms will seek to replace older translations with a newer one. If the older translation is needed subsequently, it will need to be recreated from a slow page table walk. TLBs are usually managed per virtual address space and traditionally were flushed (by the operating system) on process context switch from one application to another. Because this adds overhead, requiring that TLBs be repopulated by slow page table walks following each process switch, recent CPUs from most vendors implement address space IDs (known as PCIDs in x86) and the TLBs are actually tagged by active process address space. This aided greatly in reducing the impact of the Meltdown mitigations since the kernel now has a separate address space to manage.
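The effect of tagging can be sketched by counting page table walks with and without it. This toy model assumes an unbounded TLB and two processes that alternate, each touching the same four pages; the `TaggedTLB` class and `simulate` helper are hypothetical.

```python
class TaggedTLB:
    def __init__(self, tagged):
        self.tagged = tagged
        self.entries = set()      # cached (address space ID, page number) pairs
        self.current_asid = 0
        self.walks = 0

    def context_switch(self, asid):
        if not self.tagged:
            self.entries.clear()  # untagged: stale entries must be flushed
        self.current_asid = asid

    def access(self, vpn):
        key = (self.current_asid if self.tagged else 0, vpn)
        if key not in self.entries:
            self.walks += 1       # miss: a slow page table walk repopulates the entry
            self.entries.add(key)


def simulate(tagged, rounds=10, pages=4):
    """Alternate two processes, each touching the same few pages per time slice."""
    tlb = TaggedTLB(tagged)
    for _ in range(rounds):
        for asid in (1, 2):
            tlb.context_switch(asid)
            for vpn in range(pages):
                tlb.access(vpn)
    return tlb.walks
```

In this model the untagged TLB pays for 80 walks over ten rounds, while the tagged one pays for only 8, since each process finds its entries intact when it is scheduled again.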

Operating systems don’t have direct visibility into the TLBs. We can flush their content (under certain circumstances), and we can manage the page tables that are used to populate their entries, but we cannot directly read the TLBs or even know their structure. Indeed, most microprocessor vendors won’t publish the precise details of their TLB implementation (known as the "organization"), since this is deemed to be proprietary information. But it is possible to infer some level of detail about a TLB organization by monitoring memory access latencies and using other related techniques. In this way, it is possible to derive the replacement algorithms that are used to determine which entries of a TLB will be evicted, or how the hashing process within the hardware takes a virtual address and looks up the correct entry of the TLB for that translation.
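To give a flavour of this kind of inference, here is a sketch against a toy TLB whose parameters are hidden from the prober. It assumes a set-associative design with simple modulo indexing and least-recently-used replacement (real hardware may hash addresses differently), and `access` returning hit or miss stands in for a latency measurement. All names are hypothetical.

```python
from collections import OrderedDict


class HiddenTLB:
    """Toy set-associative TLB; the prober only sees hit/miss, not the parameters."""

    def __init__(self, num_sets, ways):
        self._num_sets, self._ways = num_sets, ways
        self._sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, vpn):
        """Return True on a hit (fast) or False on a miss (slow page table walk)."""
        entries = self._sets[vpn % self._num_sets]
        hit = vpn in entries
        if hit:
            entries.move_to_end(vpn)
        else:
            if len(entries) >= self._ways:
                entries.popitem(last=False)   # evict the least recently used entry
            entries[vpn] = True
        return hit


def evicts_first(tlb, base, stride, count):
    """Touch `count` pages `stride` apart, then check whether the first was evicted."""
    pages = [base + i * stride for i in range(count)]
    for vpn in pages:
        tlb.access(vpn)
    return not tlb.access(pages[0])


def infer_organization(tlb, max_sets=64, max_ways=16):
    """Guess (number of sets, ways) from hit/miss observations alone."""
    smallest = {}                 # stride -> fewest pages that evict the first one
    base, stride = 0, 1
    while stride <= max_sets:
        for count in range(2, max_ways + 2):
            base += 1 << 20       # fresh page numbers, so earlier probes don't interfere
            if evicts_first(tlb, base, stride, count):
                smallest[stride] = count
                break
        stride *= 2
    min_count = min(smallest.values())
    num_sets = min(s for s, c in smallest.items() if c == min_count)
    return num_sets, min_count - 1
```

The idea is that pages spaced exactly one "number of sets" apart all land in the same set, so the first is evicted as soon as one more page than the associativity has been touched; smaller strides need more pages to produce the same eviction.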

This is what the Systems and Network Security Group at Vrije Universiteit Amsterdam (VUSec) has been able to achieve. By understanding the organization of the TLBs used in contemporary Intel microprocessors, the VUSec team is able to determine which entries will be used for given virtual address translations. Consequently, they are able to perform attacks similar to those used in cache side-channel analysis, in which they create intentionally conflicting TLB entries and monitor for evictions to infer the TLB use by another application.
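The general "fill, wait, re-check" pattern borrowed from cache side-channel analysis can be sketched on a toy shared TLB. To be clear, this is not the VUSec team's method, whose details have not been published; it is a generic illustration in which a hypothetical victim touches a page in one of two TLB sets depending on a secret bit, and the observer learns that bit purely from which of its own entries were evicted.

```python
from collections import OrderedDict

NUM_SETS, WAYS = 4, 2    # assumed toy organization, known to the observer


class SharedTLB:
    def __init__(self):
        self.sets = [OrderedDict() for _ in range(NUM_SETS)]

    def access(self, owner, vpn):
        """Return True on a hit, False on a miss (which an observer sees as a slow access)."""
        entries = self.sets[vpn % NUM_SETS]
        key = (owner, vpn)
        hit = key in entries
        if hit:
            entries.move_to_end(key)
        else:
            if len(entries) >= WAYS:
                entries.popitem(last=False)   # conflict eviction, whoever owns the entry
            entries[key] = True
        return hit


def victim_step(tlb, secret_bit):
    """Hypothetical victim: the page it touches depends on a secret bit."""
    tlb.access("victim", 100 + secret_bit)    # page 100 maps to set 0, page 101 to set 1


def observe_bit(tlb, run_victim):
    monitored = [1 + NUM_SETS * i for i in range(WAYS)]   # observer pages filling set 1
    for vpn in monitored:
        tlb.access("observer", vpn)           # fill set 1 with our own entries
    run_victim()
    misses = sum(not tlb.access("observer", vpn) for vpn in monitored)
    return 1 if misses else 0                 # a miss means the victim used set 1
```

Run over a sequence of secret bits, the observer recovers each one without ever reading the victim's memory; it only needs to notice whether its own translations became slow.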

In the reported case of TLBleed, the team exploited the close coupling of sibling Hyper-Threads in Intel’s SMT implementation. SMT is the concept of duplicating minimal per-thread resources needed to maintain contextual separation and otherwise allowing multiple threads of execution to share the same underlying single processor core. Thus, two threads (or processes) share execution units (adders, multipliers, etc.), as well as other expensive (in terms of silicon real estate "area") per-core features such as TLBs and data caches. The two threads think they have separate cores to themselves, but instead "competitively" utilize the underlying resources. Since both threads are unlikely to need exactly the same resources at the same moment, careful hardware resource scheduling can improve utilization while not significantly harming performance seen by either thread. In some cases, the "win" from SMT can be as much as a 30% improvement in performance; in others, it can actually reduce performance.

As with some other implementations, Intel’s SMT Hyper-Threads "competitively" share the TLBs within the underlying processor core. The activity of one thread will adversely affect the other in terms of TLB evictions. Thus, it is possible to carefully craft malicious code that can intentionally cause TLB conflict evictions that can be measured. As the peer thread re-allocates TLB entries for its own virtual memory translations, the malicious code is able to determine (through various means) what those entries are. Careful timing of those accesses can further be used to infer what data is being manipulated. The precise details of the vulnerability are, once again, to be disclosed at a later time. The question for now is, what can we do to mitigate such attacks?

As we said, timing side-channels are far from new, and well-written code that uses secrets (such as crypto libraries) should already be hardened against such attacks. While we are bound to find some specific new examples of software weakness (and we will have to patch those when they are found), this is part of the usual process of finding and fixing bugs that happens every day, and which we take very seriously as part of our obligation to our customers and the broader community within which we collaborate.

Nonetheless, the potential threats described in TLBleed remind us of the nature of tightly-coupled resource sharing that comes with SMT, and present an opportunity to remind ourselves of its benefits and limitations, as well as good practices when it comes to deploying hardware that uses technologies such as these.

It is important to remember that SMT is intended to benefit two cooperating threads. That’s where the "T" in SMT comes from. In most SMT implementations, sibling threads share critical hardware resources such that if they aren’t actively cooperating they can significantly disrupt the throughput or performance of the other sibling thread. Microprocessor designers have invested time and effort in the underlying hardware thread schedulers that they use in an effort to reduce this hit, but it is important to remember that SMT was never intended as a means to get "cheap" extra cores. While Linux may represent threads like cores in "/proc/cpuinfo", the scheduler is (usually) smart enough to tell the difference between true cores and sibling threads.

Early kernels (many years ago) couldn’t tell the difference, which actually hurt performance as unrelated applications were scheduled on peer threads of the same core. Contemporary kernels will try to carefully balance where threads and processes run in order to avoid negative interactions, while gaining a modest benefit from the appearance of additional resources. Still, some customers, especially in High Performance Computing or Real Time (latency sensitive) environments already elect to disable SMT or to apply careful CPU "pinning" to avoid any potential negative performance interaction between threads. Thus, it is common industry practice today to tune systems with the benefits and known disadvantages of SMT in mind.
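For readers who want to see which logical CPUs are siblings on their own Linux system, here is a minimal sketch. It reads the `thread_siblings_list` files that Linux exposes under `/sys/devices/system/cpu/cpuN/topology/` and uses `os.sched_setaffinity` (Linux only) for pinning. The helper names are ours, and pinning one process to a core does not by itself keep other workloads off that core; that needs additional tuning not shown here.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,4' or '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        low, _, high = part.partition("-")
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus


def sibling_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return one sorted list of logical CPU numbers per physical core."""
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "topology", "thread_siblings_list")
    groups = set()
    for path in glob.glob(pattern):
        with open(path) as f:
            groups.add(frozenset(parse_cpu_list(f.read())))
    return sorted(sorted(group) for group in groups)


def pin_to_whole_core(pid, group):
    """Restrict a process to all sibling threads of one core, treating the core as a unit."""
    os.sched_setaffinity(pid, group)
```

On a machine with SMT enabled, `sibling_groups()` might return something like `[[0, 4], [1, 5], ...]`, where each inner list is one core; the exact numbering varies from system to system.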

Extending this notion of system tuning to security is little different. Tightly shared resources (of any kind) will always present the possibility that side-channel attacks (whether cache or timing based) can be more easily exploited. Thus it is prudent to take steps to isolate application threads in a manner that reduces the chance of a malicious user being able to co-schedule their exploit on a sibling of a thread performing a critical security operation, such as the generation of cryptographic keys. Moreover, this makes good sense from a performance perspective as unrelated applications sharing the same core are more likely to disrupt the performance of one another even if they aren’t malicious in nature to begin with. Understanding what workloads run on a system, what processes require higher levels of security, and where you need to isolate users/systems/containers is critical to managing events like these.

As a result of these considerations, Red Hat recommends that all users consider SMT as part of their normal tuning process, both to help maximize performance, and to reduce the risks introduced through potential side-channel analysis in TLBleed and other similar attacks. SMT threads should never be seen as cheap extra cores, but instead as an intrinsic part of a single core, which should be provisioned and tuned as a unit. Indeed, this approach can (in many cases) actually increase overall system throughput and performance, while also following industry security best practices which today call for the prudent use of shared resources.