Over the past several years, we have watched the Capture The Flag circuit mature in both complexity and creativity. The best CTF’s strive to push the envelope, but the skill cap of the active community has risen with it. As the lines begin to blur between a game and actuality, CTF challenges are more frequently having an impact on the real-world [1, 2, 3].

Last month while playing in the qualification round of DEFCON CTF 2019, we identified asymmetric behavior within Intel’s Transactional Synchronization Extensions (TSX). In this blogpost we will discuss how we were able to leverage this subtle issue to perform a simple, yet novel, CPU-level attack.

Modeling an attack against the CPU to escape a TSX-based machine-code jail

The Challenge

DEFCON CTF is an annual computer security competition that attracts some of the best hackers from around the globe to solve a number of complex challenges in less than 48 hours.

This year, a certain ‘shellcoding’ challenge by Davide Balzarotti caught our attention. Through creative use of Intel’s TSX instructions, the challenge implements a machine-code ‘jail’ that it can use to safely execute user supplied shellcode. The goal is to escape this jail and launch a shell on the system.

Hotel-California, a challenge from DEFCON Quals 2019

The provided challenge binary is running on a remote server that is exposed to the internet. When one connects to the challenge service, it prints a welcome message and prompts the user to enter arbitrary shellcode:

Connecting to the challenge service

Prior to running the user supplied shellcode, the challenge first executes a series of instructions which have been carefully structured to jail the executing thread. Effectively, this stub enters a transactional execution context (the ‘jail’) before subsequently throwing away the ‘keys’ required to escape it.

In the code block below, we have provided the jailiing stub that gets prepended to our shellcode on a RWX page of memory. This snippet may be useful to revisit as you make your way through the post.

loc_F70 : lea rdi , loc_F70 sub rdi , 0x14 mov eax , [rdi] mov [rdi], eax ; eax = key_X, ebx = key_Y xor rax , rax ; destroy key_X xor rcx , rcx xor rdx , rdx xor rsi , rsi xacquire lock xor [rdi], ebx ; enter transactional state, key_Z = key_X ^ key_Y xtest jnz short loc_F95 retn loc_F95 : xor rbp , rbp xor rsp , rsp xor rdi , rdi xor rbx , rbx ; destroy key_Y user_shellcode : ... ; begin executing our user-supplied shellcode

Before discussing this challenge in greater detail, it is necessary to have some understanding of the transactional execution context. Since transactional memory systems are a pretty esoteric topic, we will do our best to provide a high level overview of these concepts as they pertain to this challenge.

Transactional Memory

Transactional memory is a concurrent programming construct that allows a group of read or write memory operations to execute atomically. Intel describes their TSX implementation simply as ‘speculative locking’. Naïvely, these function a bit like high-performance CPU-level locks that ensure safe access to shared memory.

TSX added a number of new instructions to the x86 instruction set. For the purpose of this post, we will only be focusing on TSX’s Hardware Lock Elision (HLE) instruction prefixes XACQUIRE and XRELEASE :

xacquire lock mov [rax], 1 ; write a value to acquire the lock ... xrelease lock mov [rax], 0 ; release the lock by restoring its original value

The XACQUIRE instruction prefix is used to designate the start of a ‘transactional region’ (or critical section). When a thread encounters this prefix, it will take a snapshot of the processor state and begin speculatively executing. This is what we have been calling a ‘transactional execution context’ or ‘transactional state’.

While executing transactionally, the CPU will record all memory operations such as the following:

... mov rbx , [ rax + 8 ] mov rcx , [ rax + 32 ] add [ rax + 32 ] , 1 ...

These memory operations are logged by two privately held structures known as the ‘read-set’ and ‘write-set’ in the L1 cache. These sets act a bit like a ‘veil’ over the underlying process memory.

For example, when an instruction attempts to read from memory in a transactional execution context, the CPU will first attempt to fetch the desired memory from the write-set. The write-set is where the CPU stores its speculative memory modifications which are not yet committed to the backing process memory.

A pseudo-visualization of the write-set intercepting some transactional memory reads

To exit the critical section and commit the transactional state to a normal execution context, the XRELEASE prefix must be used on an instruction that will restore the lock to its original value. At this stage, the completed transaction will atomically commit the read/write operations of the speculative execution.

Transactional Aborts

While executing in a transactional context, the thread has a strict set of regulations that it must abide by. There is little guarantee that a transactional region will actually execute to completion. Failing to abide by these recommendations will cause the transaction to abort.

Specifically, Intel warns of the following:

The protected critical sections should not conflict at cache-line granularity.

The protected critical section should be short enough to not be affected by an interrupt or context switch.

The protected critical section should not perform a system call.

The protected critical section should not touch more data than fits in L1 cache.

There may be a nesting limit for speculation.

Additionally, there are a number of other caveats and limitations one must watch out for:

Asynchronous events (NMI, SMI, IPI, …) may cause the transaction to abort.

Synchronous events (signals) may cause the transaction to abort.

Invalid memory accesses will cause the transaction to abort.

Special instructions (CPUID, x87, MMX, …) will cause the transaction to abort.

Attempting to release the lock with the incorrect value will abort.

Attempting to release the lock with a different address will abort.

Writing to the lock without an XRELEASE will cause an abort.

will cause an abort. Implementation-specific conditions may cause an abort.

Background system activity may cause an abort.

…

If you are trying to do anything but execute the most basic of instructions, the transaction will probably abort.

When a transaction aborts, the CPU will revert to the checkpoint taken prior to executing the XACQUIRE instruction. This restores all registers, and discards all speculative writes that occured in the transactional state.

Transactional Jailing

With a basic understanding of transactional memory concepts, we can now turn our attention back to the CTF challenge. Specifically, we will use this section to discuss how this challenge is able to create a machine-code jail using the transactional instructions.

Prior to entering the transactional state, the challenge securely generates two random 32 bit values. For the remainder of this post, we will refer to these magic values as key_X and key_Y :

Randomly generating the 32bit keys from /dev/urandom

While describing how the XACQUIRE and XRELEASE instruction prefixes worked, we stated that they only work with instructions that write to the lock. Writing to the lock is what normally transitions the thread in and out of the transactional execution context.

Releasing the lock requires the transactional thread to write back the original lock value using an instruction prefixed with XRELEASE . This is because HLE treats the lock’s value as a ‘label’ or ‘handle’. But what happens if you don’t know what the original lock value was? Well, this is excactly the scenario this challenge creates:

... mov [rdi], eax ; eax = key_X, ebx = key_Y xor rax , rax ; destroy key_X xor rcx , rcx xor rdx , rdx xor rsi , rsi xacquire lock xor [rdi], ebx ; enter transactional state, key_Z = key_X ^ key_Y ...

To enter the transactional state, the challenge XOR’s the original lock value key_X with key_Y . This drops the thread into the transactional execution context, with key_Z (the result of the XOR) as the ‘acquired’ lock value.

The challenge is then careful to ensure all traces of the original key_X and key_Y values are cleared from both memory and registers. Only then is the challenge willing to execute our shellcode in the super-fragile transactional execution context:

... xor rbp , rbp xor rsp , rsp xor rdi , rdi xor rbx , rbx ; destroy key_Y user_shellcode : ... ; begin executing our user-supplied shellcode

Without key_X or key_Y , there is no way we can ‘decrypt’ key_Z to recover the original lock value. If we don’t have the original lock value, we cannot release the lock. If we can’t release the lock, there is no way our shellcode will be able to execute syscalls, or launch a shell.

The CPU Bug (?)

After spending some time experimenting with code execution in the transactional CPU state, we were confident that there was absolutely no way the challenge author expected us to trick the CPU into letting us make syscalls. Our focus shifted towards gracefully releasing the lock to escape the transaction.

Since this challenge happened to place the lock on the same RWX page as our shellcode, we hypothesized that an unadulterated copy of key_X might still exist in the instruction cache of the CPU. Additionally, we speculated that the CPU execution pipeline was prone to fetch from the instruction cache differently than it would the data cache.

This is because modern CPU architectures often maintain the instruction cache independent of the typical data cache.

CPU Core Pipeline Functionality of the Skylake Microarchitecture (from Intel)

Through black-box CPU testing, we were able to both test & confirm this hypothesis during the CTF.

From what we could discern, Intel’s TSX implementation does not enlighten the instruction cache to the transactional read-sets, write-sets, or elided lock values. When the instruction decoding pipeline fetches from the instruction cache, it does not trap to the active transactional memory sets.

Leap of Faith

So how does one ‘leak’ the 32bit key_X out of the instruction cache? You try to execute it.

All things considered, the basis of our attack is relatively simple. Upon entering the transactional state, we will use our arbitrary shellcode execution to jump into the elided lock value which we have been calling key_Z .

The leap of faith is used to jump directly into an elided lock

If we use our transactionally jailed shellcode to try to read the 32bit value at this location we will get the elided lock value of key_Z . But by attempting to execute the elided lock, the original underlying value of key_X is fetched out of the instruction cache, decoded, and then executed.

Because we are planning on jumping into a both random and unobservable value, we call this the ‘leap of faith’.

Solving The Challenge

Solving this challenge will require us to use the ‘leap of faith’ to infer the true value of key_X . But inferring the contents of unobservable memory using only execution is probably near impossible when working with purely static code.

Thankfully, the unobservable 32bit value of key_X is random for each attempt. This means that we can actually use bruteforce to exert some control over what the unobservable memory/code will be.

Taking a stab at interesting 'shellcode' sequences that can be bruteforced

With only four bytes of execution, the number of instructions that are useful to us here is extremely limited. To make matters worse, we can only feasibly control two of the four bytes. Our best solution script hinged on the RET imm16 instruction. Blindly bruteforcing a key_X value of C2 XX XX 90 will take approximately 64k attempts (16bits).

When we jump onto this bruteforced value, it will immediately pop a ‘return’ address off the stack (which we craft to point back into our shellcode) and jump to it. RET imm16 completes its execution by adding the unknown 16bit immediate XX XX to rsp .

A high level overview of how our TSX based Hotel-California exploit will work

Having arrived back in our shellcode, we can now diff the modified rsp from its original value. This allows us to extract the unknown imm16 from the RET imm16 instruction that we just executed. We have now recovered key_X .

; diff RSP to compute C2 XX XX into rax sub rsp , r10 mov rax , rsp ; combine all the components of key_X into ecx mov cl , 0x90 ; ecx = 0x90 shl ecx , 16 or ecx , eax ; ecx = 0x90XXXX shl ecx , 8 mov cl , 0xC2 ; ecx = 0x90XXXXC2

With key_X we will be able to de-xor a copy of the elided lock value key_Z . This will give us key_Y . Having recovered key_X and key_Y , it is now possible to release the lock and escape the transactional jail.

; decrypt key_Z & unlock mov ebx , [r8] xor ebx , ecx ; ebx = key_Z ^ key_X xrelease lock xor [r8], ebx ; lock = key_Z ^ key_Y

At this point, we have escaped the transactional jail and are free to make syscalls. Adding some standard execve() shellcode to our exploit, the only thing left to do is let the script run in parallel to capture the flag:

Using the final exploit script to capture the flag.

Our final exploit script required a 16bit bruteforce. This is not ideal, but making ~65k connections to a CTF challenge service is also not unreasonable. At ~20 attempts per second, the solution script is able to pop a shell in an hour or two.

The flag: OOO{We haven't had a proper TSX implementation here since nineteen sixty-nine}

As much as our solution might appear to align with the spirit of a shellcoding challenge, it turns out that it was not the intended exploit [1, 2, 3]. Reaching out to Davide, the author of the challenge, he was not aware of any alternative solutions to the challenge.

Thanks to an unrelenting case of tunnel vision, we missed a bug in the challenge itself and broke the CPU instead. Driven by the illusion of a solution, nothing is sacred.

Conclusion

In this blogpost, we demonstrated how asymmetric behavior in Intel’s TSX implementation could be exploited to leak memory that would otherwise be masked by transactional memory logic. We used this generic CPU-level attack as a novel solution to a challenge from the DEFCON 2019 Qualifier CTF.

Due to the unique scenario modeled by this CTF challenge, it is unlikely that this issue poses a risk to real-world applications. Rather, we hope that this post may inspire new and innovative ideas for those who wish to further research cache inconsistencies in modern CPU’s.

Special thanks to sliden & adc for their contributions in discovering this solution.