Spectre via AVX Clock Speed Throttle? By Alexander J. Yee

Note: This blog was supposed to go up in June. But rather than posting it outright, I sent it to Intel and they asked me to wait.

For the purposes of this post, I'll call the attack described here the "AVX Clock Spectre".

Since then, a lot has happened. Most notably, NetSpectre, a closely related exploit, has been publicized. Given how similar the two attacks are, it became apparent that:

Given knowledge of either NetSpectre or AVX Clock Spectre, it would be trivial to derive the other.

The success of NetSpectre likely means that AVX Clock Spectre is also exploitable using the same methods and/or the method described here.

When I reached out to Intel again, they told me that the concepts in this blog were known both internally and to the public. Since there is no reason to further extend the information embargo, they have given me permission to publish this blog.

In any case, the original blog (as meant to be published in June) is here, in its original form.

I haven't done an off-topic blog in a long time (last time was the Venus Transit in 2012). But it felt appropriate this time.


Background

Disclaimer: I am not a security expert nor do I pretend to be. I'm just a hardware enthusiast who likes to play with SIMD. So if it sounds like I don't know what I'm talking about, now you know why.

If you're here, you probably already know all about the Spectre and Meltdown vulnerabilities, so I won't bore you by reiterating all of that. From this point on, I will assume that you have the gist of how they work and are familiar with the vanilla array-out-of-bounds Spectre example.

Since Spectre/Meltdown opened up a completely new class of vulnerabilities, new ones have been popping up on a regular basis. In this blog, I'll discuss a potential new Spectre variant based on Intel's implementation of the Advanced Vector Extensions (AVX) instructions.

I say "potential" because it is currently theoretical. The viability of this attack is contingent on one processor architectural detail of which I am not certain. Furthermore, I have not tried to make a proof-of-concept of the attack. I'll leave that to the experts who actually know what they are doing.

The Exploit

High-Level Description:

AVX is a 256-bit SIMD instruction set extension for x86 processors. It was first introduced by Intel in 2011 with the Sandy Bridge processor line and is now widely supported by everyday hardware. AVX512 is the latest variant, which extends it to 512-bit vectors.

The purpose of AVX is to perform multiple operations in a single instruction. By the very nature of this, AVX instructions consume more power than regular instructions. In fact, they consume so much power that recent Intel processors will lower the clock speed to maintain stability and to avoid exceeding thermal limitations. It is this clock speed throttle that may be vulnerable to a Spectre side-channel attack.
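To make the throttling mechanism concrete, here is a toy model of width-dependent frequency levels. The level names and frequency numbers are purely illustrative (my own invention, not Intel's actual tables); real processors negotiate turbo frequencies with finer granularity and per-SKU values. The point is only that the core runs at the speed of the widest instruction class it has recently seen.

```cpp
#include <algorithm>
#include <cstdint>

// Toy model of AVX-induced frequency throttling. Levels and frequencies
// are illustrative only, not real Intel behavior.
enum class License { Scalar = 0, AVX256 = 1, AVX512 = 2 };

// Hypothetical frequency (MHz) for each level.
uint32_t frequency_mhz(License lvl){
    switch (lvl){
        case License::Scalar: return 3500;
        case License::AVX256: return 3300;
        case License::AVX512: return 2800;
    }
    return 0;
}

// The core runs at the frequency of the widest instruction class seen
// "recently" -- in this toy model, simply the max of the two levels.
License effective_license(License a, License b){
    return static_cast<License>(std::max(static_cast<int>(a), static_cast<int>(b)));
}
```

In this model, mixing even one heavy AVX instruction into scalar code drags the whole core down to the lower frequency, which is exactly the observable the attack exploits.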

Vanilla Spectre Pseudocode:

```cpp
char buffer[4097];

// Flush "buffer" completely out of cache.

if (false){    // Evaluates to false, but predicted true.
    int victim_data = *victim_addr;
    char junk = buffer[(victim_data & 1) * 4096];
}

// Wait a few hundred cycles.

// Time how long this takes:
char read = buffer[0];
```

The original Spectre example used cache timings to expose information. Inside a mispredicted branch, you load the victim data (0 or 1) and use it to conditionally touch a cacheline. After recovering from the misprediction, you read the same cacheline manually. Depending on how long that read takes, you can infer whether the value at the victim address is 0 or 1. This process can then be repeated to read arbitrary amounts of data.
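The timing half of this can be sketched directly. This is an x86-only sketch (the function names are mine) assuming GCC/Clang with `<x86intrin.h>`; absolute cycle counts vary wildly by machine, so it only compares relative timings of a cached load versus a flushed load.

```cpp
#include <x86intrin.h>   // __rdtsc, _mm_clflush, _mm_mfence
#include <cstdint>

// Measure how many TSC ticks a single load of *p takes.
// The fences keep the rdtsc reads from drifting around the load.
inline uint64_t time_load(const volatile char* p){
    _mm_mfence();
    uint64_t start = __rdtsc();
    (void)*p;
    _mm_mfence();
    return __rdtsc() - start;
}

// Returns true if flushed loads are slower in aggregate than cached loads,
// which is the signal the vanilla Spectre recovery step relies on.
bool flush_is_slower(int trials = 64){
    static volatile char probe = 1;
    uint64_t hit_total = 0, miss_total = 0;
    for (int i = 0; i < trials; i++){
        (void)probe;                         // Touch the line so it is cached.
        hit_total += time_load(&probe);
        _mm_clflush((const void*)&probe);    // Evict the line.
        _mm_mfence();
        miss_total += time_load(&probe);     // This load misses out to memory.
    }
    return miss_total > hit_total;
}
```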

The following caveats apply:

The attacker needs to train the branch predictor such that it reliably mispredicts the branch.

The attacker needs to reset the cache state between attacks.

AVX Spectre Pseudocode:

```cpp
__m256d junk;

if (false){    // Evaluates to false, but predicted true.
    int victim_data = *victim_addr;
    if (victim_data & 1){
        junk = _mm256_mul_pd(junk, junk);    // Run some heavy AVX instruction.
    }
}

// Wait 500 microseconds for a possible throttle to kick in.

// Run a benchmark.
```

The AVX Spectre exploit follows the same general approach. Inside a mispredicted branch, you load the victim data. But instead of using it to conditionally touch a cacheline, you use it to conditionally execute an AVX instruction. If an AVX instruction is executed, the processor will reduce its clock speed. By running a benchmark to determine whether a clock speed reduction has occurred, you can infer whether the value at the victim address is 0 or 1.

Why should this work? According to section 15.26 of Intel's architecture documentation:

A single AVX or AVX512 instruction of the right type is enough to trigger the clock speed throttle.

The clock speed throttle can be triggered by instructions that are run speculatively.

When the clock throttle kicks in, it takes 2 milliseconds of executing no AVX instructions before the clock speed returns to its original value. This is more than long enough to run a benchmark.

The Inner Branch:

If the description above sounds too simple, it is. There is one major barrier standing in the way: You need a branch in order to conditionally execute an AVX instruction. And this 2nd (inner) branch is also subject to branch prediction.

Why is a 2nd branch needed?

There are no AVX or AVX512 instructions that can conditionally affect the clock speed based on data input.

Indirect branches suffer the same problem with prediction.

Self-modifying code will flush the entire instruction pipeline and force you out of speculation from the 1st (outer) branch.

So why is the 2nd branch problematic?

| Branch Prediction of Inner Branch | Inner Branch Predicted Taken (no-AVX) | Inner Branch Predicted Not-Taken (AVX) |
|---|---|---|
| Victim Bit = 0 (Don't run AVX) | Scenario A: CPU takes the branch and skips over the AVX instruction. This prediction is correct. No roll-back happens. No AVX instruction is ever executed. Clock Speed Throttle: No | Scenario B: CPU skips the branch and executes the AVX instruction. This is a misprediction, but it doesn't matter what happens next since an AVX instruction has already been executed speculatively. Clock Speed Throttle: Yes |
| Victim Bit = 1 (Run AVX) | Scenario C: CPU takes the branch and skips over the AVX instruction. This is a misprediction. But because this is the 2nd branch, the outer branch misprediction resolves first, so the 2nd branch is never re-executed down the correct path. No AVX instruction is ever executed. Clock Speed Throttle: No | Scenario D: CPU skips the branch and executes the AVX instruction. This prediction is correct. No roll-back happens. An AVX instruction is executed speculatively. Clock Speed Throttle: Yes |

Regardless of what the victim bit is and how the inner branch is predicted, there is no way to determine whether the victim bit is a 0 or a 1. The only thing that affects the clock speed is the direction in which the inner branch is predicted to go. So even if you can manipulate the branch predictor to force the inner branch in either direction you want, it won't help you infer the value of the victim bit.
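This dead end can be stated as a one-line model. In this naive model (my own formulation, assuming the outer misprediction always resolves first so the inner branch is never re-executed), the throttle depends only on the inner branch's prediction, never on the victim bit:

```cpp
// Naive model of the four scenarios: an AVX instruction runs if and only
// if the inner branch was *predicted* down the AVX path (Scenarios B and D).
// The victim bit has no influence at all.
bool throttle_naive(bool /*victim_bit*/, bool inner_predicted_avx){
    return inner_predicted_avx;
}
```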

So what gives?

Slowing Down the Outer Branch:

This is where this entire article and the AVX Spectre exploit become iffy.

Looking at the table above, the weakest link is Scenario C. The key question: if the processor is multiple branches deep into speculation (all of them mispredictions), are the earlier (outer) branches necessarily resolved before the later (inner) branches?

If the answer is yes, then the proposed exploit is not viable as described and thus this article is a complete waste of time.

If the answer is no, it may be possible to nudge Scenario C so that the inner branch resolves first - thus executing the AVX instruction.

As of this writing I do not know the answer to this question. But let's assume no, it is possible for a later branch to resolve before an earlier branch. How would we go about influencing this?

The somewhat obvious approach is to:

Slow down the outer branch by making it depend on a very long cache miss.

Make the inner branch depend on something that is known immediately.

Since the inner branch is dependent on the victim bit, this means that the victim address needs to be brought into cache prior to the attack. The victim address is inaccessible, otherwise you wouldn't need Spectre to read it. But you can bring it into cache either by prefetching it, or reading it through a mispredicted branch.

With the outer branch taking hundreds of cycles to resolve, the inner branch has enough time to resolve, roll-back, and execute down the correct path. This executes the AVX instruction and causes the clock speed throttle. So the new table looks like this:

| Branch Prediction of Inner Branch | Inner Branch Predicted Taken (no-AVX) | Inner Branch Predicted Not-Taken (AVX) |
|---|---|---|
| Victim Bit = 0 (Don't run AVX) | Scenario A: CPU takes the branch and skips over the AVX instruction. This prediction is correct. No roll-back happens. No AVX instruction is ever executed. Clock Speed Throttle: No | Scenario B: CPU skips the branch and executes the AVX instruction. This is a misprediction, but it doesn't matter what happens next since an AVX instruction has already been executed speculatively. Clock Speed Throttle: Yes |
| Victim Bit = 1 (Run AVX) | Scenario C: CPU takes the branch and skips over the AVX instruction. This is a misprediction. The inner branch resolves, rolls back, and executes down the correct path, which contains the AVX instruction. An AVX instruction is executed speculatively. Clock Speed Throttle: Yes | Scenario D: CPU skips the branch and executes the AVX instruction. This prediction is correct. No roll-back happens. An AVX instruction is executed speculatively. Clock Speed Throttle: Yes |

Finally! If we train the branch predictor so that it always predicts taken, we can now distinguish 0 vs. 1 for the victim bit.
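The revised behavior is again a one-line model (my own formulation, under the assumption that the slowed outer branch lets the inner branch resolve and re-steer first). With the inner branch trained to be predicted taken (no-AVX), the throttle now tracks the victim bit:

```cpp
// Model with the outer branch slowed by a cache miss: a mispredicted inner
// branch gets re-executed down the correct path while still inside the outer
// branch's speculation window. Scenarios B and D still always throttle;
// Scenario C now runs the AVX path too, so the victim bit leaks.
bool throttle_with_slow_outer(bool victim_bit, bool inner_predicted_avx){
    if (inner_predicted_avx) return true;   // Scenarios B and D.
    return victim_bit;                      // Scenario A (false) vs. C (true).
}
```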

Summary/Pseudocode of Attack:

```cpp
bool condition = false;
uint64_t threshold;    // Some pre-calibrated number.

uint64_t run_benchmark(){
    uint64_t start = __rdtsc();
    // Do some garbage work that takes about 1 - 2 milliseconds.
    return __rdtsc() - start;
}

bool read_bit(const bool* victim_addr){
    // Step 1: Flush "condition" out of cache.
    _mm_clflush(&condition);

    // Step 2: Prefetch victim data.
    _mm_prefetch((char*)victim_addr, _MM_HINT_T0);

    // Step 3: Train branch predictor so that:
    //   Outer branch will be predicted not taken. (enter if-statement)
    //   Inner branch will be predicted taken. (skip if-statement)

    __m256d junk;

    // Step 4: Run exploit.
    if (condition){    // Outer Branch: Evaluates to false, but predicted true.
        bool victim_data = *victim_addr;
        if (victim_data){    // Inner Branch: Predicted false.
            junk = _mm256_mul_pd(junk, junk);    // Run some heavy AVX instruction.
        }
    }

    // Step 5: Wait 500 microseconds for a possible throttle to kick in.

    // Step 6: Run a benchmark.
    uint64_t score = run_benchmark();
    return score < threshold;
}
```

Scope and Impact

The AVX Spectre, if exploitable, requires that the processor have a measurable change in performance following an AVX instruction. These processors include:

Intel Haswell (servers and some laptops)

Intel Broadwell (servers and some laptops)

Intel Skylake (some laptops?)

Intel Kaby Lake

Intel Coffee Lake

Intel Skylake X and Skylake Purley

Intel Cannonlake

This is basically the majority of Intel processors since 2014-ish. AMD processors aren't affected since they don't run AVX at full throughput in the first place. So unlike Intel processors, AMD processors don't need to downclock to keep the thermals within limits.

On the software side, the scope of the AVX Spectre should theoretically be largely the same as the original Spectre as described in the paper. If the attacking code is native code, it will be able to access all memory in the same address space regardless of access restrictions. Likewise, it may be possible to escape browser sandboxing. Though finding vulnerable code in libraries and VMs may be more difficult due to the need for an AVX instruction.

However, there is one notable difference that could make AVX Spectre more easily exploited.

The original Spectre exploit requires a high precision timer to measure cache misses that are on the order of nanoseconds. The resulting browser and sandbox mitigations were to reduce the precision of timers and disable any shared memory functionality which would be used to create a timer.

However, AVX Spectre relies on clock speed fluctuations that are on the order of milliseconds. Not only is this long enough to not need a high precision timer, but it is also long enough to survive a context switch - thus opening up the attack surface to other processes in the system. Likewise, the clock speed throttle can be observed by another thread if it is running on the same physical core in the case of Hyperthreading.

The longer timespans of AVX Spectre mean that its bandwidth is much lower than the original Spectre's. Since it takes approximately 2.5 milliseconds for a full clock speed fluctuation, the exploit cannot reliably read more than 400 bits/second per physical CPU core.
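The arithmetic behind that ceiling is just the reciprocal of the fluctuation time, assuming (as the attack requires) at most one bit leaked per full throttle-and-recover cycle:

```cpp
// Back-of-the-envelope bandwidth limit: one bit per full throttle cycle.
// 1 / 0.0025 s = 400 bits/second per core.
double max_bits_per_second(double seconds_per_fluctuation){
    return 1.0 / seconds_per_fluctuation;
}
```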

Mitigations

Software:

For high-level applications like web browsers and JIT compilers for managed languages, the most straightforward solution is to not generate any AVX instructions at all. The performance hit should be minimal since auto-vectorization is still pretty much a lost cause with current compiler technology. In other words, there isn't much to lose by throwing away AVX for JIT'ed code. Furthermore, JITs don't "try as hard" anyway, since they compile in real time whereas offline compilation has unlimited time.

For low-level or native applications, things get more murky. The only two obvious mitigations are nuclear options:

Disable AVX completely at the operating system level (by turning off XSAVE).

Force the processor to run at a constant frequency.

The 1st option will basically kill off many HPC applications. Microsoft, please don't do this. You will break my heart.

The 2nd option means downclocking the processor for all workloads to the AVX or AVX512 speed. Unfortunately, this will cripple everything non-AVX.

Hardware:

This is easy. Do not let AVX and AVX512 instructions trigger the clock down until they are no longer speculative. A delay of a few hundred cycles to wait out any possible speculation is negligible compared to the millisecond intervals in which the clock downs operate. Perhaps this can even be done with a microcode update, but only Intel can tell us that.

An alternate (and better) approach is to not initiate the clock down until "many" AVX instructions have been executed, where "many" means "large enough to exceed any possible speculation window". This approach also avoids clock downs caused by one-off AVX instructions generated by an overeager compiler in otherwise non-vectorizable code.
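This counting scheme can be sketched as a toy state machine. The threshold value is arbitrary and illustrative, and a real implementation would of course live in hardware or microcode, not in software:

```cpp
#include <cstdint>

// Sketch of the proposed mitigation: only engage the clock down after
// "many" AVX instructions have *retired* (i.e. are no longer speculative).
// Speculative AVX instructions that get rolled back are never counted,
// so a mispredicted branch alone can no longer trigger the throttle.
class ThrottleGate {
    uint64_t retired_avx_ = 0;
    static constexpr uint64_t kThreshold = 1000;   // Illustrative: "larger than any speculation window".
public:
    void on_avx_retired(uint64_t count = 1){ retired_avx_ += count; }
    void on_speculation_rollback()          { /* rolled-back AVX is never counted */ }
    bool throttle_engaged() const           { return retired_avx_ >= kThreshold; }
};
```

A one-off AVX instruction (or a speculative one that rolls back) leaves the gate closed; only sustained, retired AVX work crosses the threshold.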

Conclusion

Not much to say other than this could be yet another drop in the rainstorm of Spectre exploits. But I must reiterate that the AVX Spectre, as described in this article, is nothing more than a theoretical attack at this point. So it is possible that it may be neither viable nor exploitable.

The whole attack is contingent on the ability to reverse the order in which successive branch mispredictions are resolved. And while I consider myself knowledgeable in low-level architecture, my knowledge is not (yet) at this extreme level of detail. So I'm hoping someone else can take over and finish the attack with a proof-of-concept.

If the AVX Spectre is real and exploitable, it can probably be generalized to any instruction that causes a measurable performance difference.

Questions or Comments

I don't do comments on this site, so just hit me up on Twitter. For any extended discussions, I can start a Gist on GitHub or something.

Otherwise, contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.