Note: For the really short version, you can skip to the summary, but then what will you do for the rest of the day?

Introduction

This is a post about AVX and AVX-512 related frequency scaling.

Now, something more than nothing has been written about this already, including cautionary tales of performance loss and some broad guidelines, so do we really need to add to the pile?

Perhaps not, but I’m doing it anyway. My angle is a lower level look, almost microscopic really, at the specific transition behaviors. One would hope that this will lead to specific, quantitative advice about exactly when various instruction types are likely to pay off, but (spoiler) I didn’t make it there in this post.

Now I wasn’t really planning on writing about this just now, but I got off on a (nested) tangent, so let’s examine the AVX-512 downclocking behavior using targeted tests. At a minimum, this is necessary background for the next post, but I hope that it is also interesting in its own right.

Note: If you are here because of your footnote fetish, skip straight to the good 🦶 stuff.

Table of Contents

You could perhaps try skipping ahead to a section that interests you using this obligatory table of contents, but the sections are not self-contained, so you’ll be better off reading the whole thing linearly.

The Source

All of the code underlying this post is available in the post1 branch of freq-bench, so you can follow along at home, check my work, and check out the behavior on your own hardware. It requires Linux and the README gives basic clues on getting started.

The source includes the data generation scripts as well as those that generate the plots. Neither shell scripting nor Python is my forte, so be gentle.

Test Structure

We want to investigate what happens when instruction stream related performance transitions occur. The most famous example is what happens when you execute an AVX-512 instruction for the first time in a while, but as we will see there are other cases.

The basic idea is that the test has a duty period and every time this period elapses, we run a test-specific payload for the duration of the payload period which consists of one or more “interesting” instructions (which depend on the test). During the entire test we sample various metrics at a best-effort fixed frequency. This repeats for the entire test period. The sample period will generally be much smaller than the duty period: in our tests we use a 5,000 μs duty period and a sample period of 1 μs, mostly.
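
To make the test structure concrete, here is a simplified C++ sketch of the measurement loop (this is not the actual freq-bench code: the names are made up and, for brevity, all periods are expressed directly in TSC ticks rather than microseconds):

    #include <cstddef>
    #include <cstdint>
    #include <x86intrin.h>  // __rdtsc()

    struct Sample { uint64_t tsc; /* plus instructions, cycles, voltage, ... */ };

    void run_test(void (*payload)(), Sample* samples, size_t max_samples,
                  uint64_t test_period, uint64_t duty_period,
                  uint64_t payload_period, uint64_t sample_period) {
        uint64_t start = __rdtsc();
        uint64_t next_duty = start, next_sample = start;
        size_t n = 0;
        while (__rdtsc() - start < test_period && n < max_samples) {
            uint64_t now = __rdtsc();
            if (now >= next_duty) {
                // run the payload for the payload period; if the period is
                // ~zero the payload runs exactly once (a "payload moment")
                uint64_t stop = now + payload_period;
                do { payload(); } while (__rdtsc() < stop);
                next_duty += duty_period;
            }
            if (now >= next_sample) {
                samples[n++] = Sample{now /* , counter reads, ... */};
                next_sample += sample_period;
            }
            // otherwise: keep spinning hot (no pause/sleep) until something is due
        }
    }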

Visually, it is something like this (showing a single duty period: one benchmark is composed of multiple duty cycles back to back):

This diagram shows the payload period as occupying a non-negligible amount of time. However, in the first few tests, the payload period is essentially zero: we run the payload function (which consists of only a couple of instructions) only once, so it is really a payload moment rather than a payload period.

Hardware

We are running these tests on a Skylake-X (SKX) W-series CPU: a W-2104 with the following license-based frequencies:

Name            License   Frequency
Non-AVX Turbo   L0        3.2 GHz
AVX Turbo       L1        2.8 GHz
AVX-512 Turbo   L2        2.4 GHz

For one (voltage) test I also use my Skylake (mobile) i7-6700HQ, running at either its nominal frequency of 2.6 GHz, or its turbo frequency of 3.5 GHz.

Tests

The basic approach this post will take is examining the CPU behavior using the test framework above, primarily varying what the payload is, and what metrics we look at. Let’s get the ball rolling with 256-bit instructions.

256-bit Integer SIMD (AVX)

For the first test we will use as payload the vporymm_vz function, which is just a single 256-bit vpor instruction, followed by a vzeroupper:

vporymm_vz:
    vpor        ymm0, ymm0, ymm0
    vzeroupper
    ret

We call the payload function only once at the start of each duty period. The duty period is set to 5000 μs and the sample period to 1 μs, and the total test time is set to 31,000 μs (so the payload will execute 7 times).

Here’s the result (plot notes), with time along the x axis, showing the measured frequency at each sample (there are three separate test runs shown):

Well, that’s really boring. The entire test runs consistently at 3.2 GHz, the nominal (L0 license) frequency, if we ignore a few uninteresting outliers.

512-bit Integer SIMD (AVX-512)

Before the crowd gets too rowdy, let’s quickly move on to the next test, which is identical except that it uses 512-bit zmm registers:

vporzmm_vz:
    vpor        zmm0, zmm0, zmm0
    vzeroupper
    ret

Here is the result:

We’ve got something to sink our teeth into!

Remember that the duty period is 5,000 μs, so at each x-axis tick we execute the payload. Now the behavior is clear: every time the payload instruction executes (at multiples of 5,000 μs), the frequency drops from the 3.2 GHz L0 license frequency down to the 2.8 GHz L1 license frequency. So far this is all pretty much as expected.

Let’s zoom in on one of the transition points at 15,000 μs:

We can make the following observations:

- There is a transition period (the rightmost of the two shaded regions, in orange) of ~11 μs where the CPU is halted: no samples occur during this period. For fun, I’ll call this a frequency transition.
- The leftmost shaded region, shown in purple, immediately following the payload execution at 15,000 μs and prior to the halted region, is ~9 μs long and the frequency remains unchanged. This is not just a test issue or measurement error: this period occurs after the payload and is consistently reproducible. Although it looks like nothing interesting is going on in this region, we’ll soon see that it is indeed special; I’ll call it a voltage-only transition.
- Although not fully shown in the zoomed plot, the lower 2.8 GHz frequency period lasts for ~650 μs.
- Not shown in the zoomed plot (but visible as a second downwards spike on the full plot, after the ~650 μs period of low frequency), there is another fully halted period of ~11 μs, after which the CPU returns to its maximum speed of 3.2 GHz (L0 license).
- These attributes are mostly consistent across the three runs (so much so that the last series, in green, mostly overlaps and obscures the others), but there are a few outliers where the return to 3.2 GHz takes somewhat longer. This pattern holds across runs: recovery is never faster than ~650 μs, but it is sometimes longer. I believe the longer recoveries occur when an interrupt during the L1 region “resets the timer”.

Enter IPC

Although it is not visible in this plot, there is something special about the behavior of 512-bit instructions in the first shaded (purple) region – that is, in the ~9 microseconds between the execution of the payload instruction and the subsequent halted period: they execute much more slowly than usual.

This is easiest to see if we extend the payload period: instead of executing the payload function once every 5,000 μs and then looping on rdtsc waiting for the next sample, we will continue to execute the payload function for 100 μs after a new duty period starts (that is, the payload period is set to 100 μs). During this time we still take samples as usual, every 1 μs – but in between samples we are executing the payload instruction(s). So one duty period now looks like 100 μs of payload followed by 4,900 μs of normal payload-free hot spinning.

We lengthen the payload period in order to examine the performance of the payload instructions. There are several metrics we could look at, but a simple one is instructions per cycle (IPC). As long as we make sure the large majority of the executed instructions are payload instructions, the IPC will largely reflect the execution of the payload.
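
Concretely, the IPC for each sample interval is just the ratio of the deltas of two hardware counters between consecutive samples. A minimal sketch, where the counter-reading helpers are assumed (e.g., thin wrappers around rdpmc or perf_event) and are not the actual freq-bench functions:

    #include <cstdint>

    // assumed helpers: read the retired-instructions and unhalted-core-cycles
    // counters for the current CPU
    uint64_t read_instructions();
    uint64_t read_cycles();

    struct CounterSnapshot { uint64_t instructions, cycles; };

    CounterSnapshot snapshot() { return { read_instructions(), read_cycles() }; }

    // IPC over one sample interval: if the interval is dominated by payload
    // instructions, this approximates the payload's IPC; during a halted period
    // neither counter advances (and no samples are taken at all)
    double ipc_between(const CounterSnapshot& prev, const CounterSnapshot& cur) {
        uint64_t dinst = cur.instructions - prev.instructions;
        uint64_t dcyc  = cur.cycles - prev.cycles;
        return dcyc ? static_cast<double>(dinst) / dcyc : 0.0;
    }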

As payload, we will use a function composed simply of 1,000 dependent 512-bit vpord instructions:

    vpord   zmm0, zmm0, zmm0
    vpord   zmm0, zmm0, zmm0
    ; ... 997 more like this
    vpord   zmm0, zmm0, zmm0
    vzeroupper
    ret

We know these vpord instructions have a latency of 1 cycle and here they are serially dependent so we expect this function to take 1,000 cycles, give or take, for an IPC of 1.0.

Here’s what the same zoomed transition point looks like for this test, with IPC plotted on the secondary axis:

First, note that in the unshaded regions on the left (before 15,000 μs) and right (after 15,100 μs), the IPC is basically irrelevant: no payload instructions are being executed, so the IPC there is just whatever the measurement code happens to net out to. We only care about the IPC in the shaded regions, where the payload is executing.

Let’s tackle the regions from right to left, which happens to run from most obvious to least obvious.

We have the blue region, running from ~15,020 μs to 15,100 μs (where the extended payload period ends). Here the IPC is right at 1 instruction per cycle, so the payload is executing at the expected rate, i.e., full speed. Keeners may point out that at the very beginning of the blue period, the IPC (and the measured frequency) is a bit noisier and slightly above 1. This is not a CPU effect, but rather a measurement one: during this phase the benchmark is catching up on samples missed during the previous halted period, which changes the payload-to-overhead ratio and bumps up the IPC (details).

The middle, orange, region shows us what we’ve already seen: the CPU is halted, so no samples occur. IPC doesn’t tell us much here.

Voltage Only Transitions

The most interesting part is the first shaded region (purple): after the payload starts running but before the halt. I call this a voltage-only transition, for reasons that will soon become clear.

Here, we see that the payload executes much more slowly, with an IPC of ~0.25. So in this region, the vpord instructions are apparently executing at four times their normal latency. I also observe an identical 4x slowdown for vpord throughput, using an identical test except with independent vpord instructions.

Perhaps surprisingly, this same slowdown occurs for 256-bit ymm instructions as well. This contradicts the conventional wisdom that on AVX-512 chips there is no penalty to using light 256-bit instructions:

The results shown above are for a test identical to the 512-bit version except that it uses 256-bit vpor ymm0, ymm0, ymm0 as the payload. It shows the same slowdown for ~9 μs after the payload starts executing, but no subsequent halt and no frequency transition. That is, it shows a voltage-only transition (the lack of a frequency transition is expected, because we don’t expect a turbo license change for light 256-bit instructions).

By now, you are probably wondering about 128-bit xmm registers. The good news is that these show no effect at all:

Here, the IPC jumps immediately to the expected value. So it appears that the CPU runs in a state where the 128-bit lanes are ready to go at all times.

The conventional wisdom regarding this “warmup” period is that the upper part of the vector units is shut down when not in use, and takes time to power up. The story goes that during this power-up period the CPU does not need to halt but it runs SIMD instructions at a reduced throughput by splitting up the input into 128-bit chunks and passing the data two or more times through the powered-on 128-bit lanes.

However, there are some observations that seem to contradict this hypothesis (in rough order from least to most convincing):

1. The observed impact to latency and throughput is ~4x, whereas I would expect 2x for simple instructions such as vpor.
2. The timing is the same for 256-bit and 512-bit instructions, despite the fact that 512-bit instructions involve at least 2x the work, i.e., they would need to be passed through the 128-bit unit at least 4 times.
3. Some instructions are more difficult to implement using this type of splitting, e.g., instructions where both the high and low output lanes depend on all of the input lanes (see how slow they are on Zen). I expected that maybe these instructions would be slower when running in split mode, but I tested vpermd and found that it runs at 4L4T, compared to 3L1T normally. So vpermd (including the 512-bit version) didn’t slow down more than vpor, and in fact in a relative sense it slowed down less (e.g., the latency only changed from 3 to 4). The fact that the latency and throughput reacted differently for this instruction seems odd, and that it now has the exact same 4L4T timing as vpor seems like a strange coincidence.
4. Oddly, when I tried to time the slowdown more precisely, I kept coming up with a fractional value around 4.2x, not 4.0x, somewhat contradicting the idea that the instruction is simply operating in a different mode, which should still produce an integral latency.
5. As it turns out, all ALU instructions are slower in this mode, not just wide SIMD ones.

It was observation 5 that sealed the deal on this not being a slowdown related to split execution. I believe what is actually happening is that the CPU is doing very fine-grained throttling when wider instructions are executing in the core. That is, the upper lanes are being used in this mode (they are either not gated at all, or are gated but enabling them is very quick, less than 1 μs), but execution frequency is reduced by 4x because CPU power delivery is not yet in a state that can handle full-speed execution of these wider instructions. While the CPU waits (e.g., for the voltage to rise, fattening the guardband) for higher power execution to be allowed, this fine-grained throttling occurs.

This throttling affects non-SIMD instructions too, causing them to execute at 4x their normal latency and inverse throughput. We can show this with the following test, which combines a single vpor ymm0, ymm0, ymm0 with N chained add eax, 0 instructions, shown here for N = 3:

    vpor    ymm0, ymm0, ymm0
    add     eax, 0x0
    add     eax, 0x0
    add     eax, 0x0
    ; repeated 9 more times

If only the vpor were slowed down, each block of 4 instructions would take 4 cycles, limited by the vpor chain (the add chain is only 3 cycles long). However, I actually measure ~12 cycles per block, indicating that we are instead limited by the add chain, with each add taking 4 cycles for a total of 12.

We can vary the number of add instructions (N) to see how long this effect persists. This table is the result:

ADD instructions (N)   Cycles/ADD   Delta Cycles (slow)   Delta Cycles (fast)
  2                    4.1           2.3                  -0.2
  3                    4.1           4.0                   0.8
  4                    4.1           4.1                   1.1
  5                    4.0           3.9                   1.1
  6                    4.1           4.3                   0.7
  7                    4.0           3.4                   1.1
  8                    4.0           4.2                   0.9
  9                    4.1           3.9                   0.8
 10                    4.1           4.3                   1.1
 20                    4.0           4.0                   1.0
 30                    4.0           4.0                   1.0
 40                    4.1           4.3                   1.0
 50                    4.1           3.9                   1.0
 60                    4.1           4.4                   1.0
 70                    4.2           4.5                   1.0
 80                    4.1           3.4                   1.0
 90                    3.6          -0.2                   1.0
100                    3.3           1.1                   1.0
120                    2.9           0.9                   1.0
140                    2.7           1.1                   1.0
160                    2.5           1.2                   1.0
180                    2.3           0.7                   0.9
200                    2.2           0.8                   1.0

The Cycles/ADD column shows the number of cycles taken per add instruction over the entire slow region (roughly the first 8-10 μs after the payload starts executing). The Delta Cycles (slow) column shows how many cycles each additional add instruction took compared to the previous row: i.e., for row N = 30, it shows how much longer each of the 10 additional add instructions took, on average, compared to the row N = 20. The Delta Cycles (fast) column is the same thing, but applies to the samples after ~10 μs, when the CPU is back up to full speed (that column shows the expected 1.0 cycles per additional add).
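
For clarity, here is roughly how the delta columns can be computed, assuming we have the average cycles per payload iteration for each N, measured separately over the slow and fast regions (this is just an illustration; the actual analysis lives in the post’s Python scripts):

    #include <map>

    // marginal cost of one extra add between two consecutive table rows:
    // delta = (cycles(N2) - cycles(N1)) / (N2 - N1)
    // applied to slow-region averages this gives "Delta Cycles (slow)",
    // and to fast-region averages "Delta Cycles (fast)"
    double delta_cycles_per_add(const std::map<int, double>& cycles_for_n,
                                int n1, int n2) {
        return (cycles_for_n.at(n2) - cycles_for_n.at(n1)) / (n2 - n1);
    }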

Here we clearly see that up to roughly 70 add instructions, interleaved with a single vpor, all the add instructions take 4 cycles, i.e., the CPU is throttled. Somewhere between 80 and 90 a transition happens: additional add instructions now take 1 cycle, but the overall time per add is (initially) still close to 4. This shows that when add instructions (and presumably any non-wide instructions) are far enough away from the closest wide SIMD instruction, they start executing at full speed. So the timings for the larger N values can be understood as a blend of a slow section of ~70-80 add instructions near the vpor, which run at 1 per 4 cycles, and the remaining section, where they run at full speed: 1 per cycle. As a rough consistency check, assuming ~75 slow adds per block, N = 200 predicts (75 × 4 + 125 × 1) / 200 ≈ 2.1 cycles per add, close to the measured 2.2.

We can probably conclude the CPU is not just throttling frequency or “duty cycling”: in that case every instruction would be slowed down by the same factor, but instead the rule is more like “latency extended to the next multiple of 4 cycles”, e.g., a latency 3 instruction like imul eax, eax, 0 ends up taking 4 cycles when the CPU is throttling. It is likely that the throttling happens at some part of the pipeline before execution, e.g., at issue or dispatch.

The transition to fast mode when the vpor instructions are spread sufficiently far apart probably reflects the size of some structure such as the IDQ (64 entries in Skylake) or the scheduler (97 entries claimed). The core could track whether any wide instruction is currently in that structure, and enforce the slow mode if so. When the vpor instructions are close enough together, there is always at least one present, but once they are spaced out enough, you get periods of fast mode.

Voltage Effects

We can actually test the theory that this transition is associated with waiting for a change in power delivery configuration. Specifically, we can observe the CPU core voltage using bits 47:32 of the MSR_PERF_STATUS MSR. Volume 4 of the Intel Software Development Manual lets us in on a secret: these bits expose the core voltage.
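
For reference, here is a minimal sketch of reading and decoding that value on Linux via the msr kernel module (requires root and modprobe msr; this is illustrative only, not the sampling code used by the benchmark). MSR_PERF_STATUS is MSR 0x198, and on these parts bits 47:32 hold the core voltage in units of 1/2^13 V:

    #include <cstdint>
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    // read the current core voltage (in volts) for a given CPU, or -1.0 on error
    double read_core_voltage(int cpu) {
        char path[64];
        snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1.0;
        uint64_t value = 0;
        // with the msr driver, a pread at offset X reads MSR X
        ssize_t got = pread(fd, &value, sizeof(value), 0x198);  // MSR_PERF_STATUS
        close(fd);
        if (got != sizeof(value)) return -1.0;
        uint64_t raw = (value >> 32) & 0xFFFF;  // bits 47:32
        return raw / 8192.0;                    // units of 2^-13 V
    }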

Let’s zoom in as usual on a transition point, in this case using a 256-bit (ymm) payload of 1,000 dependent vpor instructions. This 256-bit payload means no frequency transition, only a dispatch throttling period associated with running 256-bit instructions for the first time in a while. We plot the time it takes to run an iteration of the payload, along with the measured voltage:

The length of the throttling period is around 10 μs as usual, as shown by the period where the payload takes ~4,000 cycles (the usual 4x throttling), and the voltage is unchanged from the pre-transition period (at about 0.951 V) during the throttling period. At the moment the throttling stops, the voltage jumps to about 0.957 V, a change of about 6 mV. This happens at 2.6 GHz, the nominal non-turbo speed of my i7-6700HQ. At 3.5 GHz, the transition is from 1.167 V to 1.182 V, so both the absolute voltages and the difference (about 15 mV) are larger, consistent with the basic idea that higher frequencies need higher voltages.

So one theory is that this type of transition represents the period between when the CPU has requested a higher voltage (because wider 256-bit instructions imply a larger worst-case current delta event, hence a worst-case voltage drop) and when the higher voltage is delivered. While the core waits for the change to take effect, throttling is in effect in order to reduce the worst-case drop: without throttling there is no guarantee that a burst of wide SIMD instructions won’t drop the voltage below the minimum voltage required for safe operation at this frequency.

Attenuation

We might check if there is any attenuation of either type of transition. By attenuation I mean that if a core is transitioning between frequencies too frequently, the power management algorithm may decide to simply keep the core at the lower frequency, which can provide better overall performance once you account for the halted periods needed for each transition. This is exactly what happens for active core count transitions: too many transitions in a short period of time and the CPU will just decide to run at the lower frequency rather than incur the halts needed to transition between, e.g., the 1-core and 2-core turbos.

We check this by setting a duty period which is just above the observed recovery time from 2.8 to 3.2 GHz, to see if we still see transitions. Here’s a duty cycle of 760 μs, about 10 μs more than the observed recovery period for this test:

I’m not going to color the regions here, as by now I think you are probably (over?) familiar with them. The key points are:

- The payload starts executing at 7,600 μs, which is before the upwards frequency transition: we are still executing at 2.8 GHz, so initially the IPC is high, 1 per cycle.
- Despite the fact that we are already executing 512-bit instructions again, the frequency adjusts upwards a few μs later. Most likely what happened is that the power logic had already decided earlier (say at ~7,558 μs, just before the payload started) that an upwards transition should occur, but as we’ve seen the response is generally delayed by 8 to 10 μs, so it takes effect after the payload has already started executing.
- Of course, as soon as the transition occurs, the core is no longer in a suitable state for full-speed wide SIMD execution, so IPC drops to ~0.25.
- Another transition, back to the low frequency, occurs ~10 μs later, and then full-speed execution can resume.

So there is no attenuation, but attenuation isn’t really needed: the long (~650 μs) cooldown period between the last wide instruction and the subsequent frequency boost means that the damage from halt periods is fairly limited. This is unlike the active core count scenario, where the CPU has no control over the transition frequency (rather, it is driven by interrupts and changes in runnable processes and threads). Here, we have the worst-case scenario of transitions packed as closely as possible, but we lose only ~20 μs (for 2 transitions) out of 760 μs, less than a 3% impact. The impact of running at the lower frequency is much higher: 2.8 vs 3.2 GHz is a 12.5% impact in the case that the lowered frequency was not useful (i.e., because the wide SIMD payload represents a vanishingly small part of the total work).

What Was Left Out

There’s lots we’ve left out. We haven’t even touched:

- Checking whether xmm registers also cause a voltage-only transition if they haven’t been used for a while. We didn’t find any effect, but it is also certain that some 128-bit instructions appear in the measurement loop, which would hide the effect.
- Checking whether the voltage-only transitions implied by 256-bit instructions are disjoint from those for 512-bit instructions. That is, if you execute a 256-bit instruction after a while without any, you get a voltage-only transition (confirmed above). If you then execute a 512-bit instruction before the relaxation period expires, do you get a second throttling period prior to the frequency transition? I believe so, but I haven’t checked it.
- Any type of investigation of “heavy” 256-bit or 512-bit instructions. These require a license one level (numerically) higher than light instructions, and knowing whether any of the key timings change would be interesting.
- Almost no investigation was made into how any of these timings (and the magnitude of the voltage changes) vary with frequency. For example, if we are already running at a lower frequency, frequency transitions are presumably not needed, and voltage-only transitions may be shorter.

Summary

For the benefit of anyone who just skipped to the bottom, or whose eyes glazed over at some point, here’s a summary of the key findings:

After a period of about 680 μs of not using the AVX upper bits (255:128) or AVX-512 upper bits (511:256), the processor enters a mode where using those bits again requires at least a voltage transition, and sometimes a frequency transition.

The processor continues executing instructions during a voltage transition, but at a greatly reduced speed: 1/4th the usual instruction dispatch rate. However, this throttling is fine-grained: it only applies when wide instructions are in flight (details).

Voltage transitions end when the voltage reaches the desired level; the duration depends on the magnitude of the transition, but 8 to 20 μs is common on the hardware I tested.

In some cases a frequency transition is also required, e.g., because the involved instruction requires a higher power license. These transitions seem to first incur a throttling period similar to a voltage-only transition, and then a halted period of 8 to 10 μs while the frequency changes.

A key motivator for this post was to give concrete, quantitative guidance on how to write code that is as fast as possible given this behavior. That got bumped to part 2.

We also summarize the key timings in this beautifully rendered table:

What                   Time          Description
Voltage Transition     ~8 to 20 μs   Time required for a voltage transition; depends on frequency
Frequency Transition   ~10 μs        Time required for the halted part of a frequency transition
Relaxation Period      ~680 μs       Time required to go back to a lower power license, measured from the last instruction requiring the higher license

Thanks

Daniel Lemire who provided access to the AVX-512 system I used for testing.

David Kanter of RWT for a fruitful discussion on power and voltage management in modern chips.

RWT forum members anon³, Ray, Etienne, Ricardo B, Foyle and Tim McCaffrey who provided feedback on this post and helped me understand the VR landscape for recent Intel chips.

Alexander Monakov, Kharzette and Justin Lebar for finding typos.

Jeff Smith for teaching me about spread spectrum clocking.

Discuss

Discussion on Hacker News, Twitter and lobste.rs.

Direct feedback also welcomed by email or as a GitHub issue.

If you liked this post, check out the homepage for others you might enjoy.



