On Twitter, Paul Khuong asks: Has anyone done a study on the distribution of interrupt points in OOO processors?

Personally, I’m not aware of any such study for modern x86, and I have also wondered the same thing. In particular, when a CPU receives an externally triggered interrupt, at what point in the instruction stream is the CPU interrupted?

For a simple 1-wide, in-order, non-pipelined CPU the answer might be as simple as: the CPU is interrupted either before or after the instruction that is currently executing. For anything more complicated, it’s not going to be easy. On a modern out-of-order processor there may be hundreds of instructions in flight at any time, some waiting to execute, a dozen or more currently executing, and others waiting to retire. From all these choices, which instruction will be chosen as the victim?

Among other reasons, the answer is interesting because it helps us understand how useful the exact interrupt position is when profiling via interrupts: can we extract useful information from the precise instruction position, or should we only trust it at a coarser granularity (e.g., over regions of, say, hundreds of instructions)?

So let’s go figure out how interruption works, at least on my Skylake i7-6700HQ, by compiling a bunch of small pure-asm programs and running them. The source for all the tests is available in the associated git repo so you can follow along or write your own tests. All the tests are written in assembly because we want full control over the instructions and because they are all short and simple. In any case, we can’t avoid assembly-level analysis when talking about what instructions get interrupted.

First, let’s take a look at some asm that doesn’t have any instruction that sticks out in any way at all, just a bunch of mov instructions. The key part of the source looks like this:

.loop:
%rep 10
    mov eax, 1
    mov ebx, 2
    mov edi, 3
    mov edx, 4
    mov r8d, 5
    mov r9d, 6
    mov r10d, 7
    mov r11d, 8
%endrep
dec rcx
jne .loop

Just constant moves into registers, 8 of them repeated 10 times. This code executes with an expected and measured IPC of 4.

Next, we get to the meat of the investigation. We run the binary using perf record -e task-clock ./indep-mov, which will periodically interrupt the process and record the IP. Then we examine the interrupted locations with perf report. Here’s the output (hereafter, I’m going to cut out the header and just show the samples):

Samples | Source code & Disassembly of indep-mov for task-clock (1769 samples, percent: local period)
-----------------------------------------------------------------------------------------------------------
     : Disassembly of section .text:
     :
     : 00000000004000ae <_start.loop>:
     : _start.loop():
     : indep-mov.asm:15
  16 : 4000ae: mov eax,0x1
  15 : 4000b3: mov ebx,0x2
  22 : 4000b8: mov edi,0x3
  25 : 4000bd: mov edx,0x4
  14 : 4000c2: mov r8d,0x5
  19 : 4000c8: mov r9d,0x6
  25 : 4000ce: mov r10d,0x7
  18 : 4000d4: mov r11d,0x8
  22 : 4000da: mov eax,0x1
  24 : 4000df: mov ebx,0x2
  20 : 4000e4: mov edi,0x3
  29 : 4000e9: mov edx,0x4
  28 : 4000ee: mov r8d,0x5
  18 : 4000f4: mov r9d,0x6
  21 : 4000fa: mov r10d,0x7
  19 : 400100: mov r11d,0x8
  26 : 400106: mov eax,0x1
  18 : 40010b: mov ebx,0x2
  29 : 400110: mov edi,0x3
  19 : 400115: mov edx,0x4

The first column shows the number of interrupts received for each instruction. More precisely, it shows the number of times each instruction was the next instruction to execute following an interrupt.

Without doing any deep statistical analysis, I don’t see any particular pattern here. Every instruction gets its time in the sun. Some locations have somewhat higher counts than others, but if you repeat the measurement, the higher counts don’t necessarily land in the same places.

We can try the exact same thing, but with add instructions like this:

add eax, 1
add ebx, 2
add edi, 3
add edx, 4
add r8d, 5
add r9d, 6
add r10d, 7
add r11d, 8

We expect the execution behavior to be similar to the mov case: we do have dependency chains here, but they are 8 separate ones (one per destination register) of 1-cycle instructions, so there should be little practical impact. Indeed, the results are basically identical to the last experiment, so I won’t show them here (you can see them yourself with the indep-add test).

Let’s get moving here and try something more interesting. This time we will again use all add instructions, but two of the adds will depend on each other, while the other two will be independent. So the chain shared by those two adds will be twice as long (2 cycles) as the other chains (1 cycle each). Like this:

add rax, 1 ; 2-cycle chain
add rax, 2 ; 2-cycle chain
add rsi, 3
add rdi, 4

Here the chain through rax should limit the throughput of the repeated block to one iteration per 2 cycles, and indeed I measure an IPC of 2 (4 instructions / 2 cycles = 2 IPC).

Here’s the interrupt distribution:

  0 : 4000ae: add rax,0x1
 82 : 4000b2: add rax,0x2
112 : 4000b6: add rsi,0x3
  0 : 4000ba: add rdi,0x4
  0 : 4000be: add rax,0x1
 45 : 4000c2: add rax,0x2
144 : 4000c6: add rsi,0x3
  0 : 4000ca: add rdi,0x4
  0 : 4000ce: add rax,0x1
 44 : 4000d2: add rax,0x2
107 : 4000d6: add rsi,0x3
(pattern repeats...)

This is certainly something new. We see that all the interrupts fall on the middle two instructions, one of which is part of the addition chain and one which is not. The second of the two locations also gets about 2-3 times as many interrupts as the first.

A Hypothesis

Let’s make a hypothesis now so we can design more tests.

Let’s guess that interrupts select the instruction which is the oldest unretired instruction, and that this selected instruction is allowed to complete, hence samples fall on the next instruction (let us call this next instruction the sampled instruction). I am making the distinction between selected and sampled instructions, rather than just saying “interrupts sample instructions that follow the oldest unretired instruction”, because we are going to build our model almost entirely around the selected instructions, so we want to name them. The characteristics of the ultimately sampled instructions (other than their position after selected instructions) hardly matter.

Without a more detailed model of instruction retirement we can’t yet explain everything we see, but the basic idea is that instructions that take longer, and hence are more likely to be the oldest unretired instruction, are the ones that get sampled. In particular, if there is a critical dependency chain, instructions in that chain are likely to be sampled at some point.

Let’s take a look at some more examples. I’m going to switch to using mov rax, [rax] as my long-latency instruction (a load with 4-cycle latency) and nop as the filler instruction that isn’t part of any chain. Don’t worry: nop has to allocate and retire just like any other instruction; it simply gets to skip execution. You can build all these examples with a real instruction like add and they’ll work the same way.

Let’s take a look at a load followed by 10 nops:

.loop:
%rep 10
    mov rax, [rax]
    times 10 nop
%endrep
dec rcx
jne .loop

The result:

  0 : 4000ba: mov rax,QWORD PTR [rax]
 33 : 4000bd: nop
  0 : 4000be: nop
  0 : 4000bf: nop
  0 : 4000c0: nop
 11 : 4000c1: nop
  0 : 4000c2: nop
  0 : 4000c3: nop
  0 : 4000c4: nop
 22 : 4000c5: nop
  0 : 4000c6: nop
  0 : 4000c7: mov rax,QWORD PTR [rax]
 15 : 4000ca: nop
  0 : 4000cb: nop
  0 : 4000cc: nop
  0 : 4000cd: nop
 13 : 4000ce: nop
  1 : 4000cf: nop
  0 : 4000d0: nop
  0 : 4000d1: nop
 35 : 4000d2: nop
  0 : 4000d3: nop
  0 : 4000d4: mov rax,QWORD PTR [rax]
 16 : 4000d7: nop
  0 : 4000d8: nop
  0 : 4000d9: nop
  0 : 4000da: nop
 14 : 4000db: nop
  0 : 4000dc: nop
  0 : 4000dd: nop
  0 : 4000de: nop
 31 : 4000df: nop
  0 : 4000e0: nop
  0 : 4000e1: mov rax,QWORD PTR [rax]
 22 : 4000e4: nop
  0 : 4000e5: nop
  0 : 4000e6: nop
  0 : 4000e7: nop
 16 : 4000e8: nop
  0 : 4000e9: nop
  0 : 4000ea: nop
  0 : 4000eb: nop
 24 : 4000ec: nop
  0 : 4000ed: nop

The selected instructions are the long-latency movs in the chain, but also two specific nop instructions out of the 10 that follow: those which fall 4 and 8 instructions after the mov. Here we can see the impact of retirement throughput. Although the mov chain is the only thing that contributes to execution latency, this Skylake CPU can only retire up to 4 instructions per cycle (per thread). So when the mov finally retires, there will be two cycles spent retiring blocks of 4 nop instructions before we get to the next mov:

                           ; retire cycle
mov rax, QWORD PTR [rax]   ; 0 (execution limited)
nop                        ; 0
nop                        ; 0
nop                        ; 0
nop                        ; 1 <-- selected nop
nop                        ; 1
nop                        ; 1
nop                        ; 1
nop                        ; 2 <-- selected nop
nop                        ; 2
nop                        ; 2
mov rax, QWORD PTR [rax]   ; 4 (execution limited)
nop                        ; 4
nop                        ; 4

In this example, the retirement of the mov instructions is “execution limited”, i.e., their retirement cycle is determined by when they finish executing, not by any details of the retirement engine. The retirement of the other instructions, on the other hand, is determined by the retirement behavior: they are “ready” early but cannot retire because the in-order retirement pointer hasn’t reached them yet (it is held up waiting for the mov to execute).

So those selected nop instructions aren’t particularly special: they aren’t slower than the rest and they aren’t causing any bottleneck. They are simply selected because the pattern of retirement following the mov is predictable. Note also that it is the first nop in each group of 4 that retires together that is selected.

This means that we can construct an example where an instruction on the critical path never gets selected:

%rep 10
    mov rax, [rax]
    nop
    nop
    nop
    nop
    nop
    add rax, 0
%endrep

The results:

  0 : 4000ba: mov rax,QWORD PTR [rax]
 94 : 4000bd: nop
  0 : 4000be: nop
  0 : 4000bf: nop
  0 : 4000c0: nop
 17 : 4000c1: nop
  0 : 4000c2: add rax,0x0
  0 : 4000c6: mov rax,QWORD PTR [rax]
 78 : 4000c9: nop
  0 : 4000ca: nop
  0 : 4000cb: nop
  0 : 4000cc: nop
 18 : 4000cd: nop
  0 : 4000ce: add rax,0x0

The add instruction is on the critical path: it increases the execution time of the block from 4 cycles to 6 cycles, yet it is never selected. The retirement pattern looks like:

                 /- scheduled
                 |  /- ready
                 |  |  /- complete
                 |  |  |  /- retired
mov rax, [rax] ; 0  0  5  5  <-- selected
nop            ; 0  0  0  5  <-- sampled
nop            ; 0  0  0  5
nop            ; 0  0  0  5
nop            ; 1  1  1  6  <-- selected
nop            ; 1  1  1  6  <-- sampled
add rax, 0     ; 1  5  6  6
mov rax, [rax] ; 1  6 11 11  <-- selected
nop            ; 2  2  2 11  <-- sampled
nop            ; 2  2  2 11
nop            ; 2  2  2 11
nop            ; 2  2  2 12  <-- selected
nop            ; 3  3  3 12  <-- sampled
add rax, 0     ; 3 11 12 12
mov rax, [rax] ; 3 12 17 17  <-- selected
nop            ; 3  3  3 17  <-- sampled

On the right hand side, I’ve annotated each instruction with several key cycle values, described below.

The scheduled column indicates when the instruction enters the scheduler and hence could execute if all its dependencies were met. This column is very simple: we assume that there are no front-end bottlenecks and hence we schedule (aka “allocate”) 4 instructions every cycle. This part is in-order: instructions enter the scheduler in program order.

The ready column indicates when all dependencies of a scheduled instruction have executed and hence the instruction is ready to execute. In this simple model, an instruction always begins executing when it is ready. A more complicated model would also need to account for contention for execution ports, but here we don’t have any such contention. Instruction readiness occurs out of order: you can see that many instructions become ready before older instructions (e.g., the nop instructions are generally ready before the preceding mov or add instructions). To calculate this column, take the maximum of this instruction’s scheduled cycle and the complete cycles of all previous instructions whose outputs are inputs to this instruction.

The complete column indicates when an instruction finishes execution. In this model it simply takes the value of the ready column plus the instruction latency, which is 0 for the nop instructions (they don’t execute at all, so they have 0 effective latency), 1 for the add instruction and 5 for the mov . Like the ready column this happens out of order.

Finally, the retired column, what we’re really after, shows when the instruction retires. The rule is fairly simple: an instruction cannot retire until it is complete, the instruction before it must have retired or be retiring in the same cycle, and no more than 4 instructions can retire per cycle. As a consequence of the “previous instruction must have retired” part, this column is non-decreasing, and so, like the scheduled column, retirement is in order.

Once we have the retired column filled out, we can identify the <-- selected instructions: they are the ones where the retirement cycle increases. In this case, the selected instructions are always either the mov instruction (because of its long latency, it holds up retirement) or the fourth nop after the mov (because of the “only retire 4 per cycle” rule, this nop is at the head of the group that retires in the cycle after the mov retires). Finally, the sampled instructions, which are the ones that actually show up in the interrupt report, are simply the instructions following each selected instruction.
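The rules above are mechanical enough to simulate. Here is a small Python sketch of the model (my own reconstruction, not a tool from the associated repo): width-4 allocation and retirement, with the latencies used in the chart (5 for the load, 1 for the add, 0 for nop), and dependencies tracked through a single register.

```python
WIDTH = 4  # allocation and retirement width (per thread, on this Skylake)

def simulate(instrs):
    """instrs: list of (name, latency, src, dst); src/dst are register
    names or None. Returns rows of (name, scheduled, ready, complete,
    retired) cycle values, following the model described in the text."""
    n = len(instrs)
    complete = [0] * n
    retired = [0] * n
    writer = {}  # register -> cycle in which its last writer completes
    rows = []
    for i, (name, lat, src, dst) in enumerate(instrs):
        scheduled = i // WIDTH                 # 4 allocated per cycle, in order
        ready = scheduled
        if src in writer:
            ready = max(ready, writer[src])    # wait for the producer's result
        complete[i] = ready + lat
        r = complete[i]
        if i >= 1:
            r = max(r, retired[i - 1])         # in-order retirement
        if i >= WIDTH:
            r = max(r, retired[i - WIDTH] + 1)  # at most 4 retire per cycle
        retired[i] = r
        if dst is not None:
            writer[dst] = complete[i]
        rows.append((name, scheduled, ready, complete[i], retired[i]))
    return rows

# the mov / 5 nops / add block from the example above
block = ([("mov rax,[rax]", 5, "rax", "rax")]
         + [("nop", 0, None, None)] * 5
         + [("add rax,0", 1, "rax", "rax")])
rows = simulate(block * 3)

prev = -1
for name, s, rd, c, rt in rows:
    mark = "  <-- selected" if rt > prev else ""  # retirement cycle jumped
    print(f"{name:14}; {s:2} {rd:2} {c:2} {rt:2}{mark}")
    prev = rt
```

Running this reproduces the cycle chart above: the retired column goes 5, 5, 5, 5, 6, 6, 6, 11, ... and the selected instructions are exactly the movs and every fourth nop.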

Here, the add is never selected because it executes in the cycle after the mov completes, so it is eligible to retire in the cycle after the mov retires: it doesn’t hold up retirement, and so behaves no differently than a nop for retirement purposes. We can change the position of the add slightly, so that it falls in the same 4-instruction retirement window as the mov, like this:

%rep 10
    mov rax, [rax]
    nop
    nop
    add rax, 0
    nop
    nop
    nop
%endrep

We’ve only slid the add up a few places. The number of instructions is the same and this block executes in 6 cycles, identical to the last example. However, the add instruction now always gets selected:

 21 : 4000ba: mov rax,QWORD PTR [rax]
 54 : 4000bd: nop
  0 : 4000be: nop
  0 : 4000bf: add rax,0x0
 15 : 4000c3: nop
  0 : 4000c4: nop
  0 : 4000c5: nop
  0 : 4000c6: mov rax,QWORD PTR [rax]
 88 : 4000c9: nop
  0 : 4000ca: nop
  0 : 4000cb: add rax,0x0
 14 : 4000cf: nop
  0 : 4000d0: nop
  0 : 4000d1: nop
  0 : 4000d2: mov rax,QWORD PTR [rax]
 91 : 4000d5: nop
  0 : 4000d6: nop
  0 : 4000d7: add rax,0x0
 13 : 4000db: nop

Here’s the cycle analysis and retirement pattern for this version:

                 /- scheduled
                 |  /- ready
                 |  |  /- complete
                 |  |  |  /- retired
mov rax, [rax] ; 0  0  5  5  <-- selected
nop            ; 0  0  0  5  <-- sampled
nop            ; 0  0  0  5
add rax, 0     ; 0  5  6  6  <-- selected
nop            ; 1  1  1  6  <-- sampled
nop            ; 1  1  1  6
nop            ; 1  1  1  6
mov rax, [rax] ; 1  6 11 11  <-- selected
nop            ; 2  2  2 11  <-- sampled
nop            ; 2  2  2 11
add rax, 0     ; 2 11 12 12  <-- selected
nop            ; 2  2  2 12  <-- sampled
nop            ; 3  3  3 12
nop            ; 3  3  3 12
mov rax, [rax] ; 3 12 17 17  <-- selected
nop            ; 3  3  3 17  <-- sampled

Now might be a good time to note that we also care about the actual sample counts, and not just their presence or absence. Here, the samples associated with the mov are more frequent than the samples associated with the add . In fact, there are about 4.9 samples for mov for every sample for add (calculated over the full results). That lines up almost exactly with the mov having a latency of 5 and the add a latency of 1: the mov will be the oldest unretired instruction 5 times as often as the add . So the sample counts are very meaningful in this case.

Going back to the cycle charts, we know the selected instructions are those where the retirement cycle increases. To that we add that the size of the increase determines their selection weight: the mov instruction has a weight of 5, since it jumps (for example) from 6 to 11 so it is the oldest unretired instruction for 5 cycles, while the nop instructions have a weight of 1.
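As a sanity check on that claim, we can read the selection weights straight off the retired column of the chart above. The snippet below (my own transcription of two blocks’ worth of the chart, not output from the repo) treats every jump in the retired cycle as a selection and the size of the jump as the weight:

```python
# instruction names and retired-cycle column, transcribed from two
# blocks of the cycle chart above
names   = ["mov", "nop", "nop", "add", "nop", "nop", "nop"] * 2
retired = [5, 5, 5, 6, 6, 6, 6, 11, 11, 11, 12, 12, 12, 12]

weights = {}
prev = 0
for name, r in zip(names, retired):
    if r > prev:                      # retirement cycle jumped: selected
        weights[name] = weights.get(name, 0) + (r - prev)
    prev = r

print(weights)
```

This yields a weight of 10 for the movs and 2 for the adds over the two blocks, i.e., the 5:1 per-block ratio that matches the measured ~4.9:1 sample ratio.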

This lets you measure the latency of various instructions in situ, as long as you account for the retirement behavior. For example, measuring the following block (rdx is zero at runtime):

mov rax, [rax]
mov rax, [rax + rdx]

results in samples accumulating in a 4:5 ratio for the first and second lines, reflecting the fact that the second load has a latency of 5 cycles due to complex addressing, while the first load takes only 4 cycles.

Branches

What about branches? I don’t find anything special about branches: whether taken or untaken they seem to retire normally and fit the pattern described above. I am not going to show the results but you can play with the branches test yourself if you want.

Atomic Operations

What about atomic operations? Here the story does get interesting.

I’m going to use lock add QWORD [rbx], 1 as my default atomic instruction, but the story seems similar for all of them. Alone, this instruction has a “latency” of 18 cycles. Let’s put it in parallel with a couple of SIMD instructions that have a total latency of 20 cycles alone:

vpmulld xmm0, xmm0, xmm0
vpmulld xmm0, xmm0, xmm0
lock add QWORD [rbx], 1

This loop still takes 20 cycles to execute. That is, the atomic costs nothing in runtime: the performance is the same if you comment it out. The vpmulld dependency chain is long enough to hide the cost of the atomic. Let’s take a look at the interrupt distribution for this code:

  0 : 4000c8: vpmulld xmm0,xmm0,xmm0
 12 : 4000cd: vpmulld xmm0,xmm0,xmm0
 10 : 4000d2: lock add QWORD PTR [rbx],0x1
244 : 4000d7: vpmulld xmm0,xmm0,xmm0
  0 : 4000dc: vpmulld xmm0,xmm0,xmm0
 26 : 4000e1: lock add QWORD PTR [rbx],0x1
299 : 4000e6: vpmulld xmm0,xmm0,xmm0
  0 : 4000eb: vpmulld xmm0,xmm0,xmm0
 35 : 4000f0: lock add QWORD PTR [rbx],0x1
277 : 4000f5: vpmulld xmm0,xmm0,xmm0
  0 : 4000fa: vpmulld xmm0,xmm0,xmm0
 33 : 4000ff: lock add QWORD PTR [rbx],0x1
302 : 400104: vpmulld xmm0,xmm0,xmm0
  0 : 400109: vpmulld xmm0,xmm0,xmm0
 33 : 40010e: lock add QWORD PTR [rbx],0x1
272 : 400113: vpmulld xmm0,xmm0,xmm0
  0 : 400118: vpmulld xmm0,xmm0,xmm0
 31 : 40011d: lock add QWORD PTR [rbx],0x1
280 : 400122: vpmulld xmm0,xmm0,xmm0
  0 : 400127: vpmulld xmm0,xmm0,xmm0
 40 : 40012c: lock add QWORD PTR [rbx],0x1
277 : 400131: vpmulld xmm0,xmm0,xmm0
  0 : 400136: vpmulld xmm0,xmm0,xmm0
 21 : 40013b: lock add QWORD PTR [rbx],0x1
282 : 400140: vpmulld xmm0,xmm0,xmm0
  0 : 400145: vpmulld xmm0,xmm0,xmm0
 35 : 40014a: lock add QWORD PTR [rbx],0x1
291 : 40014f: vpmulld xmm0,xmm0,xmm0
  0 : 400154: vpmulld xmm0,xmm0,xmm0
 35 : 400159: lock add QWORD PTR [rbx],0x1
270 : 40015e: dec rcx
  0 : 400161: jne 4000c8 <_start.loop>

The lock add instructions are selected by the interrupt close to 90% of the time, despite not contributing to the execution time. Based on our mental model, these instructions should be able to run ahead of the vpmulld chain and hence be ready to retire as soon as they reach the head of the ROB. The effect we see here occurs because lock-prefixed instructions execute at retirement: this special type of instruction waits until it is at the head of the ROB before it executes.

So this instruction will always spend a certain minimum amount of time as the oldest unretired instruction: it never retires immediately, regardless of the surrounding instructions. In this case, it means retirement spends most of its time waiting for the locked instructions, while execution spends most of its time waiting for the vpmulld chain. Note that the retirement time added by the locked instructions isn’t additive with that from vpmulld: the time spent waiting for the locked instruction to retire is subtracted from the time that would otherwise be spent waiting for the multiplications to retire. That’s why you end up with a lopsided split, not 50/50. We can see this more clearly if we double the number of multiplications to 4:

vpmulld xmm0, xmm0, xmm0
vpmulld xmm0, xmm0, xmm0
vpmulld xmm0, xmm0, xmm0
vpmulld xmm0, xmm0, xmm0
lock add QWORD [rbx], 1

This takes 40 cycles to execute, and the interrupt pattern looks like:

  0 : 4000c8: vpmulld xmm0,xmm0,xmm0
 18 : 4000cd: vpmulld xmm0,xmm0,xmm0
 49 : 4000d2: vpmulld xmm0,xmm0,xmm0
133 : 4000d7: vpmulld xmm0,xmm0,xmm0
152 : 4000dc: lock add QWORD PTR [rbx],0x1
263 : 4000e1: vpmulld xmm0,xmm0,xmm0
  0 : 4000e6: vpmulld xmm0,xmm0,xmm0
 61 : 4000eb: vpmulld xmm0,xmm0,xmm0
160 : 4000f0: vpmulld xmm0,xmm0,xmm0
168 : 4000f5: lock add QWORD PTR [rbx],0x1
251 : 4000fa: vpmulld xmm0,xmm0,xmm0
  0 : 4000ff: vpmulld xmm0,xmm0,xmm0
 59 : 400104: vpmulld xmm0,xmm0,xmm0
166 : 400109: vpmulld xmm0,xmm0,xmm0
162 : 40010e: lock add QWORD PTR [rbx],0x1
267 : 400113: vpmulld xmm0,xmm0,xmm0
  0 : 400118: vpmulld xmm0,xmm0,xmm0
 60 : 40011d: vpmulld xmm0,xmm0,xmm0
160 : 400122: vpmulld xmm0,xmm0,xmm0
155 : 400127: lock add QWORD PTR [rbx],0x1
218 : 40012c: vpmulld xmm0,xmm0,xmm0
  0 : 400131: vpmulld xmm0,xmm0,xmm0
 58 : 400136: vpmulld xmm0,xmm0,xmm0
144 : 40013b: vpmulld xmm0,xmm0,xmm0
154 : 400140: lock add QWORD PTR [rbx],0x1
(pattern continues...)

The sample counts are similar for lock add, but they’ve now increased to a comparable total for the vpmulld instructions. In fact, we can calculate how long the lock add instruction takes to retire, using the ratio of the times it was selected compared to the known block throughput of 40 cycles. I get about a 38%-40% rate over a couple of runs, which corresponds to a retire time of 15-16 cycles, only slightly less than the back-to-back latency of this instruction.
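The arithmetic behind that estimate is worth spelling out (a back-of-envelope sketch using the numbers from my runs above):

```python
block_cycles = 40        # measured execution time of one loop block
selected_rate = 0.39     # share of samples whose selected instruction was
                         # the lock add (38%-40% across runs)

# If the lock add is the oldest unretired instruction for k cycles out of
# every 40, it should be selected in k/40 of the samples, so:
retire_stall = selected_rate * block_cycles
print(f"lock add holds retirement for ~{retire_stall:.1f} cycles")
```

This gives roughly 15.6 cycles, in line with the 15-16 cycle figure quoted above.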

Can we do anything with this information?

Well, one idea is that it lets us fairly precisely map out the retirement timing of instructions. For example, we can set up an instruction to test, plus a parallel series of instructions with a known latency. Then we observe what is selected by interrupts: the instruction under test, or the end of the known-latency chain. Whichever is selected has the longer retirement latency, and the known-latency chain can be adjusted to narrow it down exactly.

Of course, this sounds way harder than the usual way of measuring latency (a long series of back-to-back instructions), but it does let us measure some things “in situ” without a long chain, and we can measure instructions that don’t have an obvious way to chain (e.g., instructions with no output, like stores, or different-domain instructions).

Some Things That I Didn’t Get To

Explain the variable (non-self-synchronizing) results in terms of retire window patterns

Check interruptible instructions

Check mfence and friends

Check the execution effect of atomic instructions (e.g., blocking load ports)

Feedback of any type is welcome. I don’t have a comments system yet, so as usual I’ll outsource discussion to this HackerNews thread.

Thanks and Attribution

Thanks to HN user rrss for pointing out errors in my cycle chart.

Intel 8259 image by Wikipedia user German under CC BY-SA 3.0.

If you liked this post, check out the homepage for others you might enjoy.



