Tacit Murky wrote:

> Looks like it's about the number of renamed registers.

Agreed. Simply changing Nathan's loops to use an immediate instead of a register for `max` produces a dramatic speedup on HSW:

Nathan's 2 micro / 2 macro on my HSW: one iteration per 1.42275c (~4.21 unfused-domain uops per clock). Very consistent, +- 0.0001 cycles per iter (for 1G iterations).

cmp r,imm instead of cmp r,max for both compares: one iteration per ~1.12c (~5.35 unfused-domain uops per clock). Pretty noisy, from 1.116c to 1.124c per iter.

My Skylake results match Nathan's: this bottleneck is gone, so the loop always runs at 1.0 cycles per iteration (6 unfused-domain uops / clock).

HSW and SKL were measured with `perf stat` on Linux 4.8 and 4.10, counting only user-space events for a statically linked binary. With 10^9 iterations on an otherwise-idle system, this is an easy way to get accurate low-noise numbers. Skylake isn't perfect: some runs are as bad as 1.02c / iter (for these and other loops). I think this is due to settling into a sub-optimal pattern rather than measurement noise, at least in some cases.

IDK why my HSW result is so much faster than Nathan's (1.42c instead of 1.73c). I measured on an i5-4210U (HSW) and an i7-6700k (SKL), with HT enabled but inactive (no other processes running). I still get stable and matching results even with max=10^7. The top of my loop is 32B-aligned, and both memory addresses are 64B-aligned.

I haven't tried to construct a loop that reads even more registers per clock, e.g. 3-operand FMA with a micro-fused memory operand, unrolled with different registers to avoid a latency bottleneck. Or ADC (flag input as well as flag output).

> Hwl can't allow more than 5 unfused µIPC.

That's not right. With a somewhat artificial example, I can get HSW to sustain 6 unfused-domain uops per ~1.00 clocks (see below). It seems more like a register-read limit, since reducing the number of input registers makes it run faster. For example, changing a macro-fused cmp/jcc to an inc helps. That might also be partly from reduced resource conflicts (the not-taken branch stealing cycles on p6), but maybe not, because Skylake doesn't have that problem.

    .loop:                  ;; runs at 1.053c per iter on HSW
        add   rax, [rdi]
        inc   ebx
        blsi  rdx, [rsp]
        dec   ecx           ; ecx = max to start.
        jnz   .loop

Predicted-not-taken CMP r,r/JCC has 2 inputs, 1 output (just flags).
INC r has 1 input, 2 outputs (r and partial-flags).

With an ADD r,m instead of BLSI r,m, the loop runs at 1.08c per iteration on HSW. (Still about 1.00c on SKL). BLSI's destination register is write-only, unlike ADD's. This is also one fewer loop-carried dep chain, which may be significant.

Replacing both ADDs with BLSI slows it down (to 1.076c per iter on HSW, 1.05c per iter on SKL), presumably because of imperfect scheduling leading to resource conflicts, since BLSI can only run on p15.

I got a slowdown on HSW and SKL from using imul r,m,imm to replace the second ADD, which is weird because its destination is write-only and out-of-order execution should easily hide its 3c latency. Presumably resource conflicts for p1 are a problem. SKL: 1.29c to 1.55c (highly variable). HSW: more stable around 1.455c +- 0.05. IMUL writes flags, but BLSI doesn't.

(Using add ebx,1 instead of inc didn't help, but using test ebx,ebx instead of inc did speed it up to about 1.18c on both HSW and SKL. I guess having 1 duplicated input and 1 output instead of 2 does help!)

---

With a somewhat artificial example, I can hit 1.005c on HSW (still not as fast as SKL's 1.0005c best-case for this: 10 times as far away from 1c per iter). Perhaps HSW is hitting PRF limitations. Using a micro-fused AVX instruction splits things between the integer and vector PRFs.

    .loop:
        vpaddd  xmm0, xmm0, [rdi]
        test    ebx, ebx
        test    rdx, [rsp]
        dec     ecx
        jnz     .loop

Strangely, VPABSD xmm, m (write-only destination) was slower (1.04c) than VPADDD xmm0,xmm0,m (read-modify-write dest). This might be from resource conflicts, since it's also slower on SKL (1.004c to 1.015c). It's odd because HSW runs it on the same two ports as VPADDD. (SKL runs it on p01, but VPADDD on p015).

Avoiding the loop-carried dependency with VPADDD xmm0, xmm1, [rdi] was slightly slower on HSW (1.043c) than VPADDD xmm0,xmm0,[rdi], which smells like a register-read bottleneck on reading "cold" registers from the PRF.
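As a quick check on the arithmetic behind the uops-per-clock figures quoted so far, here is a minimal Python sketch. It assumes every one of these loops is 6 unfused-domain uops per iteration (the count this post uses for them); the function names and the labels in the table are made up for illustration, not taken from any tool.

```python
# Assumption: each loop iteration is 6 unfused-domain uops
# (2 micro-fused load+ALU pairs = 4 uops, plus 2 single-uop instructions,
# counting a macro-fused cmp/jcc or dec/jnz as one uop).
UOPS_PER_ITER = 6


def cycles_per_iter(total_cycles, iterations):
    """Cycles per iteration from a cycle count (e.g. perf stat) and a trip count."""
    return total_cycles / iterations


def uops_per_clock(c_per_iter, uops_per_iter=UOPS_PER_ITER):
    """Unfused-domain uops sustained per clock for a measured c/iter."""
    return uops_per_iter / c_per_iter


# Cycles-per-iteration numbers quoted in the post.
measurements = [
    ("HSW, cmp r,max", 1.42275),
    ("HSW, cmp r,imm", 1.12),
    ("HSW, add/inc/blsi loop", 1.053),
    ("HSW, vpaddd loop", 1.005),
    ("SKL, best case", 1.0),
]

for label, c in measurements:
    print(f"{label:24s} {c:.5f} c/iter -> {uops_per_clock(c):.2f} uops/clock")
```

With two-decimal rounding the first two rows come out as 4.22 and 5.36; the ~4.21 and ~5.35 quoted above are the same quotients truncated rather than rounded.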
Non-loop-carried dependency chains between two instructions in the loop seem to prevent it from running at 1c per iteration, even on SKL. (e.g. test ecx,ecx is a problem when ecx was written by the macro-fused loop-branch dec ecx/jnz, slowing HSW down to 1.068c).

----

Using indexed addressing modes makes it run slower even on SKL. (But micro-fusion still happens on both HSW and SKL. Apparently un-lamination before the IDQ for indexed addressing modes only applies to SnB/IvB, not HSW! We already knew it didn't apply to SKL, but I had been assuming that change was new with SKL. I only got a HSW perf-counter test setup this week.)

    ; rsi=r8=0
    ; rsp and rdi are both 64B aligned.  rdi points into the BSS, in case that matters.
    .loop:
        add   rdx, [rsp+rsi*4]
        cmp   r11, r12
        jne   .end          ; never taken, r11==r12
        add   ebx, [rdi+r8*4]
        sub   ecx, r9d      ; alternatively, sub ecx,1 to replace a reg with an immediate
        jnz   .loop
    .end:

Notice that although this is very similar to Nathan's two_micro_two_macro, there are no dependencies between any of the fused-domain uops. The loop-exit condition is just from decrementing ecx with a macro-fused uop.

This reads 7 "cold" registers (addressing modes, r11, r12, and r9), and 3 hot registers (rdx, ebx, and ecx) per iteration. It writes the 3 hot registers once each, and flags 4 times.

SKL runs it at 1.5566c / iter. Input registers per clock: 6.42 total, 4.50 cold, 1.93 hot. Total non-flag regs read+written per cycle: 8.35 = 13/1.5566. There's clearly a bottleneck, but IDK what it is. Touching fewer regs in the other 2 fused-domain uops makes it possible to micro-fuse indexed addressing modes and still run at 1c / iter on SKL.

HSW runs it at 1.6327c per iteration. Input registers per clock: 6.12 total, 4.29 cold, 1.83 hot. Total non-flag regs read+written per cycle: 7.96 = 13/1.6327.

uops_issued.stall_cycles shows that the front-end stalled instead of issuing a group of less than 4 (on HSW and SKL).
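The register-traffic numbers above are just per-iteration counts divided by the measured cycles per iteration. A small Python sketch of that bookkeeping follows. The 7 cold / 3 hot / 3 written split is the one stated in the post; the helper name `regs_per_clock` and the dictionary keys are hypothetical.

```python
def regs_per_clock(cold, hot, written, c_per_iter):
    """Per-clock register traffic from per-iteration counts.

    cold:    input registers not written inside the loop
    hot:     input registers that the loop itself writes
    written: non-flag registers written per iteration
    """
    reads = cold + hot
    return {
        "total_in": reads / c_per_iter,
        "cold_in": cold / c_per_iter,
        "hot_in": hot / c_per_iter,
        "read_plus_written": (reads + written) / c_per_iter,
    }


# Counts for the indexed-addressing loop:
#   cold: rsp, rsi, rdi, r8, r11, r12, r9   (7)
#   hot:  rdx, ebx, ecx                     (3), each also written once
for name, c in [("SKL", 1.5566), ("HSW", 1.6327)]:
    rates = regs_per_clock(cold=7, hot=3, written=3, c_per_iter=c)
    print(name, {k: round(v, 2) for k, v in rates.items()})
```

Keeping cold and hot inputs as separate arguments makes it easy to re-run the same accounting for a modified loop by changing only the counts and the measured c/iter.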
Reducing the number of cold input regs in different ways has different effects, so it's not as simple as just a bottleneck on that.

Changing the addressing mode on the second add to just [rdi], HSW runs it at 1.631c / iter. (very slightly faster than indexed)

Changing the CMP r11,r12 to TEST r11,r11 has no effect (same 1.6327c / iter)

Changing the CMP r11,r12 to CMP r11, 0 speeds it up to 1.594c / iter.

Changing the CMP r11,r12 to CMP r9d, 1 also speeds it up to 1.594c / iter (even though r9d is also read by sub, so it's not like P6-family register-read stalls where reading the same cold reg twice doesn't use extra resources)

Changing the CMP r11,r12/jne to CMP rdx,0/jl speeds it up to 1.35c / iter. (rdx was written by the previous ADD uop, so this macro-fused uop has no cold inputs anymore)

Using sub ecx,1 instead of sub ecx,r9d, HSW runs it at 1.3895c/iter +-0.0001. Input regs per clock: 6.48 total, 4.31 cold, 2.15 hot. Total non-flags read+written: 8.64/c = 12/1.3895c.

The results of these different changes are similar on SKL; things that speed up HSW significantly also speed up SKL.

I'm not sure if it matters whether input registers are cold or not (read from the PRF vs. forwarded from a not-yet-executed uop), or whether there's a different cause for what I'm seeing. Further testing is needed.

Interesting things that could be tested:

micro-fused FMA with a base+index addressing mode should be a 4-input fused-domain uop. (or maybe this will be unlaminated)

On Skylake, ADCX / ADOX if they micro-fuse. (ADC doesn't, according to the instruction tables). Or even just ADC r,r might be interesting.

Does add r,r matter vs. andn r,r,r? I'm guessing not, since register renaming turns a RMW of an architectural register into a write to a new physical register anyway.
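For anyone extending these tests, it may help to keep the HSW variants of the indexed-addressing loop in one table and compare each against the unmodified loop. The Python sketch below does that with the c/iter numbers quoted earlier in this post; the variant labels and the `saved_cycles` helper are made up for illustration, and the 6 cold / 3 hot / 3 written counts for the sub ecx,1 variant are an assumption inferred from the 12/1.3895 total.

```python
BASELINE = 1.6327  # HSW c/iter for the unmodified indexed-addressing loop

# HSW c/iter for each single change to that loop, as quoted in the post.
variants = {
    "second add uses [rdi]": 1.631,
    "test r11,r11": 1.6327,
    "cmp r11,0": 1.594,
    "cmp r9d,1": 1.594,
    "cmp rdx,0 / jl": 1.35,
    "sub ecx,1": 1.3895,
}


def saved_cycles(c_per_iter, baseline=BASELINE):
    """Cycles per iteration saved relative to the unmodified loop."""
    return baseline - c_per_iter


for name, c in sorted(variants.items(), key=lambda kv: kv[1]):
    pct = 100 * saved_cycles(c) / BASELINE
    print(f"{name:24s} {c:.4f} c/iter  saves {saved_cycles(c):.4f} c ({pct:.1f}%)")

# Register traffic for the sub ecx,1 variant.
# Assumption: dropping r9 leaves 6 cold inputs, 3 hot inputs, 3 written regs.
cold, hot, written = 6, 3, 3
c = variants["sub ecx,1"]
print(f"inputs/clock: {(cold + hot) / c:.2f}, "
      f"read+written/clock: {(cold + hot + written) / c:.2f}")
```

Sorting by c/iter makes it easy to see which changes matter: in this table the two biggest savings come from removing cold inputs from the macro-fused uops, not from simplifying the addressing mode.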