This is a very long, very dense, and very technical foray into CPU architecture. I’ve tried to make this approachable to enthusiast non-professionals, but if you don’t know roughly how CPUs work, you will struggle.

Contents

Introduction The fundamental work of an instruction

— Coarse-Grained Out-of-Order (CG-OoO)

— The Mill Dependencies, not order

1. False dependencies

2. Control Flow

1. Branch instructions impose ordering constraints ...

Idea: Branch Predict and Branch Verify

2. Branches outside the current context are opaque ...

3. Variable-latency instructions Memory

— Baseline approaches

— Mill Computing's approach

Issue 1: Function boundaries still prevent ...

Issue 2: Loads potentially in the critical ...

Issue 3: Converging dataflow

Issue 4: It interacts badly with loop pipelining

— Mill Computing's approach, cont.

— Ideas for a better load

Idea 1: Prodigy-Mill hybrid approach

Idea 2: Restricted Asynchronous Dataflow

Idea 3: Source-agnostic argument pickups

Idea 4: Standard prefetch

— Security Concluding Remarks

Introduction

I’m tired of the dominance of the out-of-order processor. They are large and wasteful, the ever-popular x86 is especially poor, and they are hard to understand. Their voodoo would be more appreciated if they pushed better at the limits of computation, but it’s obvious that the problems people solve have a latent inaccessible parallelism far in excess of what an out-of-order core can extract. The future of computing should surely not rest on terawatts of power burnt to pretend a processor is simpler than it is.

There is some hope in the ideas of upstarts, like Mill Computing and Tachyum, as well as research ideas like CG-OoO. I don’t know if they will ever find success. I wouldn’t bet on it. Heck, the Mill might never even get far enough to have the opportunity to fail. Yet I find them exciting, and much of the offhand “sounds like Itanium” naysay is uninteresting.

This article focuses on architectures in proportion to how much creative, interesting work they’ve shown in public. This means much of this article comments on the Mill architecture, there is a healthy amount on CG-OoO, and the Tachyum is mentioned only in passing. There are more I could have discussed, like TRIPS (an older dataflow architecture), Denver (NVIDIA’s newish ARM architecture using binary translation) or VISC (Soft Machine’s ‘threadlet’ architecture that got killed by Intel), but I chose to look at the architectures that I consider most promising for general-purpose workloads.

Along with details and comments on these technologies, I have tried to supplement the article with my own untested ideas. These are formulated to get people to focus more directly on the underlying limits of the system, and anyone should feel free to take or extend those ideas if they can make something of them. Not all of my ideas are entirely unique, but I suspect some of them are new.

The fundamental work of an instruction

A typical instruction in a typical instruction set performs something like the following:

Read the operands from the global register file.

Perform the operation.

Write the output to the global register file.

This looks inoffensive until you start looking at the critical path between instructions, especially with a high level of parallelism and where the operation is something cheap, like addition.

Read 16 operands from the global register file simultaneously, at arbitrary locations.

Perform 8 simultaneous operations.

Write the 8 results to the global register file simultaneously, at arbitrary locations.

Read 16 new operands from the global register file simultaneously, and subset of which may be any combination of the just-written results.

Perform 8 simultaneous operations.

…

It is difficult enough to figure out how to build a register file that can support this throughput on a cycle-by-cycle basis. It is another to use this as a primary method for routing data between functional units. You can only really mitigate the former issue, but the latter has a standard and inoffensive solution: operand forwarding.

If the functional units need their output again at a low latency, they route through the highlighted path. Most such instructions still need to be written to the register file unless they are immediately overwritten, in case another instruction later on requires them.

This approach helps the situation, but scaling can still be a challenge. A Skylake CPU has 348 physical registers per core, split over two different register files. In contrast, there are 8 execution pipelines, only half of which have general ALUs, so the forwarding path is much smaller.

Coarse-Grained Out-of-Order (CG-OoO)

Modern cores have enough transistors to build these huge register files — a modern smartphone has billions — but they add latency, long wires and communication costs. The cost also scales with the increasing number of read and write ports. Even more importantly, out-of-order hardware struggles to scale because the core has to keep track of dependencies globally between a vast number of buffered instructions.

CG-OoO is a design intended to break this set of global dependencies whilst still allowing for long-ranged out-of-order execution. The key observation allowing this is that it seems possible to handle dependencies hierarchically, and circumvent the imperfections in the hierarchical approach — inefficient allocation of resources — by overprovisioning the execution resources.

A simplified version of Fig 1. from the CG-OoO paper. Block Windows (BW) feed instructions into execution clusters, each with a number of Execution Units (EU). A BW contains a small, unshared Local Register File (LRF) and a subset of a shared Global Register File (GRF).

Each Block Window (BW) in the diagram above is responsible for the execution of a single basic block, a region of code without internal control flow that generally averages 5–20 instructions long. Each Block Window contains two small register files. One is a low-throughput Local Register File (LRF), which is only visible to the executing basic block, and another is a segment of a larger Global Register File (GRF), which communicates globally to all other Block Windows.

The Local Register File is used for approximately 1 in 3 operands and supports a BW with minimal amortized throughput, at only 2 reads and 2 writes per cycle. The LRF is short, but generously provisioned relative to the length of the basic block, 20 entries long for their models, avoiding hazards without renaming. According to the authors, this makes the LRF a factor of 25 cheaper per access than a 256-long heavily-multiported register file.

The Global Register File is shared and requires renaming — that is, the production of a mapping from the source register file indices to GRF indices such that all semanic writes are to different GRF indices, with reuse only after a later write to the source register file index is marked as non-speculative. Renaming maps writes by a Block Window to the local GRF segment, so only register reads are global.

Out-of-order execution also introduces costs for instruction issue and wakeup: In a fully out-of-order processor, any write to the LRF could make any locally-queued instruction ready to issue, and any write the the GRF could make any instruction ready to issue from any BW. Therefore, CG-OoO performs only a small amount of out-of-order execution, hierarchically. Within each BW, only from the first three or so unissued instructions are candidates for dispatch. Distant parallelism is only extracted by parallel execution of BWs.

The corresponding cost reductions are worth listing explicitly.

Local instruction reordering is cheap, requiring hazard checking hardware and wakeup only over about three to five local instructions. Global wakeup affects only those instructions that take global registers and are in the small set of candidate operations for each BW. Stalled instructions are sufficiently limited in number that they can cache operands inside the issue hardware. They will then be able to read any missing operands in later cycles from the bypass, so only freshly queued instructions ever perform explicit GRF reads. Each GRF segment requires few ports, since GRF reads will be mostly distributed between segments.

The Mill

Whereas CG-OoO allows for significant aggregate throughput at the expense of limiting local throughput to a single instruction per cycle amortized, and underperforms when local ILP is unusually large, the Mill first attempts to obliterate local throughput limits and uses secondary methods to extract nonlocal parallelism.

You can get most of the motivation needed to invent the Mill’s belt by noticing how much more headroom operand forwarding has than large global register files, primarily because of the comparatively miniscule number of sources. If the principle operations in a CPU should be those that can be done fast and cheap, the bypass should be the primary source of data movement and multiported register files should only be involved when necessary for long-term storage.

We can prioritize the use of the bypass by making operands more available from it using a small amount of buffering of the outputs at the end of each pipeline. In synergy, we can make register file operations opt-in; the Mill calls reads and writes to their register file substitute fill and spill.

The Mill is different yet; their register file is replaced with a ‘scratchpad’, which is allocated per frame and accessed with byte offsets. It has a 3-cycle spill-to-fill latency, and their Silver core can handle only 4 spills and fills per cycle in total — entirely insufficient for a similar-width out-of-order, but easily in excess here.

With this it’s easy to see that this gives a greater opportunity for the hardware to scale, and detaches register pressure from overall throughput, this raises a few questions.

How do you build a bypass that scales with the increased buffering in the system? How do you avoid this impacting the critical path latency, relative to the smaller, faster bypass in a standard processor? How much buffering is needed, and how do you prevent this becoming a bottleneck? How does the program interface with the hardware?

The Mill has a straightforward approach to this challenge. A crossbar, a piece of hardware that connects N sources with M sinks, easily scales to the sizes involved here. Unfortunately, larger crossbars have more latency than the original bypass, slowing clock speeds. A solution to this issue is possible by building a fast path into the crossbar.

Latency-1 instructions immediately exit into a designated latency-1 buffer. Higher latency instructions enter corresponding higher-numbered buffers, and latency-1 instructions also migrate to higher-numbered buffers after their first cycle. These latency-1 buffers are small in number and latency-critical, and enter the crossbar fast-path. Instructions with higher latency are less impacted by increased routing delay and often don’t end cleanly on a cycle boundary, so can start their routing earlier in the cycle using the slow-path crossbar.

The amount of buffering needed is of major concern. Luckily, many values are completely temporary, and thus proper scheduling does not need much buffering for these at all. Further, the register file or scratchpad easily handles many longer-lived values. It would seem that few cycles of buffer may well suffice, to cover those outputs that live only just short enough that they cannot be migrated without overhead to the register file or scratchpad.

The Mill resolves some further rough-edges. Operations are migrated between buffers, and automatically spilled to a side buffer, to extend the life of operands in the case of skew pipeline usage. This automated data movement is made practical in part because these pathological cases are those with low throughput, at least in the Mill’s semantic model. When every pipeline is producing values every cycle, operands fall off the belt very fast. The only time operands last a long time on the belt (without being explicitly recovered, and ignoring function calls) is when the belt is moving slowly.

The last question about the program-facing interface to the hardware is left open, and touched on in later sections.

Dependencies, not order

Exploring the CG-OoO’s approach shed some light on the trade-offs required in the construction of an out-of-order processor. However, the previous discussion intentionally did not clarify whether the Mill’s approach was for an in-order or out-of-order core. In fact, very little of the Mill’s approach seems to directly improve the cost-benefit ratio of out-of-order hardware, as it would seem all of the complex wakeup logic would still have to exist.

One might be tempted to consider a CG-OoO-Mill hybrid, if just for perspective. However, it is difficult to imagine; the belt is optimized for throughput on values with short lifetimes, so regardless of whether the core supports out-of-order execution, we must confront the challenge of extracting static parallelism directly.

Mill Computing claims that in-order computing is possible to do fast, and they give quite a few tools to get there. Before analyzing these, a causal understanding of the specific benefits out-of-order approach is warranted. The following shortlist, though brief, should be nearly exhaustive.

Out-of-order hardware handles false dependencies, such as through condition codes and register hazards. Out-of-order hardware allows near-optimal scheduling around control flow. Out-of-order hardware efficiently schedules in the context of variable-latency instructions, particularly for memory access.

1. False dependencies

Exceptions on integer division by zero and condition codes both create incidental dependencies: exceptions prevent the compiler from reordering instructions outside of loops and ifs, or across globally-visible memory writes, and condition codes make hardware speculation more difficult. The solution to both issues is to put these conditions into software, with division by zero resulting in a register value of some sort, and conditions being handled with booleans and comparisons. This solution is so noncontroversial that even the conservative RISC-V architecture takes this approach. Another issue in this category is exceptions on a failed memory load, but memory is better delegated to its own section due to its complexity.

Register hazards are an unavoidable consequence of using a limited number of names to refer to an unbounded number of source-level values. There are really only two issues from these hazards:

Register hazards force storage elements to be used in program order, impeding dynamic reordering without renaming. Register name limits represent an upper bound on data that can be stored in the primary storage medium without renaming.

The Mill is affected by both issues, since the hardware state of the belt varies depending on how the current control block was reached. The Mill uses renaming to solve the first of these issues, and uses a sizeable scratchpad without renaming for the second. Rename hardware is cheap enough that this is not an unacceptable compromise, and the Mill’s rename tables are smaller than a comparable out-of-order processor’s regardless.

Modified Fig 15 (b). from the CG-OoO paper. The energy cost of rename dominates the rest of decode in their modelled processors, with the stage taking 9%, 14% and 0.2% of total execution power for each processor, respectively.

Whilst rename is an unavoidable cost of this sort of runtime-execution mismatch, it’s far from obvious that full, unstructured rename is the best we can do. In fact, it seems a variant of the CG-OoO architecture might allow for completely removing the rename stage, and the associated ad-hoc global communication.

In an out-of-order CPU, each instruction has its own view of the logical global register file — the program-visible registers at that point in the execution — , which are mapped onto some subset of the physical register file. Because a potentially unique view must exist for all of 200+ instructions in the reorder buffer, it’s important that this state is shared. This is what rename is good at.

In the CG-OoO architecture, each Block Window, not each instruction, has its own view of the logical global register file. Each Block Window writes to any given register at most once, and each basic block has a header that declares which registers will be written to. This makes sharing registers less necessary.

Consider if each Block Window had a complete, separate copy of the logical register file, and propagation happened on write, not read. Then each instruction could use only its local registers. Register forwarding paths could exist separately for each logical register, rather than supporting full, any-register-to-any-register communication.

I can think of two interesting approaches to build this. Each logical register could have an associated bus, and each BW’s copy would be instantiated to know which BW to take its result from. This would be efficient, and would only bottleneck when the same logical register is written to multiple times in the same cycle by different BWs. The second would be to allocate basic blocks to BWs in a specified order, and put each logical register on a special bus that captures the overwrite and not-ready semantics directly.

Basic blocks are allocated to BWs 0, 3, 6, 1, 4, 7, 2, 5, 8, 0, …, in that order, to spread out demand for execution units. Signals can propagate through multiple BWs when the red lines are set, which happens when the corresponding BW will never write to that logical register, so the previous value is preserved. When a value is written, the orange-yellow line is enabled, and the BW’s value is pushed to the bus. The blue lines enable skip connections, which might reduce latency.

In this approach, each register communicates point-to-point, and there is no contention whatsoever, but the fixed order of BWs limits the overall depth of runahead execution.

2. Control Flow

The saving grace of computing has in many ways been the branch predictor. If we assume the predictor is 97% accurate, which is not unreasonable, and branch instructions (both taken and untaken) occur once every 5 instructions, the predictor accurately projects a path on average 5 / 0.03 ≈ 167 instructions into the future. The power of an out-of-order processor is that it is able to reorder instructions as it wills within as much of this window as currently processed. In an out-of-order processor, branch instructions terminate dependency chains — that is, no instruction ever depends or thereby waits on a correctly-predicted branch instruction. So too with CG-OoO, where control flow terminates basic blocks, and many successive basic blocks can evaluate in parallel as their dependencies are resolved.

No in-order approach seems to recognize how much work out-of-order hardware puts in to making control flow a manageable issue — albeit the effort it spends to get that work done is well understood. This is the first junction in this story where I will state I feel the Mill is insufficiently powerful to handle the performance demands of the competition. The frequency of control flow makes getting this right so important as to undermine any approach that fails to handle it well, which makes this discussion particularly interesting.

The approaches to try to extract this kind of behaviour from an in-order machine look at first blush quite limited. Branch prediction is still used to hide the instruction fetch and decode delay, and the Mill has a convincing argument that its predictor could be better than the standard approach, so that should not be a concern, and will not be detailed here. Branch delay slots, where a branch waits a number of cycles before happening, are mostly useless for correctly predicted branches, but the Mill does have a related and interesting idea with their ‘phasing’ concept (albeit they sell this as increasing performance more generally, which it does not). The idea there is equivalent in nature to a branch delay slot a fraction of a cycle long, where sink instructions in the exiting block can interleave with non-sink instructions in the entered block. This ‘tears’ the branch over several cycles, avoiding sparse ‘warm-up’ and ‘cool-down’ instructions where units go underutilized.

standard VLIW phasing (program) phasing (execution)

================================================================ read read modify write read read modify read modify write read modify read modify write read modify write read modify write

=====branch====== ======

modify write read modify write read modify write

===branch===

write read modify write read modify write

=====branch====== =======

read read modify write read modify write read modify modify write read modify write write modify write write

It should be noted that the delays in the VLIW example is illustrative of a typical dependency chain, but unusual instruction distributions could actually favour a standard VLIW.

The remaining 90% of the problem with control flow is of two parts.

2.1 Branch instructions impose ordering constraints on the program

Let’s break down the first of these concerns with an example.

a = j + k

b = x + y

c = m * n

if c > a

/ \

d = a + 7 d = a - b

e = b + d e = d + d

Assume purely for sake of demonstration that addition and subtraction have latency 1, whereas multiplication has latency 2, and that all machines are fully pipelined, have no resource contention, have a superscalar width of 2, and have accurate predictors. Different examples exist for cores with larger scalaraties, but are much harder to diagram.

A standard in-order approach struggles to extract parallelism.

Left branch taken Right branch taken c = m * n a = j + k c = m * n a = j + k

b = x + y b = x + y

c > a c > a

d = a + 7 d = a - b

e = b + d e = d + d

An out-of-order processor can produce an optimal trace, assuming the compiler is smart enough to prioritize c even without a directly exposed pipeline.

Left branch taken Right branch taken c = m * n a = j + k c = m * n a = j + k

b = x + y d = a + 7 b = x + y d = a - b

e = b + d c > a e = d + d c > a

There are two common directions for mitigating this in-order disadvantage, typically part of what is termed ‘superblock scheduling’ in compilers: moving instructions to after the branch executes (‘hoist below’), or executing them early, before the branch (‘hoist above’). Host below allows you to overlap the critical path of the instructions in one basic block with the critical path of instructions in another, improving scheduling, but it adds a dependency on the branch at runtime. Hoist above removes a dependency on the branch, but requires simultaneous evaluation of both targets even when the core has accurate predictions of the actual branch target.

Hoist below

Left branch taken Right branch taken c = m * n a = j + k c = m * n a = j + k

(stall) (stall)

c > a c > a

b = x + y b = x + y

d = a + 7 d = a - b

e = b + d e = d + d Hoist above

Left branch taken Right branch taken a = j + k b = x + y a = j + k b = x + y

c = m * n d = a + 7 c = m * n d = a + 7

D = a - b e = d + d D = a - b e = d + d

E = D + D c > a E = D + D c > a

Hoist-above has the additional issue of requiring disambiguation between the two halves of its computation after the fact (you need to pick which values to keep), and risks exponential blowup if you are speculating more than one branch deep. The disambiguation required in hoist-above also forces following instructions to depend on both paths of the computation, independent of whether one side is much shorter and faster than the other. Compromises with less or mixed levels of hoisting suffer a similar mix of these issues.

The Mill intends to tackle this issue mostly with hoist-above, mitigating the cost of these extra computations by having very cheap (sub-cycle) disambiguation instructions and a significantly wider core than its direct competitors. This is wasteful, and still doesn’t schedule instructions efficiently.

Idea: Branch Predict and Branch Verify

Edit: Tapabrata Ghosh (deepnotderp) points to a paper from 2015 that invented this same technique, Branch vanguard: decomposing branch functionality into prediction and resolution instructions.

This issue seems to be resolvable in part, with a few key insights. The out-of-order core is able to start executing down the right path ahead of time not because it went out of order, but because it trusted its predictor. An in-order processor can do this too! Let me introduce two very unique instructions.

brp (Branch Predict): Takes an address to branch to, and takes the branch if and only if the branch predictor suggests to do so. This instruction does not take a predicate as an argument.

brv (Branch Verify): Takes a predicate, and the tag of a brp to update. If the predicate is false, branches to the other address in the brp designated by a bit in the instruction; fallthrough if down the taken path, taken if down the fallthrough path. The verify instruction is never predicted to be taken. If the verify is taken, the branch predictor is updated for the corresponding Branch Predict instruction instead.

This pair may seem unintuitive at first, but it is really just a simple decomposition of parts of a branch instruction that were already there. This pair of instructions allows getting the good schedules of hoist-below without the dependencies it invokes.

Left branch taken Right branch taken brp brp

c = m * n a = j + k c = m * n a = j + k

b = x + y d = a + 7 b = x + y d = a - b

e = b + d brv c ≤ a e = d + d brv c > a

Note that brp exists purely in control-flow, so like static branches it does not take a pipeline slot, and is only displayed to make the order of instructions clear.

There are still struggles with brp , even if one agrees that it improves the situation.

In the case of converging control flow, where multiple branches lead to the same block, hoist-below can fail, since incoming branches will want to hoist different instruction sequences. Fortunately, unconditional converging jumps can be tackled well by hoist-above.

Hoist-below duplicates code down both sides of the branch. The cost of code duplication is exponential in the depth.

The target branch must keep around the alternative branch target’s registers, in case of a failed branch verification, which is linear in depth.

All Branch Predict targets become converging targets, including brv , which requires state to be consistent regardless of which parent you come from. brv could handle this with dedicated recovery code or hardware support, whichever works best.

, which requires state to be consistent regardless of which parent you come from. could handle this with dedicated recovery code or hardware support, whichever works best. Store instructions cannot be performed speculatively without special hardware support, and load instructions still depend on them.

The exponential cost in code duplication, and significant register pressure at higher depths, means brp should still be used fairly shallowly, maybe two deep. This is vastly less than an out-of-order processor can go, so brp is far from a be-all solution on its own.

2.2 Branches outside the current context are opaque to the compiler

A branch instruction to a function, or especially a branch instruction through a pointer, cannot generally have calculations hoisted across it. Static function calls can be inlined, which works pretty well when the cost of inlining is manageable, but those the compiler decides not to, as well as those that cannot be, are a wall to any standard static compiler.

The Mill has some very interesting approaches here, but nothing that changes the overall cost landscape. Function call overhead on the Mill is incredibly low; this is down to some fancy optimizations such as their single-cycle bulk rename. Function calls chain cheaply and don’t interrupt the local instruction schedule. Memory has some interesting but incomplete interactions, which will be covered in a memory section.

Alas, none of this changes the fact that the Mill can’t interleave the evaluation of two called functions. The only directly mitigating factors seem to be that inlining is fairly aggressive, and indirect branches are the most difficult branches for out-of-order hardware too. Perhaps the Mill’s faster calls and lower mispredict penalty will trade equal for this, but the world might not be so kind.

It would be nice to solve this issue, even if it is not problematic to get wrong as the section from before. I do not know how. This concludes the control flow section.

3. Variable-latency instructions

Scheduling statically for variable-latency instructions is hard. However, a convenient truth of hardware is that the efficient ways to calculate results are generally timing-agnostic with respect to their inputs. As far as I know, there are only three exceptions in a standard processor that are difficult to avoid.

Branches, which are expensive when unpredictable. Hardware handles this by scheduling for the predicted scenario and trying to to avoid mispredicts getting too slow, which works very well. Division, except by a constant. Division instructions are rare enough not to strictly require efficient handling, with the Mill taking the unusual approach of just doing divides in software (utilizing fast hardware approximations to reduce the cost). Memory instructions. This is unavoidable and performance critical.

Some other, less ‘standard’ instructions exist, such as transcendentals, which I will skip over.

Because the other two concerns are nearly non-issues, only the third point about memory will be addressed in detail. Other potential long-latency instructions can sometimes be modelled as memory accesses. Further, since memory latency interacts with the previously delegated concerns about memory, the topic deserves a section of its own, and this section will be cut short.

Memory

If the previous section gave the impression that, say, control flow is a particularly tough problem, but inventive methods can give powerful and practical tools to tackle it… good, that was the intent. In this section I want you to take that expectation and dash it against the rocks — memory is hard. Though there are prospective tools to tackle it, nothing truly looks like a solution, and it’s not clear that the most effective-looking solutions are even practical.

Memory is not just an issue because it is complex — but it is!—; memory is an issue because of how easily it can dwarf the concerns of the rest of the system. Whereas a branch predictor might fail once every hundred and fifty instructions, burning a dozen cycles in its wake each time it does, poorly written code may miss on cache at regular intervals, and a DRAM read takes literally hundreds of cycles. One load instruction can take more time to complete than the combined cost of 500 others, just by missing cache. A single load instruction that causes a TLB miss may take vastly longer still.

An inexhaustive overview of the issues involved will aid the following discussion on how different approaches tackle them.

Reordering loads and stores over each other is normally only possible when they do not alias (they refer to different parts of memory). Despite this, hardware frequently wants to start loading from a known address before other addresses that might conflict with it are known. Memory accesses can fault, which normally results in an exception. This further prevents reordering of memory operations. Memory accesses take anywhere from a handful of cycles, commonly 3–4 for the top-level cache, to hundreds in the worst case. It is generally impossible provide guaranteed timings needed for static scheduling. Scheduling for throughput generally requires assuming latencies will be short, and scheduling for long latencies requires mixing distant instructions and can have lower throughput since it extends critical paths. Whether a memory instruction will miss cache is hard to predict. Compilers seldom have any nontrivial understanding of which instructions will be slow. Any given section of code might have multiple optimal schedules, depending on which instruction miss what levels of cache, if any. The prevalence of memory instructions and the small size of the top-level cache means that this risk is not safe to disregard. Hardware can handle multiple cache misses simultaneously, termed memory level parallelism. The opportunity to do so goes up when significant cache misses occur, but extracting this parallelism requires the architecture to be able to execute operations correspondingly far ahead in the program during those cache misses. Speculating on memory can cause side channel attacks, the most famous being Spectre. These concerns will only be discussed at the end of this topic.

Baseline approaches

Two historical approaches to memory are useful as a baseline to compare against. The simplest approach that significantly improves the situation for in-order processors is to delay a stall on an unfinished memory load until the result is demanded by a later instruction. A load-store queue is used to check that aliasing memory accesses do not reorder improperly relative to each other, and the core stalls to resolve conflicts if necessary. In theory, this approach can even avoid stalls on register-register moves, if proper bookkeeping is done. With appropriate scheduling by the compiler, this technique allows some load latency to be hidden. In certain circumstances a load might even be resolved after branches or function calls.

There are many restrictions to this approach. A load instruction cannot be issued as soon as its address is available if it has the potential to depend on a store later in the program; the hardware semantics mean that the declared program order is canonical. The instruction that first uses (‘forces’) the load is scheduled statically, but the optimal position will depend on whether the load stalls. The load consumes a register as soon as it is issued, which increases the cost of delaying the consumer of the result. A memory operation also cannot be issued speculatively in the case that it might fault. This combination of concerns means the distance between the load and the forcing of the load will normally have to be small, so only small latencies can be hidden by it.

Runahead is a much more heavyweight attempt to tackle the problem, with some popularity today, such as in NVIDIA’s Denver architecture. Runahead directs its effort primarily to the problem of extracting memory level parallelism during long memory stalls. After a cache miss has occurred, the processor speculatively executes down the most likely instruction path, ignoring any register values that are not known (eg. the missed load) or take too long to evaluate (eg. other loads that miss cache). When the initial cache miss has resolved, this speculative work is discarded. The intent of this speculative first pass is to start loads early, bringing their values into cache. Although the instructions executed are wasted, after resuming standard execution the core is less likely to miss cache, resulting in overall higher throughput.

Runahead is much cheaper than a traditional out-of-order approach to speculative execution, yet can still comfortably run many hundreds of instructions ahead, through arbitrary control flow and function calls. It tackles the most pathological memory issue, a cold cache, and it can synergize with other approaches. A common extension is to use address prediction, allowing early loads even when addresses depend on register values which are unknown.

Mill Computing’s approach

The Mill uses a variant of the first approach to loads, with several key advantages and — (though the company makes no mention of it) — several meaningful disadvantages. Keep in mind that they have also not announced any sort of runahead execution, so their approach needs to handle the entire long-tail distribution of memory stalls.

The specific approach the Mill has changed over time. The initially advertised approach will be discussed here, and the changes they have made will be discussed in a later security section.

Memory reads on the Mill are split into two parts, the initial load that should be scheduled when the address is first generated, and the retire, which occurs either after a fixed number of cycles or on reaching a specific pickup instruction that references the original load . There is also a refuse should the load be unwanted. In order for loads to be scheduled early, over potentially-hazarding stores, waiting loads must check (‘snoop’) outgoing stores. Loads are therefore semantically ordered by the location of their retire, rather than issue. The unit that handles the load and waits for retire is called the retire station. If a memory access faults, a specific ‘NaR’ data value is returned, and no exception is raised.

The fixed-latency load , which produces its value after a fixed cycle count, is appropriate when the consumer is in a known location relative to the producer, which is common since the Mill is statically scheduled. This operation can pass through control flow, but requires all branches to accept a retiring load at the same point, and so is only really intended to be used in branchless code segments or loops. The pickup -based retire is appropriate for diverging control flow, but tags are statically allocated which raises some specific challenges for code with loops or converging control flow that will be discussed later.

Being able to hoist the initial load over other memory operations makes it much easier to schedule early. A load is also allowed to cross a complete function call invocation, such that it retires only after the call is complete. If a called function uses up the limited retire stations (say, 16), load addresses from outer context frames are saved, and reissued upon return, such that the initial load acted only as a prefetch. If the called function does not exhaust the units, the retire station remains active until its natural retire.

Some major issues are visible in this scheme, though some of them take some effort to uncover.

Issue 1: Function boundaries still prevent hoisting loads

Consider the following code:

int f(int x, int *y) {

return x + *y;

} int g(int *x, int *y) {

return f(*x, y);

}

Unless f is inlined, the Mill is unable to make two important transformations:

The load of x inside g must be performed before entry to f . The load of y inside f cannot be performed until after entry to f .

This means that the Mill will never get memory level parallelism from this example. If x and y both miss the cache, you are spending twice the time waiting on memory.

Mill Computing has told me that they do have an approach that tackles this issue which they are not ready to disclose.

Issue 2: Loads potentially in the critical path cannot be delayed

Consider the following code:

inf f(int **x, int **y) {

return **x + **y;

}

Each of the four load retires are allocated to a statically-determined cycle. The two retires on the left hand side must be in different cycles, and vice-versa for the right. All loads can potentially stall. For a given retire from the left hand side, only one load of the right hand side can potentially be active during that cycle. Therefore a stall in a dereference of x will not always be able to cover a stall in a dereference of y , regardless of how the code is compiled.

This may be illustrated by a specific example. Consider if the first load from each side happens simultaneously, and retire together, and then the second load from each side happens simultaneously, and retire together. If the first load on the left hand side, and the second load from the right hand side misses, those misses cannot be overlapped, despite their proximity in the source code and the availability of the address.

This issue has many forms, not just cascaded loads. Let M represent any expensive computation.

int g(int *x, int *y) {

return M(*x) + *M(y);

}

If neither load misses, the two computations should mostly overlap. If both loads miss, the two computations should not overlap. No static schedule on the Mill can work for both cases.

Issue 3: Converging dataflow

Consider the following code:

int f(bool cond, int *x, int y) {

int ret;

if (cond) {

ret = *x;

} else {

ret = 0;

}

return M(y) + ret;

}

The dereference of x would like to be performed in parallel to the evaluation of M(y) . However, only one side of the if performs a load. On the Mill, unifying these branches requires forcing the load, since its value must be available after the branches converge.

Issue 4: It interacts badly with loop pipelining

Loop pipelining is the structuring of a loop to operate over multiple sequential steps of a different iterations of a loop body. Each iteration of the original loop is split across multiple iterations of the pipelined loop, allowing the pipelined loop to operate with the latency of a single stage of the original loop, but the same per-iteration throughput.

The latency-based load retire is intended for this purpose. This fails in two ways. First, and particularly important, the latency of the load is unable to change over the iterations of the loop. Consider the diagram, showing loop iterations phased over multiple cycles.

Optimal schedule when latency is low or loop has few iterations read exec write

read exec write

read exec write

read exec write Optimal schedule when latency is high or loop has many iterations read ... (many cycles) ... exec write

read ... (many cycles) ... exec write

read ... (many cycles) ... exec write

read ... (many cycles) ... exec write

When the Mill schedules like the first example, it is unable to cover very long misses with its loads. When the Mill schedules like the second example, if the loop ended up with few iterations and the load did not miss cache, many cycles would be wasted waiting on a load that had long been available. An out-of-order machine starts off like the first example in early iterations, and transposes into the second example when a cache miss happens, or over many iterations of the loop if not decode limited.

Secondly, any control flow internal to the loop will still prevent a latency-based load . The pickup instruction will have to be used in this case, but it cannot be pipelined over more than a full iteration of the loop because the tags used to force the load are given statically.

Mill Computing’s approach, cont.

Mill Computing advertise their approach to memory accesses as roughly competitive to out-of-order hardware. This has been demonstrated to not be the case; not in the small sense like for control flow, where small modifications adjusted things back onto the right path, but in reaching, seemingly absolutist terms. As far as I can tell, the Mill’s approach to memory simply does not seem to come close to fixing memory’s performance gap between in-order and out-of-order hardware.

Ideas for a better load

At this point you should have a healthy fear of memory accesses. I will offer four angles to tackle this, but, as previously warned, no approach solves the problem, and some approaches may be expensive or infeasible to implement.

Idea 1: Prodigy-Mill hybrid approach

The Tachyum Prodigy claims very impressive numbers from their simulations; a 4.0 GHz Prodigy is purportedly comparable to a 3.5 GHz Xeon. Their primary disclosed strategy for reducing long-tail memory latency is an unspecified variant of runahead. It is not clear what other techniques were used, but they report a degree of success, outperforming significantly an Itanium core on data cache stall counts.

Runahead is a very widely applicable technique, and may be responsible for much of the difference between these approaches, especially as the Itanium’s load instructions were not themselves entirely naïve. Runahead should be possible to implement on a Mill-style CPU, which makes it a decent candidate for inclusion.

Mill Computing have told me they have considered the idea and found that unfortunately it did not expose significant opportunities. This contradicts what I’ve read in the literature, so I consider this an interesting idea to follow up on when it is easier to do so.

Idea 2: Restricted Asynchronous Dataflow

One subset of concerns that seems possible to tackle independently is related to chains of multiple loads. This is very common in part to do with the commonly hierarchical structure of objects, where access to data may require chained dereferences. Dereferences like a->b->c or a[i][j] use only a small subset of operations:

Memory loads,

Pointer addition with registers or small constants,

Multiplication by small, often power-of-2 constants.

The arithmetic here is much simpler than even the most basic arithmetic pipeline. Additionally, loads always have a significant latency; 3 or 4 cycles for top-level cache is common. These make it sound feasible for this case to be handled with dedicated, overprovisioned hardware.

Consider if every retire station output to a specifically-allocated chain of asynchronous ALUs allowing the given set of operations, and that those ALUs were able to be prepared by the main core during the unit’s minimum load latency. Further, allow those ALUs to loop back into the load unit, buffering a chained load, and let this operation happen recursively up to a hardware-defined limit. As an optional extension to the idea, consider allowing the retire station to share any of its produced values with other unit’s ALUs. That allows operations like a->b->c + a->b->d to happen fully asynchronously.

The path back to the address is for chained loads, and latency-critical. The dotted path is for communication between load units, requires some kind of crossbar, and is unlikely to be latency critical. The ALUs are asynchronous and buffer their sideloaded inputs. Buffering in the ALUs and output register is designed to handle multiply-dereferenced values.

The arithmetic capabilities of this design are naturally limited, and even with the optional inter-unit extension the communication between units is very restricted. As a design for a processor, this would be hopeless. However, it seems plausible that that this could be a realizable concept for the restricted cases given, which opens the door for more asynchronous memory-level parallelism without fully out-of-order hardware.

Two design decisions not discussed are how to enforce memory ordering with potentially-conflicting stores, and how the crossbar is controlled.

Idea 3: Source-agnostic argument pickups

Likely the first thing stopping further delay of a pickup instruction is its inability to handle function calls or converging control flow in the general case.

The first baseline approach, performing pickup implicitly on first use, avoids this problem but means the only known serialization point is the initial load . A simple hybrid approach is possible. A preload instruction replaces the current load and produces a tag that resides on the belt, referring to a retire unit in a prefetch state. A load instruction takes a tag to a preload state retire unit and changes it into a load state, which is the same as preload except that the hardware guarantees order of load s is conserved. A pickup instruction takes any belt value; if it is a value, that value is left unchanged, if it is a tag then the corresponding load is forced and the tag is replaced in place with the value.

If functions must pickup all of their arguments before use (with instructions to make that space-efficient), or have a way to signal to the caller which instructions will be picked up, then loads can cross function boundaries. Because the tags reside on the belt at all times, and pickup is idempotent (you don’t need to know if the value requires pickup ), they can handle more arbitrary control flow and pipelining than the standard approach.

Idea 4: Standard prefetch

One of the issues mentioned before was around long-latency instructions in pipelined loops. The solution here is surprisingly simple. The paper Software Prefetching for Indirect Memory Accesses shows how to use prefetch instructions for pipelinable loops. This should be considered mandatory for in-order hardware.

A major concern with prefetch instructions is that many prefetches are incorrect or unnecessary, wasting hardware bandwidth. For prefetches that conflict with automatic hardware prefetching, compiler heuristics should suffice. Other prefetches require more finesse. Prefetching 32 loads ahead in a loop is effective if the loop will last 32 load instructions more, but otherwise the prefetch is wasted memory bandwidth. Hardware heuristics for which prefetches are useful, including perhaps a prefetch warmup phase for loops, is an approach that would follow historic trends. Another possibility is a conditional prefetch instruction, allowing efficient suppression of prefetches that are known to go off the end of a loop.

Security

Computers are systematically vulnerable to side-channel attacks. For example, an operation may take more power with some inputs than others; therefore information about the value is learnt by looking at power usage. The most common and dangerous attacks are timing attacks. Control flow is one of the largest contributors. The largest, memory, is the topic of this section.

Whenever an address a is used to access memory, the state of the cache may be changed in a way that allows side-channel attacks to discover a . Therefore all values used as addresses should be considered leaked.

The typical response to this has been dismissal — deal with it—because side channel attacks are incredibly difficult to mitigate or avoid, and secure keys and encryption could generally manually circumvent these issues with careful coding. Spectre is a particular instantiation of this exploit that can read arbitrary memory from a process in spite of these safeguards, which is why it is considered such a big deal.

A twice-indirect memory access of the form **a leaks *a . It is common for a to be untrusted and restricted to a given range of allowed values. However, if the values to be accessed are guarded only by control flow, speculative execution may still start a load of *a early, leaking data chosen by an adversary.

This is a major problem, invalidating almost all of the approaches mentioned so far in their naïve incarnations. This affects every out-of-order processor I know of, and the mitigations for it are largely incomplete and hacky. Since CG-OoO is just an out-of-order processor from a memory perspective, that too is vulnerable.

More surprisingly, the Mill’s approach is also susceptible to this. Their load instructions are designed to be hoisted before control flow, and efficient compilation exposes the same side channel as seen in out-of-order processors. Mill Computing published a response to the discovery, claiming that “the Mill architecture is fundamentally immune to Spectre-like vulnerabilities” but instead that their vulnerability to Spectre was “a software bug in the Mill tool chain”. For the first time known to me, the conditional load instructions, loadtr and loadfl , are introduced, which allow for non-speculative condition checking.

This response is concerning. Ultimately, no manufacturer of out-of-order cores has responded to Spectre by abandoning speculation. Early mitigations did so at a heavy performance penalty; future processors will solve the hardest issues in hardware. It seems that either you solve this problem in hardware or you become uncompetitive; the Mill seems to be walking the latter path.

There are several approaches that hardware has which seem promising in avoiding Spectre. For instance, one can attempt to make loads delay writes to cache, allowing side-effect free speculation recovery. One can also focus specifically on the known exploitable variants, and prevent only that subset of speculation that leaks attacker-chosen data, such as by preventing the value of loads to be used speculatively, even if the instruction can be issued speculatively.

However, regardless of whether a mitigation exists, fast memory access is so vital to competitive performance that flushing state on context-switch would be a still be a better compromise than avoiding heavy speculation. This is an issue that must be prioritized to a major degree.

Concluding Remarks

Though this article has ended on a negative note, I do not want that to be the main take-away from this post. Rather, consider that the space of possibilities that is available to us is vastly larger than the space of things that has been tried. These architectures might not be perfect, but they’re making real steps towards solving difficult problems. I have given my own creative ideas in this post, but by no means do CG-OoO, the Mill, and my ideas cover every possible choice available.

If there is some more general thing to be taken from this post, perhaps the following would make a good summary.

A. Throughput

The architecture must be able to route instructions efficiently and with minimal latency between a large number of functional units. The architecture should be designed to handle decode and dispatch in significant excess of the available arithmetic ILP.

B. Dependencies and ILP extraction

The architecture must avoid creating false dependencies of any kind. The architecture should have a wide receptive field over the source-level program in order to see the full breadth of executable instructions.

This is a big ask, but both CG-OoO and the Mill make good steps towards them, and the future, though mysterious and uncertain, holds a good deal of promise.