The Mill is a new CPU architecture designed for very high single-thread performance within a very small power envelope.

The Mill has a 10x single-thread power/performance gain over conventional out-of-order (OoO) superscalar architectures, yet runs the same programs without rewrite.

The Mill is an extremely-wide-issue, statically scheduled design with exposed pipeline. High-end Mills can decode, issue, and execute over thirty MIMD operations per cycle, each cycle.

The Mill architecture is able to pipeline and vectorise almost all loops, including while loops and loops containing calls and flow control.

The pipeline is very short, with a mispredict penalty of just five cycles.

Processor Models

The Mill is a family of processor models. Models differ in various parameters that specify their performance and power constraints, such as the width of their vector units, the functional units, number of pipelines and belt length.

All Mills can run the same code, however. Code is compiled to an intermediate representation for distribution, and specialisation for a particular Mill processor model is done by a specialising compiler at software installation time or on-demand as needed.

The Belt

The Mill is a belt machine. There are no general-purpose registers.

Functional units (FUs) can read operands from any belt position. All results are placed on the front of the belt.

The belt is a fixed size, and the exact size is dependent on the precise Mill model. When values are inserted at the front of the belt, a corresponding number of old values are pushed off the back end of the belt.

The items on the belt are referenced temporally, by their position relative to the front of the belt, which changes as the belt advances. The belt advances some number of items after each cycle, depending upon how many values were returned by ops finishing that cycle.

For example, add is issued with the operands 2 and 5. This scalar add takes one cycle, so before the next cycle the sum of these two values is inserted at the front of the belt, pushing the oldest item off the back end of the belt.

[See talk: The Belt]

Exposed Pipeline

The number of cycles each operation takes is fixed, and known to the compiler. There are no variable-latency instructions other than load.

A pipeline can be told to do one operation each cycle. These operations take some number of cycles, e.g. an add may take one cycle but a multiplication takes 3 cycles.

Even if the pipeline is part-way through an operation, you can still instruct it to start further operations on subsequent cycles, and it will perform these operations in parallel.

For example, a 3-cycle multiplication is started. In the next cycle, a 1-cycle addition is started. The addition inserts its result at the front of the belt at the end of the cycle, while the multiplication is still ongoing. On the next cycle, the result of the addition is on the belt – and can be used as an input to other operations. The multiplication is nearing completion but the pipeline is ready to start another operation.

If two operations in the pipeline finish on the same cycle, they are inserted onto the belt in FIFO order; the operation that was started first is inserted first.

There are very few examples of hazards between operations, and they are CPU -model specific. It may be that two particular operations cannot execute in parallel inside the pipeline, and if this happens the CPU will detect it and fault. But the compiler knows about these exceptions, and does not try and schedule them.

An operation can return more than one result. For example, an integer divide operation may return both the quotient and the remainder, as two separate values. Furthermore, calls (detailed below) can return more than one value too.

Items that fall off the end of the belt are lost forever. Best belt usage results from producers producing as close to their consumers as possible. Belt items requiring an extended lifetime can be saved in Scratch, detailed below.

[See talk: The Belt]

Scalars and Vectors

All Mill arithmetic and logic operations support vectors. Mill operations are SIMD, and scalar values are vectors of length 1.

Each belt item is tagged with its element width and its scalarity (single or full vector of elements). Elements may be 1, 2, 4, 8 or 16 bytes. The vector length is also a fixed length in bytes and is Mill -model -specific.

If a particular Mill -model does not support any particular widths, these are emulated in software transparently by the compiler.

Vector elements can be marked as missing using a special None metadata marker (detailed below). This allows operations on partial vectors.

Operations support all meaningful combinations of operand scalarity. You can add an single integer to every element in a larger vector, for example, or perform a comparison between two vectors.

The Mill has narrow and widen operations, which increase and decrease the width of each element in the operand respectively. Narrow produces a vector of half-width elements and widen produces two vectors of double-width elements.

The Mill also has widen-versions of add and multiply operations, which return vectors of elements double the width of their widest operands, and therefore cannot overflow.

There is an extract operation for extracting elements from vectors to make scalar belt items, and a shuffle operation for rearranging (or duplicating) the elements within a belt item.

The meta-data does not encode the data types. Code can interpret an element as a pointer, float, fraction, signed integer or unsigned integer. There are separate operations for pointers, floating point, fractions, signed integers and unsigned integers. The compiler knows the type of the operands, and encodes this in the operations it issues.

Operations may have different latencies for different element widths. The compiler knows the size and width of all items on the belt at all times, so it knows the latencies. The operations themselves do not encode the operand sizes and the latencies for most sizes are consistent so the compiler can often share code for templated functions which operate on different sized primitives and reuse a single shared function.

[See talk: Metadata]

Instructions and Pipelines

The functional units in a pipeline can, collectively, issue one operation per cycle. But the Mill has lots of pipelines, and each pipeline can issue an operation each cycle, sustained.

The Mill CPU is a VLIW (Very Long Instruction Word) machine. The operations that issue each cycle are collectively termed the instruction.

Different pipelines have different mixes of FUs available, and the number of pipelines and their FU mix is Mill -model specific. There may be 4 or 8 integer pipelines and 2 or 4 floating point pipelines, for example. There are pipelines that handle flow control and pipelines that handle load and saves. In a high-end Mill “Gold” CPU there are 33 pipelines, so the Mill Gold CPU can issue 33 operations per cycle, sustained:

The pipelines in the Mill Gold CPU are:

8 pipes that do single output “reader” operations

4 pipes that can do either integer (including multiply) or binary floating point (also including multiply) operations

4 pipes that can only do integer (not including multiply) operations

4 pipes that can do either immediate constant, load/store or control transfer (branch, call) operations

4 pipes that can do either immediate constant or load/store operations

4 pipes that can do pick operations

5 pipes that can do “writer” operations

Width, Length and Height

The key parameters that define the Mill CPU are called its Width, Length and Height:

Width is the number of pipelines

Length is the number of belt items

Height is the number of elements in a vector

These parameters vary between models.

Phasing

Like most microprocessors, the Mill has a multi-stage pipeline.

The Mill has fewer stages than conventional architectures such as classic RISC or modern OoO processors which typically have 10-14 stages (the Intel Pentium 4 peaked with 31 stages).

On a conventional microprocessor, each instruction passes through several stages. The individual operations in the instruction pass through the same stages at the same time.

However, the Mill issues the individual operations in an instruction in specific phases and Mill phases in an instruction are issued across several cycles.

There are five phases in the Mill instruction:

Decode happens during the first cycle and the reader operations are issued immediately in that first cycle. The operations for the other phases are decoded in parallel to the reader phase in the first cycle, and issued in the second.

In the second cycle, the main operations such as arithmetic and comparisons are issued.

At the end of the second cycle, two special phases run: function calls are dispatched sequentially and return, and then any pick operations run.

The final writer phase happens in the third cycle.

Operations that depend upon the output of other operations must wait for that operation to be completed. On conventional architectures this means that these operations must be in subsequent instructions and execute in subsequent cycles.

On the Mill, the results of a phase are available immediately to the operations in the next phase in the same cycle, and so dependent operations can be in the same instruction if those operations are in sequential phases.

The Mill can chain up to 6 dependent operations together and execute all of them in a single instruction.

This has advantages over a classic approach in tight loops, for example. The Mill is able to perform a strcpy loop as 27 operations in a single instruction, chained together through phasing.

[See talk: Execution]

Speculative Execution

The Mill can perform operations speculatively because operand elements have special meta-data and can be specially marked as invalid (Not a Result; NaR) or missing (None). Individual elements in vectors can be NaR or None.

The Mill can speculate through errors, as errors are propagated forward and only fault when realised by an operation with side effects e.g. a store or branch.

A load from inaccessible memory does not fault; it returns a NaR. If you load a vector and some of the elements are inaccessible, only those are marked as NaR.

NaRs and Nones flow through speculable operations where they are operands. If an operand element is NaR or None, the result is always NaR or None.

If you try and store a NaR, or store to a NaR address, or jump to a NaR address, then the CPU faults. NaRs contain a payload to enable a debugger to determine where the NaR was generated.

For example: a = b? *p: *q; The Mill is able to load both the values pointed at by p and q speculatively because any bad pointers will return NaR and will only fault if the comparison causes them to be stored to a.

Nones are available as a special constant, used by the compiler.

If you try and store a None, or store to a None address, or jump to None, the CPU does nothing.

For example: if(b) a = *p; The Mill can turn this into a = b? *p: None; which allows *p to be loaded speculatively and executes the store only if the condition is met.

Floating point exceptions are also stored in meta-data flags with belt elements. The exceptions (invalid, divide-by-zero, overflow, underflow and inexact) are ORed in operations, and the flags are applied to the global state flags only when values are realised.

There are operations to explicitly test for None, NaR and floating point exceptions.

Technically, None is a kind of NaR; there are several kinds of NaR and the kind is encoded in the value bits. A debugger can differentiate between memory protection errors and divide by zeros, for example, by looking at the kind bits. The remaining bits in the operand are filled with the low-order-bits of a hash identifying the operation which generated the NaR, so the debugger can usually determine this too even if the NaR has propagated a long way.

The None NaR has a higher precedence over all other kinds of NaR so if you perform arithmetic with NaR and None values the result is always None; None is used to discard and mask-out speculative execution.

[See talk: Metadata]

Vectorising while-loops

[See talk: Metadata]

The Mill can vectorise almost all loops, even if they are variable and conditional iteration and contain flow of control or calls. It can do this because it can mask-out elements in the vector that fall outside the iteration count. The Mill performs boolean masking of vector elements using pick and smear operations:

Pick

The pick operation is a hardware implementation of the ternary if: x = cond? a: b;

Pick is a special operation which takes 0 cycles because it executes at the cycle boundary.

Pick takes a condition operand which it interprets as boolean (all values with the low-order bit set are true) and two data operands to pick between, based on the boolean.

The operands can be vector or scalar. If the data operations are vector and the condition operand is vector, the pick performs a per-element pick.

Smear

The smear operation copies a vector of boolean, smearing the first true value it finds to all subsequent elements. For example, 0,1,0,1 smears to 0,1,1,1. (smear operates on vectors of boolean, rather than bits in a scalar.)

The smearx operation copies a vector of boolean, offset one element. The first element is always set to false, and the second result is an exit flag indicating if any true elements were encountered. For example, 0,1,0,1 becomes 0,0,1,1: 1 and 0,0,0,1 becomes 0,0,0,0: 1.

Strcpy example

Please do not use unsafe functions like C’s strcpy! However, it is an elegant illustration of loop vectorisation. while(*a++ = *b++) {} is straightforward to vectorise on the Mill:

a can be loaded as a vector and b can be stored as a vector. Any elements in a that the process does not have permission to access will be NaR, but this will only fault if we try and store them.

The a vector can be compared to 0; this results in a vector of boolean, which is then smearx’ed. This can then be picked with a vector of None into b. The smearx offsetting ensures that the trailing zero is copied from a to b. The second return from smearx, recording if any 0 was found in a, is used for a conditional branch to determine if another iteration of the vectorised loop is required.

The phasing of the strcpy operations allows all 27 operations to be executed in just one cycle, which moves a full maximum vector of bytes each cycle.

Remaining

The remaining operation is designed for count loops.

Given a scalar, the remaining operation produces a vector of boolean and an exit condition that can be used for picks.

Given an array, the remaining operation produces the offset of the first true element.

For example, strncpy is a straightforward extension of strcpy. It would take the n count argument and use the remaining operation to produce a boolean vector to act as a mask, which it then ORs with the strcpy’s result vector from comparing to 0.

Memory

[See talk: Memory]

The Mill has a 64-bit Single Address Space (SAS) architecture with position-independent code. The advantage of SAS is that address aliasing between processes cannot happen, which allows numerous context transitions to happen in the hardware with no OS intervention and still have fairly simple hardware.

On-CPU caches all use virtual addressing, and virtualisation to main memory pages is performed by a Translation Lookaside Buffer (TLB) between the lowest cache level and main memory. The TLB can divide up DRAM pages on a cache-line granularity. The allocation of virtual memory is done in hardware by the TLB using a free list, and only needs Operating System (OS) intervention if the free list is exhausted.

Memory accesses on the L1 are checked by a high-speed Protection Lookaside Buffer (PLB) which operates on logical address ranges. This is executed in parallel with loads, and masks out inaccessible bytes. The PLB can protect ranges down to byte granularity, although large ranges are more common and efficient.

Pointers

Data pointers can be a full 64-bits but can also be 32-bits using special segment registers, and can be relative. There are special segment registers for the stack, generic code, continuations and for thread-local storage (which also enables efficient green threading).

Code pointers are 32-bit using separate special segment registers, and are usually encoded relative to the EBB entry point, which can be efficiently packed in the instruction encoding.

64-bit pointers have 3 reserved high-order bits which can be used for marking by the application e.g. by garbage collectors. There are distinct operations for pointer manipulation e.g. the addp operation for adding offsets. These operations preserve the high bits and produce NaRs if these bits are overflowed into.

Loads

The Mill has an exposed pipeline architecture. The latency of all operations is fixed and known to the compiler; this includes the latency of loads, because the compiler schedules when it needs the value.

The load instruction specifies the number of future cycles when it will retire, and the hardware has until then to pull the value up through the cache hierarchy. The load latency of top-level data cache is just 3 cycles.

Loads are, in truth, variable latency but the scheduling of loads in the future hides this latency and stalls are uncommon.

An EBB will typically issue its load instructions as early as possible, set to retire close to their consumers, and perform meaningful computations in the intervening cycles.

A pickup load operation specifies a handle rather than a cycle to retire on. Code can use the pickup operation to retire the load at a later point. Pickup loads can be cancelled.

The Mill CPU has a fixed number of retire stations, and called code may evict a caller’s in-flight loads if there are no free retire stations. The caller’s loads are serialised to spill and re-issued when the call returns. Loads in-flight across expensive calls can therefore act as a pre-fetch, bringing values from main memory into cache while the expensive calls are executing, even if those calls evict the caller’s retire stations.

Aliasing

Mill load operations return the value of memory at their retire time (rather than at issue time) in order to allow loads to be issued before potentially aliasing stores, as long as their retire time remains after those stores.

When a store issues, it broadcasts the range it is storing to to all retire stations. If a store range intersects an in-flight load range, the station re-issues its load.

This makes the Mill immune to false aliasing, and as the store ranges are propagated to all cores on a chip, immune to false sharing too.

Scratch

There is a small amount of per-call-frame very fast memory called scratch that can be used to store long-lived values from the belt. This is where code can store values that would otherwise fall off the end of the belt.

Code can explicitly spill and fill to and from scratch, which takes just 3 cycles.

Scratch preserves floating point exception flags, NaR and None markers and other meta-data.

Functions must declare up-front how much scratch space they need for temporary storage, and this space is private to the call; functions cannot see the scratch used by other functions, and their scratch is lost when they return.

Stack

Stack is special address space that is private to the function that allocates it. It is allocated in fixed cache-line lengths using the stackf operation, and is implicitly initialized to zero. When the function returns, the stack is automatically discarded.

Implicit Zero and Backless Memory

Reads from uninitialised memory are automatically zeroed. This includes main memory, and the belt, stack and scratch.

Cache lines and stack lines contain a validity mask and uninitialized values are zeroed by a load.

Reads from memory that is not committed in the TLB returns zeros, without committing the RAM. RAM is only committed on-demand when cache-lines are evicted from the bottom of the cache hierarchy and written to main memory.

Code Blocks and Calls

Code is organised into Extended Basic Blocks (EBBs). Each EBB has exactly one entry point, but may have many exit points. Execution cannot fall off the end of an EBB – there is always terminating branch or return operation.

Execution can only jump – by branches within an EBB, or calls to an EBB – to the entry point.

The Mill has built-in – and therefore accelerated – calling support. There is a call operation, which takes just one cycle. Parameters to the call can be passed in copies of caller belt items (specifying items and their order on the callee’s new belt).

EBBs that are called are functions, and have their own private belt, stack and scratch. EBBs that are branched to inside a function share this belt and scratch.

A function can return multiple values back to the belt of the caller. The caller specifies how many belt items it is expecting to be returned; if this does not match the actual number returned, the thread faults.

Call and branch dispatch takes just one cycle. There is a predictor that is pre-fetching likely-to-be-executed blocks.

From the perspective of the caller, calls behave just like 1 cycle operations. If a caller has some other operations in flight during a call, these retire after the call has returned.

EBBs are a common organisation in compilers, and the Mill target is a near 1:1 translation of the compiler’s EBB and SSA internal representations.

[See talk: The Belt]