For these "decoupled" superscalar x86 processors, register renaming is absolutely critical due to the meager 8 registers of the x86 architecture. This differs strongly from the RISC architectures, where providing more registers via renaming only has a minor effect. Nonetheless, with clever register renaming, the full bag of RISC tricks becomes available to the x86 world, with two exceptions: advanced static instruction scheduling (because the micro-instructions are hidden behind the x86 layer and thus less visible to compilers) and the use of a large register set to avoid memory accesses (because x86 only has 8 architecturally-visible registers).

The solution, invented independently (at about the same time) by engineers at both NexGen and Intel, was to dynamically decode the x86 instructions into simple, RISC-like micro-instructions, which can then be executed by a fast, RISC-style register-renaming out-of-order (OOO) superscalar core.
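
To make this concrete, here is a minimal sketch in C of what a decoded micro-op might look like. The format, field names and register numbers are all hypothetical, purely for illustration; real decoders and micro-op encodings are proprietary and differ between designs.

```c
#include <stdio.h>

/* A hypothetical micro-op format, not the encoding of any real processor. */
typedef enum { UOP_LOAD, UOP_ADD } UopKind;

typedef struct {
    UopKind kind;
    int dest, src1, src2;   /* renamed register numbers; "tmp" is invisible to x86 code */
    int offset;             /* address offset for loads */
} MicroOp;

int main(void) {
    /* "add eax, [ebx+8]" might crack into two RISC-like micro-ops:
     *   tmp = load [ebx + 8]
     *   eax = eax + tmp                                             */
    MicroOp uops[2] = {
        { UOP_LOAD, /*dest: tmp*/ 8, /*src1: ebx*/ 3, /*src2*/ -1, /*offset*/ 8 },
        { UOP_ADD,  /*dest: eax*/ 0, /*src1: eax*/ 0, /*src2: tmp*/ 8, /*offset*/ 0 },
    };
    printf("cracked into %d micro-ops\n", (int)(sizeof uops / sizeof uops[0]));
    return 0;
}
```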

While the Pentium, a superscalar x86, was an amazing piece of engineering, it was clear that the big problem was the complex and messy x86 instruction set. Complex addressing modes and a minimal number of registers meant that few instructions could be executed in parallel due to potential dependencies. For the x86 camp to compete with the RISC platforms, they needed to find a way to "get around" the x86 instruction set.

So where does x86 fit into all this, and how have Intel and AMD been able to remain competitive through all of these developments in spite of an architecture that's now more than 20 years old?

No matter which route is taken, the key problem is still the same: normal programs just don't have a lot of fine-grained parallelism in them. A 4-issue superscalar processor requires four independent instructions to be available, with all their dependencies and latencies met, at every cycle. In reality this is virtually never possible, especially with load latencies of two or three cycles (and increasing with every processor generation). Currently, real-world instruction-level parallelism for mainstream applications is limited to about 2 instructions per cycle at best. Certain types of applications do exhibit more parallelism, such as scientific code, but these are generally not representative of mainstream applications. There are also some types of code, such as pointer chasing, where even sustaining 1 instruction per cycle is extremely difficult. For those programs, the key problem is the memory system (which we'll get to later).
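
The pointer-chasing case is easy to see in C. In a minimal sketch like the following linked-list walk, each load's address comes from the previous load, so the loop is serialized on the load latency no matter how wide the processor is:

```c
/* Each iteration's load address comes from the previous iteration's load,
 * so the loop can never run faster than one node per load latency. */
typedef struct Node { struct Node *next; int payload; } Node;

int sum_list(const Node *p) {
    int sum = 0;
    while (p != NULL) {
        sum += p->payload;  /* needs the pointer loaded last time around */
        p = p->next;        /* and this load feeds the *next* iteration  */
    }
    return sum;
}
```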

DEC, for example, went primarily speed-demon with the first two generations of Alpha, then changed to brainiac for the third generation. MIPS did similarly. Sun, on the other hand, went brainiac with their first superscalar then switched to speed-demon for more recent designs. The PowerPC camp has also gradually moved away from brainiac designs over the years, although the reservation stations in all PowerPC designs do offer a degree of OOO execution between different functional units even if the instructions within each functional unit's queue are executed strictly in order. Intel is sort-of going both ways at once: x86 processors have no choice but to be at least somewhat brainiac due to limitations of the x86 architecture (though the Pentium-4 is about as speed-demon as possible for a decoupled x86 microarchitecture), but with IA64, Intel is betting solidly on the smart-compiler approach, with a simple but very wide design relying totally on static scheduling (at least in the first two generations).

When it comes to the brainiac debate, many vendors have gone down one path, then changed their minds and switched to the other side...

Exactly which is the more important factor is currently open to hot debate. At present, it seems that both the benefits and the costs of OOO execution have been somewhat overstated. In terms of cost, appropriate pipelining of the dispatch and register renaming logic has allowed OOO processors to achieve clock speeds competitive with simpler designs: the aggressively OOO Alpha 21264, moderately OOO PowerPC G4e, and simpler in-order UltraSPARC-III are all available at comparable clock speeds, for example. This is a testament to some outstanding engineering by processor architects. Unfortunately, however, the effectiveness of OOO execution in dynamically extracting additional instruction-level parallelism has been disappointing, with only a relatively small improvement being seen[3]. OOO execution has also been unable to deliver the degree of schedule-insensitivity originally hoped for, with recompilation still producing large speedups even on aggressive OOO processors such as the MIPS R10000 and Alpha 21264.

Brainiac designs are at the smart-machine end of the spectrum, like IBM's POWER2 and the MIPS R10000, whereas speed-demon designs like the Alpha 21164 and UltraSPARC rely on a smart compiler. Clearly, OOO hardware should make it possible for more instruction-level parallelism to be extracted, because things will be known at runtime that cannot be predicted in advance (cache misses, for example). On the other hand, an in-order design is potentially able to run at faster clock speeds at any given point in time, due to reduced design complexity.

Whether compilers can handle the task of instruction scheduling well enough is currently a hot question in the hardware industry. This is called the brainiac vs speed-demon debate. This simple (and fun) classification of design styles first appeared in a 1993 Microprocessor Report editorial by Linley Gwennap, and was made widely known by Dileep Bhandarkar's Alpha Implementations & Architecture book.

Most of the early superscalars were in-order designs (SuperSPARC, hyperSPARC, UltraSPARC, Alpha 21064 & 21164). Examples of OOO designs include the MIPS R10000, Alpha 21264 and to some extent the entire POWER/PowerPC line (with their reservation stations). UltraSPARC-III is the most notable recent design which has stayed in-order and not added any OOO execution hardware.

The compiler approach also has some other advantages over OOO hardware: it can see further down the program than the hardware, and it can speculate down multiple paths rather than just one (a big issue if branches are unpredictable). On the other hand, a compiler can't be expected to be psychic, so it can't necessarily get everything perfect all the time. Without OOO hardware, the pipeline will stall when the compiler fails to predict something like a cache miss.

Another approach to the whole problem is to have the compiler optimize the code by rearranging the instructions (called static, or compile-time, instruction scheduling). The rearranged instruction stream can then be fed to a processor with simpler in-order multiple-issue logic, relying on the compiler to "spoon feed" the processor with the best instruction stream. Avoiding the need for complex OOO logic should make the processor quite a lot easier to design, and should potentially make it easier to ramp up the clock speed over time.

All of this dependency analysis, register renaming and OOO execution adds a lot of complex logic to the processor, making it harder to design and potentially harder to ramp up the clock speed over time. On the other hand, it offers the advantage that software need not be recompiled to get at least some of the benefits of the new processor's design (though typically not all).

If the processor is going to execute instructions out of order, it will need to keep in mind the dependencies between those instructions. This can be made easier by not dealing with the raw architecturally-defined registers, but instead using a set of renamed registers. For example, consider a store of a register into memory, followed by a load of some other piece of memory into the same register: the two uses of that register represent completely different values, and need not go into the same physical register. Furthermore, if these different instructions are mapped to different physical registers they can be executed in parallel, which is the whole point of OOO execution. So, the processor must keep a mapping of the instructions in flight at any moment and the physical registers they use. This process is called register renaming. As an added bonus, it becomes possible to work with a potentially larger set of real registers in an attempt to extract even more parallelism out of the code.
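
As a rough illustration, a renamer can be modeled as little more than a lookup table from architectural to physical register numbers. The following C sketch assumes 8 architectural and 32 physical registers (illustrative numbers) and omits everything a real renamer must also handle: free-list management, register reclamation, and recovery after mispredicted branches.

```c
/* A toy rename table: architectural -> physical register mapping. */
#define ARCH_REGS 8
#define PHYS_REGS 32

static int rename_map[ARCH_REGS];   /* current mapping                  */
static int next_phys = ARCH_REGS;   /* naive allocator, no reclamation  */

/* A source operand simply reads the current mapping... */
static int rename_src(int arch_reg) { return rename_map[arch_reg]; }

/* ...but each destination gets a fresh physical register, so a later
 * reuse of (say) r1 no longer conflicts with an earlier in-flight value. */
static int rename_dest(int arch_reg) {
    rename_map[arch_reg] = next_phys++ % PHYS_REGS; /* wraps; ignores reclaim */
    return rename_map[arch_reg];
}
```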

There are two ways to do this. One approach is to do the reordering in hardware at runtime. Doing dynamic instruction scheduling (reordering) in the processor means the dispatch logic must be enhanced to look at groups of instructions and dispatch them out of order as best it can to make use of the processor's functional units. Not surprisingly, this is called out-of-order execution, or just OOO for short (sometimes written OoO or OOE).

If branches and long latency instructions are going to cause bubbles in the pipeline(s), then perhaps those empty cycles can be used to do other work. To achieve this, the instructions in the program must be reordered so that while one instruction is waiting, other instructions can execute. For example, it might be possible to find a couple of other instructions from further down in the program and put them between the two instructions in the earlier multiply example.
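
In C-level terms, that kind of reordering looks something like the following sketch. The cycle behavior described in the comments is illustrative only; real latencies depend on the processor, and either the compiler or OOO hardware may perform the transformation:

```c
/* Filling a multiply's latency with independent work. */
int unscheduled(int a, int b, int c, int x, int y) {
    int t = a * b;   /* multi-cycle multiply                     */
    int r = t + c;   /* depends on t: stalls until t is ready    */
    int u = x + y;   /* independent work, done too late to help  */
    return r + u;
}

int scheduled(int a, int b, int c, int x, int y) {
    int t = a * b;   /* start the multiply...                     */
    int u = x + y;   /* ...fill its latency with independent work */
    int r = t + c;   /* by now t is ready (or nearly so)          */
    return r + u;
}
```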

Interestingly, the ARM architecture used in many portable handheld devices was the first architecture with a fully predicated instruction set. This is even more intriguing given that the ARM processors only have short pipelines and thus relatively small mispredict penalties.

The Alpha architecture has had a conditional move instruction since the very beginning. MIPS, SPARC and x86 added it later. With IA64, however, Intel has gone all-out and made almost every instruction predicated in the hope of dramatically reducing branching problems in inner loops, especially ones where the branches are unpredictable (such as compilers and OS kernels). It will be interesting to see whether this works out well or not for IA64.

Of course, if the blocks of code in the if and else cases were longer, then using predication would mean executing more instructions than using a branch, because the processor is effectively executing both paths through the code. Whether it's worth executing a few more instructions to avoid a branch is a tricky decision: for very small or very large blocks the decision is simple, but for medium-sized blocks there are complex trade-offs which the optimizer must consider.

Given this new predicated move instruction, two instructions have been eliminated from the code, and both were costly branches. In addition, by being clever and always doing the first mov then overwriting it if necessary, the parallelism of the code has also been increased: lines 1 and 2 can now be executed in parallel, resulting in a 50% speedup (2 cycles rather than 3). Most importantly, though, the possibility of getting the branch prediction wrong and suffering a large mispredict penalty has been eliminated.

Here, a new instruction has been introduced called cmovle, for "conditional move if less than or equal". This instruction works by executing as normal, but only commits itself if its condition is true. This is called a predicated instruction because its execution is controlled by a predicate (a true/false test).
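
In C terms, the transformation being described looks roughly like this (a sketch only; which version a real compiler emits, and whether it uses a conditional move at all, depends on the compiler and target):

```c
/* Branch version: compiled naively, this needs a conditional branch
 * (plus an unconditional one) around the two assignments. */
int with_branch(int a, int b, int c, int d) {
    if (a > 7) b = c;
    else       b = d;
    return b;
}

/* Predicated version: always do the first move, then conditionally
 * overwrite it; a cmovle-style instruction commits only if a <= 7. */
int with_cmov(int a, int b, int c, int d) {
    b = c;              /* unconditional mov       */
    if (a <= 7) b = d;  /* becomes: cmovle d, b    */
    return b;
}
```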

Consider the above example once again. Of the five instructions, two are branches, and one of those is an unconditional branch. If it were possible to somehow tag the mov instructions to tell them to execute only under some conditions, the code could be simplified...

Conditional branches are so problematic that it would be nice to eliminate them altogether. Clearly, if statements cannot be eliminated from programming languages, so how can the resulting branches possibly be eliminated? The answer lies in the way some branches are used.

Unfortunately, even the best branch prediction techniques are sometimes wrong, and with a deep pipeline many instructions might need to be cancelled. This is called the mispredict penalty. The Pentium-Pro/II/III is a good example: it has a 12+ stage pipeline and thus a mispredict penalty of 10-15 cycles. Even with a clever dynamic branch predictor that correctly predicts an impressive 90% of the time, this high mispredict penalty means about 30% of the Pentium-Pro/II/III's performance is lost due to mispredicts. Put another way, one third of the time the Pentium-Pro/II/III is not doing useful work but instead is saying "oops, wrong way".
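
The arithmetic behind that figure can be checked with a back-of-envelope sketch, using the numbers quoted above (a branch roughly every 6 instructions, 90% prediction accuracy, and taking the penalty as 10 cycles on an ideal 3-issue machine):

```c
#include <stdio.h>

int main(void) {
    double branches_per_insn = 1.0 / 6;  /* a branch every ~6 instructions */
    double mispredict_rate   = 0.10;     /* 90% predicted correctly        */
    double penalty_cycles    = 10.0;     /* low end of the 10-15 range     */
    double base_cpi          = 1.0 / 3;  /* an ideal 3-issue machine       */

    double extra_cpi = branches_per_insn * mispredict_rate * penalty_cycles;
    printf("cycles wasted: %.0f%%\n",
           100.0 * extra_cpi / (base_cpi + extra_cpi)); /* ~33%: a third */
    return 0;
}
```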

The other alternative is to have the processor make the guess at runtime. Normally, this is done by using an on-chip branch prediction table containing the addresses of recent branches and a bit indicating whether each branch was taken or not last time. In reality, most processors actually use two bits, so that a single not-taken occurrence doesn't reverse a generally taken prediction (important for loop back edges). Of course, this dynamic branch prediction table takes up valuable space on the processor chip, but branch prediction is so important that it's well worth it.
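
The two-bit scheme is simple enough to sketch directly. The following C model implements the classic saturating counter behind each table entry; the state names are descriptive, not taken from any particular processor's documentation:

```c
/* A two-bit saturating counter: one not-taken outcome weakens a "taken"
 * prediction without flipping it, which is what loop back edges need. */
typedef enum { STRONG_NT, WEAK_NT, WEAK_T, STRONG_T } PredState;

int predict_taken(PredState s) { return s >= WEAK_T; }

PredState update(PredState s, int taken) {
    if (taken)
        return (s == STRONG_T) ? STRONG_T : (PredState)(s + 1);
    else
        return (s == STRONG_NT) ? STRONG_NT : (PredState)(s - 1);
}
```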

The key question is how the processor should make the guess. Two alternatives spring to mind. First, the compiler might be able to mark the branch to tell the processor which way to go. This is called static branch prediction. It would be ideal if there was a bit in the instruction format in which to encode the prediction, but for older architectures this is not an option, so a convention can be used instead (such as backward branches are predicted to be taken while forward branches are predicted not-taken). More importantly, however, this approach requires the compiler to be quite smart in order for it to make the correct guess, which is easy for loops but might be difficult for other branches.

So the processor must make a guess. The processor will then fetch down the path it guessed and speculatively begin executing those instructions. Of course, it won't be able to actually commit (writeback) those instructions until the outcome of the branch is known. Worse, if the guess is wrong the instructions will have to be cancelled, and those cycles will have been wasted. But if the guess is correct the processor will be able to continue on at full speed.

Now consider a pipelined processor executing this code sequence. By the time the conditional branch at line 2 reaches the execute stage in the pipeline, the processor must have already fetched and decoded the next couple of instructions. But which instructions? Should it fetch and decode the if branch (lines 3 & 4) or the else branch (line 5)? It won't really know until the conditional branch gets to the execute stage, but in a deeply pipelined processor that might be several cycles away. And it can't afford to just wait: the processor encounters a branch every six instructions on average, and if it were to wait several cycles at every branch then most of the performance gained by using pipelining in the first place would be lost.

Latencies for memory loads are particularly troublesome, in part because they tend to occur early within code sequences, which makes it difficult to fill their delays with useful instructions, and equally importantly because they are somewhat unpredictable: the load latency varies a lot depending on whether the access is a cache hit or not (we'll get to caches later).

From a compiler's point of view, typical latencies in modern processors range from a single cycle for integer operations, to around 3-6 cycles for floating-point addition and the same or perhaps slightly longer for multiplication, through to over a dozen cycles for integer division.

The number of cycles between when an instruction reaches the execute stage and when its result is available for use by other instructions is called the instruction's latency[2]. The deeper the pipeline, the more stages and thus the longer the latency. So a very deep pipeline is not much more effective than a short one, because a deep one just gets filled up with bubbles thanks to all those nasty instructions depending on each other.

If the first instruction was a simple integer addition then this might still be okay in a pipelined single-issue processor, because integer addition is quick and the result of the first instruction would be available just in time to feed it back into the next instruction (using bypasses). However in the case of a multiply, which will take several cycles to complete, there is no way the result of the first instruction will be available when the second instruction reaches the execute stage just one cycle later. So, the processor will need to stall the execution of the second instruction until its data is available, inserting a bubble into the pipeline where no work gets done.

The second instruction depends on the first: the processor can't execute the second instruction until after the first has completed calculating its result. This is a serious problem, because instructions that depend on each other cannot be executed in parallel. Thus, multiple issue is impossible in this case.

How far can pipelining and multiple issue be taken? If a 5 stage pipeline is 5 times faster, why not build a 20 stage superpipeline? If 4-issue superscalar is good, why not go for 8-issue? For that matter, why not build a processor with a 50 stage pipeline which issues 20 instructions per cycle?

No VLIW designs have yet been commercially successful; however, Intel's IA64 architecture, which is now in production in the form of the Itanium-I/II processors, was intended to be the replacement for x86 (and may still end up that way, although this is looking increasingly unlikely). Intel chose to call IA64 an "EPIC" design, for "explicitly parallel instruction computing", but it's basically just a VLIW with clever grouping (to allow long-term compatibility) and predication (see below). Many graphics processors (often called GPUs) can in some ways be considered VLIW-like (although obviously they only provide single-purpose instruction sets), and there's also Transmeta (see the x86 section, coming up soon).

It is worth noting, however, that most VLIW designs are not interlocked. This means they do not check for dependencies between instructions, and often have no way of stalling instructions other than to stall the whole processor on a cache miss. As a result, the compiler needs to insert the appropriate number of cycles between dependent instructions, even if there are no instructions to fill the gap, by using nops (no-operations) if necessary. This complicates the compiler somewhat, because it is doing something that a superscalar processor would normally do at runtime; however, the extra code in the compiler is minimal, and it saves precious resources on the processor chip.

Other than the simplification of the dispatch logic, VLIW processors are much like superscalar processors. This is especially so from a compiler's point of view (more on this later).

A VLIW processor's instruction flow is much like a superscalar, except that the decode/dispatch stage is much simpler and only occurs for each group of sub-instructions...

In this style of processor, the "instructions" are really groups of little sub-instructions, and thus the instructions themselves are very long (often 128 bits or more), hence the name VLIW: very long instruction word. Each instruction contains information for multiple parallel operations.
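
A hypothetical 3-slot VLIW word might be modeled like this in C. The slot types, widths and template field are invented for illustration and don't match any real encoding; real formats pack their slots more tightly:

```c
#include <stdint.h>

/* A made-up 128-bit VLIW word with one slot per functional unit. */
typedef struct {
    uint32_t int_op;        /* slot 0: integer operation        */
    uint32_t fp_op;         /* slot 1: floating-point operation */
    uint32_t mem_op;        /* slot 2: load or store            */
    uint32_t template_bits; /* grouping/routing information     */
} VliwWord;                 /* all slots issue together, in one cycle */
```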

In cases where backward compatibility is not an issue, it is possible for the instruction set itself to be designed to explicitly group instructions to be executed in parallel. This approach eliminates the need for complex dependency checking logic in the dispatch stage, which should make the processor easier to design (and easier to ramp up the clock speed over time, at least in theory).

The issue-widths of current processors range from 2-issue (MIPS R5000) to 3-issue (PowerPC G3/G4, Pentium-Pro/II/III/M, Athlon, Pentium-4 (well, sort-of)) or 4-issue (UltraSPARC, MIPS R10000, Alpha 21164 & 21264, PowerPC G4e) or 5-issue (PowerPC G5), or even 6-issue (Itanium-I/II, but it's a VLIW; see below). The exact number and type of functional units in each processor depends on its target market. Some processors have more floating-point execution resources (MIPS R8000, IBM's POWER line, Athlon[1]), others are more integer-biased (Pentium-Pro/II/III/M, PowerPC G3), some devote much of their resources towards SIMD vector instructions (PowerPC G4 & G4e), and many take the middle ground (UltraSPARC, MIPS R10000, Alpha 21164 & 21264, Pentium-4, Itanium-I/II, PowerPC G5).

Of course, there's nothing stopping a processor from having both a deep pipeline and multiple instruction issue, so it can be both superpipelined and superscalar at the same time...

The IBM POWER1 processor, the predecessor of PowerPC, was the first mainstream superscalar processor. Most of the RISCs went superscalar soon after (SuperSPARC, Alpha 21064). Intel even managed to build a superscalar x86, the original Pentium, although the complex x86 instruction set was a real problem for them (more on this later).

Note that the issue-width is less than the number of functional units; this is typical. There must be more functional units because different code sequences have different mixes of instructions. The idea is to execute 3 instructions per cycle, but those instructions are not always going to be 1 integer, 1 floating-point and 1 memory operation, so more than 3 functional units are required.

This is great! There are now 3 instructions completing every cycle (CPI = 0.33, or IPC = 3). The number of instructions able to be issued or completed per cycle is called a processor's width.

In the above example, the processor could potentially execute 3 different instructions per cycle: for example one integer, one floating-point and one memory operation. Even more functional units could be added, so that the processor might be able to execute two integer instructions per cycle, or two floating-point instructions, or whatever the target applications could best use.

Of course, now that there are independent pipelines for each functional unit, they can even be different lengths. This allows the simpler instructions to complete more quickly, reducing latency (which we'll get to soon). There are also a bunch of bypasses within and between the various pipelines, but these have been left out for simplicity.

Since the execute stage of the pipeline is really a bunch of different functional units, each doing its own task, it seems tempting to try to execute multiple instructions in parallel, each in its own functional unit. To do this, the fetch and decode/dispatch stages must be enhanced so that they can decode multiple instructions in parallel and send them out to the "execution resources"...

Today, most processors strive to keep the number of levels of logic down to just a handful for each pipeline stage (about 10-20 levels), and most have quite deep pipelines (4-7 in PowerPC G3/G4, 5-7 in MIPS R10000, 7-9 in Alpha 21164, 7-12 in PowerPC G4e, 8-10 in Itanium-II, 9 in UltraSPARC, 10-15 in Athlon, 12+ in Pentium-Pro/II/III, 14 in UltraSPARC-III, 16-25 in PowerPC G5, 20+ in Pentium-4). The x86 processors generally have deeper pipelines than the RISCs because they need to do extra work to decode the x86 instructions (more on this later).

The Alpha architects in particular liked this idea, which is why the early Alphas had very deep pipelines and ran at very high clock speeds for their era. The MIPS R4000 series was also superpipelined (in fact the R4000 was only superpipelined, not superscalar; see below).

Since the clock speed is limited by (among other things) the length of the longest stage in the pipeline, the logic of each stage can be subdivided, especially the longer stages, converting the pipeline into a deeper super-pipeline with a larger number of shorter stages. Then the whole processor can be run at a higher clock speed! Of course, each instruction will now take more cycles to complete (latency), but the processor will still be completing 1 instruction per cycle (throughput), and there will be more cycles per second, so the processor will complete more instructions per second (actual performance)...
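
A toy calculation shows the appeal, using made-up stage delays:

```c
#include <stdio.h>

int main(void) {
    /* Suppose the longest of 5 stages contains 2 ns of logic: 500 MHz. */
    double five_stage_mhz = 1000.0 / 2.0;
    /* Split every stage in two, so the longest is ~1 ns: 1000 MHz.
     * Each instruction now takes 10 (shorter) cycles instead of 5,
     * but one still completes per cycle, so throughput doubles.     */
    double ten_stage_mhz = 1000.0 / 1.0;
    printf("%.0f MHz -> %.0f MHz\n", five_stage_mhz, ten_stage_mhz);
    return 0;
}
```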

The early RISC processors, such as IBM's 801 research prototype, the MIPS R2000 (based on the Stanford MIPS machine) and the original SPARC (derived from the Berkeley RISC project), all implemented a simple 5 stage pipeline not unlike the one shown above (the extra stage is for memory access, placed after execute). At the same time, the mainstream 80386, 68030 and VAX processors worked sequentially using microcode (it's easier to pipeline a RISC because the instructions are all simple register-to-register operations, unlike x86, 68k or VAX). As a result, a SPARC running at 20 MHz was way faster than a 386 running at 33 MHz. Every processor since then has been pipelined, at least to some extent. A good summary of the original RISC research projects can be found in a 1985 CACM research paper by David Patterson.

Although the pipeline stages look simple, it is important to remember that the execute stage in particular is really made up of several different groups of logic (several sets of gates), making up different functional units for each type of operation that the processor must be able to perform...

Since the result from each instruction is available after the execute stage has completed, the next instruction ought to be able to use that value immediately, rather than waiting for that result to be committed to its destination register in the writeback stage. To allow this, forwarding lines called bypasses are added, going backwards along the pipeline...

At the beginning of each clock cycle, the data and control information for a partially processed instruction is held in a pipeline latch, and this information forms the inputs to the logic circuits of the next pipeline stage. During the clock cycle, the signals propagate through the combinatorial logic of the stage, producing an output just in time to be captured by the next pipeline latch at the end of the clock cycle...

From the hardware point of view, each pipeline stage consists of some combinatorial logic and possibly access to a register set and/or some form of high speed cache memory. The pipeline stages are separated by latches. A common clock signal synchronizes the latches between each stage, so that all the latches capture the results produced by the pipeline stages at the same time. In effect, the clock "pumps" instructions down the pipeline.
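
In software terms, the clocking scheme can be modeled as "compute every stage's output from the current latches, then update all the latches at once". Here is a minimal C sketch of three latches; the stage names are illustrative only:

```c
/* Each call models one clock edge: every value moves exactly one
 * latch down the pipeline, and all latches update together. */
typedef struct { int fetched, decoded, executed; } Latches;

Latches clock_edge(Latches cur, int incoming_instruction) {
    Latches next;
    next.fetched  = incoming_instruction; /* a new instruction enters     */
    next.decoded  = cur.fetched;          /* last cycle's fetch decodes   */
    next.executed = cur.decoded;          /* last cycle's decode executes */
    return next;
}
```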

Modern processors overlap these stages in a pipeline, like an assembly line. While one instruction is executing, the next instruction is being decoded, and the one after that is being fetched...

Consider how an instruction is executed: first it is fetched, then decoded, then executed by the appropriate functional unit, and finally the result is written into place. With this scheme, a simple processor might take 4 cycles per instruction (CPI = 4)...

Instructions are executed one after the other inside the processor, right? Well, that makes it easy to understand, but that's not really what happens. In fact, that hasn't happened since the middle of the 1980s. Instead, several instructions are all partially executing at the same time.

How can this be? Obviously, there's more to it than just clock speed: it's all about how much work gets done in each clock cycle. Which leads to...

A 200 MHz MIPS R10000, a 300 MHz UltraSPARC and a 400 MHz Alpha 21164 are all about the same speed at running most programs, yet they differ by a factor of two in clock speed. A 300 MHz Pentium-II is also about the same speed for many things, yet it's about half that speed for floating-point code such as scientific number crunching. A PowerPC G3 at that same 300 MHz is somewhat faster than the others for integer code, but still far slower than the top 3 for floating-point. At the other extreme, an IBM POWER2 processor at just 135 MHz matches the 400 MHz Alpha 21164 in floating-point speed, yet it's only half as fast for normal integer programs.

The first issue that must be cleared up is the difference between clock speed and a processor's performance. They are not the same thing. Look at the results for processors of the recent past...

But be prepared: this article is brief and to-the-point. It pulls no punches and the pace is pretty fierce (really). Let's get into it...

Fear not! This article will get you up to speed fast. In no time you'll be discussing the finer points of superscalar-vs-VLIW, the brainiac debate and its relationship to IA64 and Itanium.

Okay, so you're a CS graduate and you did a hardware/assembly course as part of your degree, but perhaps that was a few years ago now and you haven't really kept up with the details of processor designs since then.