At an ISSCC session this afternoon, Sun's Marc Tremblay took to the podium to describe a processor that's as aggressive and musclebound as the pro wrestler who shares its name. Rock, as Sun's high-end SPARC processor project is called, sports 16 cores (32 threads) and runs at 2.3GHz in a power envelope that hits 250W (hence the "cookin'" part of the title).

Sun took a lot of architectural risks with this chip, some of which are probably at fault for the part's rumored delay, but to go into detail on everything that Tremblay packed into his short talk would expand this post unduly. So I'll just zero in on two aspects of Rock's design: out-of-order retirement and chip-level layout. These unique architectural and organizational features distinguish Rock from the rest of the high-end server fold, and make it one of the most theoretically interesting designs to come down the pike in a very long time.

Retirement, Rock-style

Probably the most interesting aspect of Rock, at least from a CPU geek's perspective, is that the processor features out-of-order retirement. You read that right: instead of reordering the instruction stream, executing it, and then putting it back in program order before retirement, Sun's processor actually retires some instructions out of program order.

Sun's rationale for rethinking the traditional out-of-order pipeline in such a radical manner is that the amount of logic required to track and retire instructions in order balloons to an untenable size if you're trying to use OOO to hide ever-increasing memory load latencies. So instead of expanding the instruction window to mammoth proportions, Sun adopted a unique approach that involves forking the instruction stream and executing it twice. Let me explain.



A Rock core's pipeline. Source: Sun

In Sun's new model, instructions enter the front end of the machine, where they're decoded before being dispatched and reordered like normal. But if the instruction stream stalls for a long time while waiting on main memory to load data, Rock first saves the thread's state in a "checkpoint" and then launches a "scout thread" that runs ahead of the (now stalled and saved) main thread. This scout thread (ST) is a hardware entity that's totally invisible to the operating system, hypervisor, or whatever else has control of the processor, and in the course of execution it predicts and resolves branches, prefetches code and data, and saves its speculative state in a shadow register file. Most importantly, the scout thread can actually retire some of the instructions that it speculatively executes.

If an instruction has no dependencies and can be retired out-of-order without causing the program state to be incorrect, the scout thread will go ahead and retire it. Instructions that can't be retired are pushed into a 128-entry "deferred queue," a low-power SRAM structure that saves execution and dependency information. When the main thread finally gets the load data that it has been waiting on and becomes unstuck, it moves along behind the scout thread and executes the instructions that have been put in the deferred queue.

If the scout thread stalls due to a long load latency, then the main thread can actually catch up to it and pass it, at which point it becomes the "scout thread" and the former "scout thread"—now stalled and saved as a checkpoint—becomes the "main thread." The Rock can save up to eight checkpoints per thread, which means that the scout and main threads can fork, join, and leapfrog each other in this fashion up to eight times.
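To make the mechanism concrete, here's a toy model (my sketch, not Sun's implementation) of what the scout thread does when it runs ahead of a stalled load: instructions that depend, directly or transitively, on the missing load value go into the deferred queue, while independent instructions retire out of order. The register names and trace are hypothetical.

```python
from collections import deque

DEFERRED_QUEUE_ENTRIES = 128  # the deferred-queue size Sun quoted


def scout_run(instructions, stalled_reg):
    """Partition a straight-line instruction trace into instructions the
    scout thread can retire out of order and ones it must defer.

    instructions: list of (dest_reg, src_regs) tuples.
    stalled_reg: register whose load has missed all the way to memory.
    """
    tainted = {stalled_reg}  # registers whose values aren't available yet
    deferred = deque(maxlen=DEFERRED_QUEUE_ENTRIES)
    retired = []
    for dest, srcs in instructions:
        if tainted.intersection(srcs):
            # Depends on the missing load (directly or transitively):
            # park it in the deferred queue for the main thread to replay.
            deferred.append((dest, srcs))
            tainted.add(dest)
        else:
            # Independent of the stall: safe to retire out of order.
            retired.append((dest, srcs))
            tainted.discard(dest)  # dest now holds a known value
    return retired, list(deferred)


# Hypothetical trace: r1 is waiting on a cache miss.
trace = [
    ("r2", ["r1"]),        # needs the missing load -> deferred
    ("r3", ["r4", "r5"]),  # independent -> retired out of order
    ("r6", ["r2"]),        # transitively dependent -> deferred
    ("r7", ["r3"]),        # independent -> retired
]
retired, deferred = scout_run(trace, "r1")
```

This glosses over the real hardware (branch resolution, the shadow register file, checkpoint rollback), but it captures the core split: retire what you can, defer what you can't.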

Remixing the execution units/storage/latency relationship

Tremblay claims that the advantage of this "main thread plus scout thread" approach is that it replaces the traditional instruction window with this much smaller deferred queue, a queue that's implemented in a much more power-efficient SRAM structure. Of course, what Tremblay didn't say is that the tradeoff for this is that you actually have to implement a shadow register file for each speculative thread that the processor is capable of executing. Because Rock can save eight checkpoints' worth of state per thread, that's at least eight copies of the integer register file per thread (it's not clear to me if Rock does this for floating-point instructions or not, but probably not) for a total of 32 threads x 8 files = 256 copies of the integer register file per chip.
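A quick back-of-the-envelope run on that figure (my numbers, not Sun's): the register file size below is an assumption, chosen as a plausible lower bound for an integer file, so treat the result as a floor rather than the actual budget.

```python
# Checkpoint storage math: 8 checkpoints/thread x 32 threads = 256
# shadow copies of the integer register file.
checkpoints_per_thread = 8
threads = 32
regs_per_file = 32   # assumed: 32 architectural integer registers
bits_per_reg = 64    # assumed: 64-bit SPARC registers

copies = checkpoints_per_thread * threads      # 256 register file copies
total_bits = copies * regs_per_file * bits_per_reg
total_kib = total_bits // 8 // 1024            # 64 KiB minimum, chip-wide
```

Sixty-four kilobytes of flop or SRAM storage isn't huge by itself, but unlike a cache it has to be wired into every core's bypass and checkpoint-restore paths, which is where the real cost lies.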

Ultimately, that's a whole lot of register file space required to make this work, and what it means is that Sun has taken a novel approach to reimagining the relationship between execution units, on-chip storage, and memory latency that may or may not pay off. What I mean by that statement is this: instead of just loading up on simple cache space (e.g., Intel's Tukwila), they've moved more storage out into the chip in the form of these register file copies, and they're using that storage to hide memory latency with these prefetching and speculative execution tricks.

(As an aside, it's worth noting that this degree of dynamic, run-time speculation, and the amount of overhead that it entails, is the exact philosophical opposite of Intel's EPIC/VLIW-style approach with Itanium. Itanium relies for its performance on static, compile-time reordering of the instruction stream in combination with loads of cache and execution hardware.)

Rock's layout

The micrograph below shows Rock's overall layout: the 16 cores are grouped in four clusters of four cores each, where the clusters are connected by a central crossbar switch.



Die micrograph of Rock. Source: Sun.

The 65nm Rock has an overall die size of 396mm², and it dissipates 250W of power at 2.3GHz. The individual, four-issue cores are only 14mm² and dissipate 10W each. All four cores in a cluster share among them a number of components that are typically the domain of one core only. Specifically, all four cores in a cluster share a single I-cache, two FPUs, and two data caches.
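It's worth sanity-checking how those per-core numbers add up against the whole chip; the arithmetic below uses only the figures from the talk, and the split of the remainder into "uncore" (crossbar, caches, I/O) is my inference, not something Tremblay itemized.

```python
# Die and power budget from the ISSCC figures.
cores = 16
core_area, die_area = 14, 396     # mm^2
core_power, chip_power = 10, 250  # W

core_area_total = cores * core_area           # 224 mm^2 in cores
uncore_area = die_area - core_area_total      # 172 mm^2 left over
core_power_total = cores * core_power         # 160 W in cores
uncore_power = chip_power - core_power_total  # 90 W left over
```

In other words, well over 40 percent of the die and roughly a third of the power budget sit outside the cores themselves.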



A Rock core cluster. Source: Sun.

Tremblay claims that the FPU sharing alone saves about 8 percent on die area and about 30W of power.

In the end, I can't say that I'm really sold on Sun's very aggressive use of speculative execution, but I will say that Rock is one of the most interesting and novel processors that I've seen in 10 years of covering this space. In its own way, it's every bit as exotic as IBM's Cell processor, but because all of that exoticism is hidden from the programmer it won't be nearly as difficult for developers to deal with.

Sun's official line is that Rock will be out later this year, but rumored delays put the chip out sometime in 2009.