After weeks of teasing, ARM today unveiled a pile of technical details, pipeline included, about its new flagship multiprocessor core, the ARM Cortex-A15 MPCore, at the ARM Technology Conference in Santa Clara. Clearly aimed at high-end applications up to and including servers and networking gear, the ARM Cortex-A15 spans a wide performance range, as you can see in this graph.

This graph illustrates several key aspects of the ARM Cortex-A15’s design targets. First, you can see that the intended IC process technology is 28nm. Although the ARM Cortex-A15 is fully synthesizable and will therefore work in any process technology, ARM doesn’t expect any SOC designers to put the processor in a process technology larger than 32nm because the core is roughly twice as big as an ARM Cortex-A9 in a comparable process technology. Second, you can see that the clock frequency for the Cortex-A15 scales from well below 1GHz to about 2.5GHz. ARM is still being cagey about power numbers.

Next, you might want to know about the ARM Cortex-A15’s performance numbers relative to other ARM processor cores. This next graph will help you out.

Perhaps the most relevant comparison here is a 1GHz ARM Cortex-A9 versus a 1GHz ARM Cortex-A15. Here you see that the ARM Cortex-A15 delivers about 40% more integer performance per MHz than the ARM Cortex-A9. Also note that memory performance is twice as good. Speaking of memory, the ARM Cortex-A15 has parity and ECC on both the L1 and L2 caches. Although ARM added this feature for enterprise-class applications, an obviously experienced enterprise-class architect sitting next to me in the presentation said that unless the internal buses also had parity checking, he could not consider the ARM Cortex-A15 an “enterprise-class” processor core. That’s OK, because this core is clearly suitable for a wide range of applications.
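To put the per-MHz figure in perspective, here is a back-of-the-envelope sketch in Python. It assumes, naively, that integer performance scales linearly with clock frequency, which real systems rarely manage once memory latency gets involved. The function name and the linear model are my own illustration, not anything ARM presented.

```python
# Naive estimate of Cortex-A15 integer performance relative to a 1GHz Cortex-A9.
# Assumption (mine, not ARM's): performance scales linearly with clock rate.

A15_PER_MHZ_GAIN = 1.4  # "about 40% more integer performance per MHz"


def relative_integer_perf(a15_mhz: float, a9_mhz: float = 1000.0) -> float:
    """Estimated A15 integer performance as a multiple of the A9's."""
    return A15_PER_MHZ_GAIN * a15_mhz / a9_mhz


# Sweep part of the clock range quoted for the Cortex-A15.
for mhz in (1000, 1500, 2000, 2500):
    estimate = relative_integer_perf(mhz)
    print(f"A15 at {mhz} MHz: roughly {estimate:.2f}x a 1GHz A9")
```

Treat the result as a ceiling rather than a prediction, since the only measured data point in the talk was the 1GHz comparison.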

As previously announced, the ARM Cortex-A15 supports 40-bit (1Tbyte) memory addresses. This feature is extremely important for scalable, coherent, multiprocessor systems, one of the key target application niches planned for the Cortex-A15 core. However, the 32-bit ARM instruction set isn’t changed, so each of the individual processor cores in an ARM Cortex-A15 cluster operates with a 32-bit address space. ARM is providing an SCU (Snoop Control Unit) as part of the integrated L2 cache. One L2 cache/SCU serves one to four CPUs, and this block translates the 32-bit processor addresses into 40-bit global addresses. The following block diagram gives a better idea of how all these blocks relate to each other.
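The arithmetic behind those address widths is easy to check. The short Python sketch below only computes the sizes implied by 32-bit and 40-bit addresses. It says nothing about how the translation itself works, and the helper name is mine.

```python
# Address-space sizes implied by the address widths mentioned above.

GIB = 1 << 30  # bytes in a gibibyte
TIB = 1 << 40  # bytes in a tebibyte


def address_space_bytes(bits: int) -> int:
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 1 << bits


per_core = address_space_bytes(32)      # what a 32-bit processor address can span
global_space = address_space_bytes(40)  # what a 40-bit global address can span

print(per_core // GIB, "Gbytes per 32-bit address space")
print(global_space // TIB, "Tbyte in the 40-bit address space")
print(global_space // per_core, "32-bit address spaces fit in the 40-bit space")
```

The extra eight address bits multiply the reachable memory by 256, which is where the jump from 4Gbytes to 1Tbyte comes from.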

From this block diagram, you can see that four A15 CPUs in a cluster, each with its own L1 caches, talk to the cluster’s lone L2 cache and SCU. The L2 cache and SCU make up a single unit, which translates 32-bit addresses into 40-bit addresses and controls the MPCore’s access to the rest of the system through a 128-bit AMBA 4 bus. ARM has another IP core called the CoreLink CCI-400 Cache Coherent Interconnect that links multiple ARM Cortex-A15 MPCore clusters and can communicate with coherent I/O devices through yet another IP block called the MMU-400. The initial version of the CCI-400 will support connection to one or two ARM Cortex-A15 MPCore clusters and as many as three MMU-400 I/O blocks. I/O blocks can snoop processing cluster caches but processors cannot snoop coherent I/O devices.
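The limits quoted for the initial CCI-400 amount to a small set of configuration rules, sketched here in Python. The class and function names are hypothetical, and I am assuming that a system with zero MMU-400 blocks is legal, since the talk only gave the upper bound.

```python
from dataclasses import dataclass
from typing import List

# Limits as described in the presentation.
MAX_CPUS_PER_CLUSTER = 4  # one L2 cache/SCU serves one to four CPUs
MAX_CLUSTERS = 2          # initial CCI-400: one or two MPCore clusters
MAX_MMU400_BLOCKS = 3     # initial CCI-400: as many as three MMU-400 blocks


@dataclass
class SystemConfig:
    """Hypothetical description of a CCI-400-based system."""
    cpus_per_cluster: List[int]  # one entry per Cortex-A15 MPCore cluster
    mmu400_blocks: int           # number of coherent I/O blocks


def validate(cfg: SystemConfig) -> List[str]:
    """Return a list of rule violations; an empty list means the config fits."""
    problems = []
    if not 1 <= len(cfg.cpus_per_cluster) <= MAX_CLUSTERS:
        problems.append(f"need 1 to {MAX_CLUSTERS} clusters")
    for index, cpus in enumerate(cfg.cpus_per_cluster):
        if not 1 <= cpus <= MAX_CPUS_PER_CLUSTER:
            problems.append(f"cluster {index}: need 1 to {MAX_CPUS_PER_CLUSTER} CPUs")
    # Assumption: zero MMU-400 blocks is allowed; only the maximum was stated.
    if not 0 <= cfg.mmu400_blocks <= MAX_MMU400_BLOCKS:
        problems.append(f"need 0 to {MAX_MMU400_BLOCKS} MMU-400 blocks")
    return problems


print(validate(SystemConfig(cpus_per_cluster=[4, 4], mmu400_blocks=3)))
print(validate(SystemConfig(cpus_per_cluster=[4, 4, 2], mmu400_blocks=4)))
```

The first configuration is the largest system the initial CCI-400 supports as described: eight CPUs in two clusters plus three coherent I/O blocks. The second breaks both the cluster and the MMU-400 limits.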

When ARM’s architects went looking for ways to improve the performance of their existing processor cores, they focused primarily on instructions per clock (IPC), realizing that the big clock-frequency gains from process scaling are pretty much all in the past. So they concentrated on the following:

Improved branch prediction

Wider pipelines for higher instruction throughput

A larger instruction window for out-of-order instruction scanning

More instructions in the pool of instructions that can be executed out of order

Better integration of the NEON SIMD vector processor and FPU

Better FPU performance

Better memory performance

Here are the results of that optimization effort.

As you can see, general-purpose performance is nearly double and floating-point performance is nearly 7x. (The presenter admitted that the Cortex-A8’s FPU was relatively slow.)

Enough of the tease! What does this thing’s pipeline look like?

Here it is:

The ARM Cortex-A15 pipeline is divided in two. The first half, 12 stages, consists of a 5-stage fetch pipe and a 7-stage decode/rename/dispatch pipe. At the end of these 12 stages, the instructions are ready to be issued. The ARM Cortex-A15 dispatch unit can issue as many as eight instructions per clock into the execution/writeback stages. There are five types of instruction-execution clusters:

Simple ALU instructions (generally 1 cycle for issue, 1 cycle for execution, 1 cycle for writeback)

Branch instructions (3 cycles for issue/execute/writeback)

NEON/FPU instructions (as many as 12 cycles for issue/execute/writeback)

Multiply (6 cycles for issue/execute/writeback)

Load/Store instructions (6 cycles for issue/execute/writeback)

There are two execution units for simple instructions and the NEON and FPU can also operate separately. There are also two load/store units so the ARM Cortex-A15 can simultaneously execute a load and a store.
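Adding the front end to each execution cluster gives a rough end-to-end pipeline depth per instruction type. The Python sketch below does that sum using only the cycle counts listed above. Treating every cycle as one pipeline stage is my simplification, and the names are illustrative.

```python
# Rough end-to-end pipeline depth per instruction type.
# Simplifying assumption (mine): each issue/execute/writeback cycle is one stage.

FETCH_STAGES = 5
DECODE_RENAME_DISPATCH_STAGES = 7
FRONT_END_STAGES = FETCH_STAGES + DECODE_RENAME_DISPATCH_STAGES  # 12

# Cycles for issue/execute/writeback in each execution cluster.
BACK_END_CYCLES = {
    "simple ALU": 1 + 1 + 1,  # issue, execute, writeback
    "branch": 3,
    "NEON/FPU": 12,           # upper bound ("as many as 12 cycles")
    "multiply": 6,
    "load/store": 6,
}


def total_depth(kind: str) -> int:
    """Front-end stages plus back-end cycles for one instruction type."""
    return FRONT_END_STAGES + BACK_END_CYCLES[kind]


for kind in BACK_END_CYCLES:
    print(f"{kind}: {total_depth(kind)} stages")
```

Under that simplification, the pipeline runs from 15 stages for simple ALU and branch instructions to as many as 24 for NEON/FPU instructions.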

The ARM Cortex-A15 talk was standing room only so I sat on the floor. I got a good view of the nearby footwear, as you can see.

There’s plenty more information that I’m sure will be rolling out over the next few days. In all, this is a truly impressive offering from ARM. And that’s our “shoe” for tonight.