On Wednesday, ARM formally unveiled its next-generation smartphone processor, the Cortex A7, codenamed “Kingfisher.” But there was much more to the A7's launch than just the unveiling of a new processor architecture for smartphones. The chip company also announced plans to pair the A7 with the much larger and more powerful Cortex A15 in phones and tablets, using a technique called heterogeneous multiprocessing (or “big.LITTLE,” as ARM prefers to call it) to dynamically move lighter workloads from the larger, more power-hungry A15 to the leaner A7 in order to extend mobile battery life.

When used in a dual-core configuration, the A7 will bring the performance characteristics of what is currently a $500 phone to the $100 “feature phones” of 2013. These future feature phones will have the same capabilities as today's high-end smartphones, but they'll have the low prices and long battery life that the feature phone market demands. For the high-end “superphones” and tablets of 2013, the A7 will be paired with the much larger and more powerful A15 core to yield a processor that sips power like a feature phone when all you're doing is some light Web surfing, but can crank up the juice when you're gaming.

The Cortex A7

ARM claims that the A7 will double the performance of its existing Cortex A8 family through a combination of process shrinks and improvements at the level of microarchitecture. Or, as ARM processor division chief Mike Inglis put it at the launch event, “Outpacing Moore's Law with microarchitectural innovation is what we've been working on with A7 as a product.” Though Inglis never mentioned this specifically, I learned that we can actually thank Google's “open” smartphone OS, Android, for some of that innovation.

The A7's design improvements over the older A8 core are possible because ARM has had the past three years to carefully study how the Android OS uses existing ARM chips in the course of normal usage. Peter Greenhalgh, the chip architect behind the A7's design, told me that his team did detailed profiling in order to learn exactly how different apps and parts of the Android OS stress the CPU, with the result that the team could design the A7 to fit the needs and characteristics of real-world smartphones. So in a sense, the A7 is the first CPU that's quite literally tailor-made for Android, although those same microarchitectural optimizations will benefit any other smartphone OS that uses the design.

The high-level block diagram for the A7 released at the event reveals an in-order design with an 8-stage integer pipeline. At the front of the pipeline, ARM has added three predecode stages, so that the instructions in the L1 are appropriately marked up before they go into the decode phase. Greenhalgh told me that A7 has extremely hefty branch prediction resources for a design this lean, so I'm guessing that the predecode phase involves tagging the branches and doing other work to cut down on mispredicts.

(Note that branch prediction is one of the best places to spend transistors, because it buys you not only greatly improved performance but also improved power efficiency. The power of branch prediction for boosting performance/watt was one of the major revelations that Intel's Banias (Pentium M) team first brought to the Intel product line. So it makes sense that the A7 has gone all-out here.)

After the decode phase, up to two instructions per cycle can issue through the machine's five issue ports to its execution core. This execution core includes an asymmetric pair of integer arithmetic-logic unit (ALU) pipes, where one pipe is a full ALU and the other is limited to simpler operations. There's also a multiply pipe for complex integer operations, a NEON pipe for floating-point and SIMD ops, and a Load/Store pipe for memory ops.
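To make the dual-issue idea concrete, here is a rough Python sketch of how an in-order core with these five pipes might decide whether two adjacent instructions can go out in the same cycle. ARM has not published the A7's actual pairing rules, so everything here is an assumption for illustration: the op class names (`int_simple`, `int_full`, `mul`, `fp_simd`, `mem`), the mapping of op classes to pipes, and the rule that two instructions pair whenever each can get a different pipe. Data dependencies between the two instructions are ignored entirely.

```python
# Pipes named in the article. Which op classes each pipe accepts is an
# assumption made for this sketch, not a published ARM specification.
PIPES = {
    "alu_full": {"int_simple", "int_full"},  # full ALU takes any ALU op
    "alu_simple": {"int_simple"},            # second ALU: simpler ops only
    "multiply": {"mul"},
    "neon_fp": {"fp_simd"},
    "load_store": {"mem"},
}


def can_dual_issue(first, second):
    """Return True if the two op classes can each get a different pipe."""
    for pipe1, accepts1 in PIPES.items():
        if first not in accepts1:
            continue
        for pipe2, accepts2 in PIPES.items():
            if pipe2 != pipe1 and second in accepts2:
                return True
    return False


def count_cycles(ops):
    """Count issue cycles for an in-order stream of op classes.

    Adjacent ops are paired when possible; otherwise one op issues alone.
    Dependencies and stalls are ignored (a simplifying assumption).
    """
    cycles = 0
    i = 0
    while i < len(ops):
        if i + 1 < len(ops) and can_dual_issue(ops[i], ops[i + 1]):
            i += 2  # both instructions leave in the same cycle
        else:
            i += 1  # the second instruction waits for the next cycle
        cycles += 1
    return cycles
```

Under these assumed rules, two simple integer ops can pair (one on each ALU pipe), but two memory ops cannot, because there is only one Load/Store pipe. That is the basic tradeoff of an asymmetric design: you get most of the benefit of dual issue without paying for two copies of every unit.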

The feature set for the A7 is identical to that of the Cortex A15—this is critical, because when A7 is paired with A15 in a big.LITTLE configuration the two cores have to be identical from a software perspective.

big.LITTLE: Wave of the future, or compromise?

As important as the launch of a new core design is, ARM's heterogeneous multiprocessing plans are perhaps the biggest news to come out of Wednesday's event. big.LITTLE links a dual-core A15 and a dual-core A7 with a cache-coherent interconnect, and it covers the pair with a layer of open-source firmware that dynamically moves tasks among the cores depending on those tasks' performance and power needs.

The OS doesn't actually need to be modified or to be at all aware of the smaller A7 cores in order to take advantage of the technology. All popular mobile and desktop OSes now ship with dynamic voltage and frequency scaling (DVFS) capabilities, so that they can tell the CPU when they need more horsepower and when they need less. For lighter workloads, a typical CPU responds to the OS's signal by lowering its operating frequency and voltage, thereby saving power; for heavier workloads, it can temporarily raise its frequency and voltage to provide a performance boost. The open-source firmware layer that will sit between the OS and a big.LITTLE chip can take these standard signals and, instead of downclocking the A15 when the OS asks for less horsepower, simply move the workload onto the A7 cores. So while it will be possible to modify an OS to be big.LITTLE-aware, doing so isn't necessary in order to take advantage of the capability.
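Here is a minimal Python sketch of that idea: a toy stand-in for the firmware layer that receives ordinary DVFS-style requests and, below a certain performance level, answers them by switching clusters rather than by downclocking the A15. This is not ARM's firmware or its interface. The `Switcher` class, the `request_level` method, and every frequency in the table are hypothetical, chosen only to show the shape of the mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    cluster: str   # "A7" or "A15"
    freq_mhz: int  # hypothetical clock speed, not an ARM figure


# Hypothetical table ordered from lowest to highest performance. The A7
# entries sit below the A15 entries, so the OS sees one long DVFS range.
OPERATING_POINTS = [
    OperatingPoint("A7", 400),
    OperatingPoint("A7", 800),
    OperatingPoint("A7", 1000),
    OperatingPoint("A15", 1000),
    OperatingPoint("A15", 1500),
    OperatingPoint("A15", 2000),
]


class Switcher:
    """Toy model of a layer sitting between the OS and the two clusters."""

    def __init__(self):
        self.current = OPERATING_POINTS[-1]  # start at full performance
        self.migrations = 0                  # cluster-to-cluster moves so far

    def request_level(self, level):
        """Handle a DVFS-style request from the OS.

        `level` indexes the table and is clamped to the valid range. The OS
        only ever asks for "more" or "less"; it never names a cluster.
        """
        level = max(0, min(level, len(OPERATING_POINTS) - 1))
        target = OPERATING_POINTS[level]
        if target.cluster != self.current.cluster:
            self.migrations += 1  # workload moves to the other cluster
        self.current = target
        return target
```

In this sketch, a request for level 0 from a freshly created `Switcher` lands on the A7 cluster and counts as one migration, while a later request for level 1 stays on the A7 and is an ordinary frequency change. The point is that the caller's interface is identical in both cases, which is why an unmodified OS can drive it.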

To take a step back, there are two ways to look at big.LITTLE. The first way is to go with ARM's angle, which is that heterogeneous multiprocessing gives you the best of both worlds by letting you scale processor frequency and voltage to much lower levels than would otherwise be possible by simply moving the load to a leaner core. Take a look at the power vs. performance chart below, and you can see that in the big.LITTLE configuration A7 essentially extends A15's DVFS curve to much lower levels.

I have a lot of sympathy for this angle, because heterogeneous multiprocessing represents a power-efficient use of Moore's Law that is becoming increasingly popular. Heterogeneous multiprocessing first cropped up as the Next Big Thing on the server side many years ago with Sun's ill-fated MAJC architecture. Then there was the AMD/ATI merger, at which point AMD started talking about heterogeneous multiprocessing and “accelerated processing units” (APUs) instead of the traditional CPU/GPU division. More recently, Intel has been talking up the potential of heterogeneous multiprocessing in the cloud.

On the client and consumer side, heterogeneous multiprocessing made its big debut in the PlayStation 3's Cell processor. Marvell has been using this approach for over a year, and NVIDIA's “Kal-El” ARM chip uses it as well.

So as an overall approach to boosting power efficiency and even raw performance, ARM's big.LITTLE has been quite thoroughly validated across the industry. Indeed, ARM is actually late to this particular party. All of this is to say that I have no doubt that heterogeneous MP is going to do good things for the smartphone space, because it's one of the most widely recognized Good Ideas for what to do with the embarrassment of cheap transistor riches that Moore's Law has given us.

But then there's my more pessimistic side, which thinks that, in addition to being a good idea, big.LITTLE is also a bit of a hack that was necessitated by a combination of ARM's server ambitions and its constrained engineering resources. Back before A15 was publicly launched, I began hearing from sources in the semi industry who were privy to the details of the design and who weren't particularly pleased at some of the tradeoffs that ARM made. The scuttlebutt was that ARM was clearly gunning for the cloud server market with this chip (the same “microserver” space that Intel is now attacking with part of its Atom line), and some of A15's design decisions were going to hurt it in the tablet space.

When A15 was unveiled, it was clearly a very robust, full-featured, out-of-order design that was intended to compete in the server and desktop markets with Intel CPUs. Of course, ARM will be able to fit this design into tablets and phones, especially at the right process node. My only point is that the company could have done a more straightforward and mobile-friendly iteration of the A9 if it either 1) didn't have one eye on the server space with A15, or 2) had the resources to do both a full-blown server part and a high-end smartphone/tablet part at the same time.

In this light, big.LITTLE can be seen as ARM's attempt to have its cake and eat it, too. It gets to address the high end by cramming as much hardware as it can into the A15 while still calling it a mobile design, but in smartphone usage situations where A15 will be overkill, the much smaller, leaner A7 will be there to take over and conserve battery life.