Intel hasn't been shy about its plans to challenge ARM in the low-power embedded space, and the world's largest chipmaker is gearing up for the debut of the 32nm process that will enable it to reach new levels of x86 power efficiency. But ARM isn't sitting still, and the British IP company took a major step last week in bringing the fight back to Intel by boosting its Cortex A9 processor up into Atom territory. One of the engineers behind Amazon Web Services is even eyeing the part as potential datacenter web server material.

On Wednesday, ARM announced the availability of IP for a 2GHz dual-core A9 processor on TSMC's 40nm process, which the company claims will offer massively more performance than Atom within a smaller power envelope. You'll recall that Cortex A9 is an out-of-order processor, so, unlike the in-order Atom, the A9 should have much better performance per clock on standard integer code. While some of ARM's claims about the performance delta between its 2GHz A9 and Atom may be overblown, the part at 40nm should still be more than competitive with a 32nm Atom in performance per watt.

Unfortunately for ARM's netbook ambitions, Linux is the only netbook OS that matters that runs on ARM, and the jury's still out on whether it can really take on Windows 7. As for Windows on ARM, it just ain't gonna happen, ever. (Same for Mac OS X on ARM... and please read that previous link before writing in to inform me that the ARM-based iPhone runs Mac OS X, or that WinMo runs on ARM.) Given that even the most insanely power-efficient, Atom-smashing 2GHz ARM netbook product is going to be relegated to whatever netbook niche Linux can carve out for it, it's worth asking what sort of future there is for such a high-powered ARM part.

One idea would be servers, believe it or not.

An ARM-based web server

The idea of building cheap but capable web servers from ARM parts has been enthusiastically floated by James Hamilton, a Vice President and Distinguished Engineer on the Amazon Web Services team. In a post earlier this month, Hamilton enthused about the idea of using a multicore, cache-coherent ARM SoC to do low-cost, power-efficient web hosting.

"The ARM is a clear win on work done per dollar and work done per joule for some workloads," Hamilton concludes. "If a 4-core, cache coherent version was available with a reasonable memory controller, we would have a very nice server processor with record breaking power consumption numbers."

Clearly, Wednesday's 2GHz A9 announcement was right up his alley, so he followed up with a post on the part suggesting that it may be what he's looking for. The post features a nice benchmark graph that he got from ARM, showing the 2GHz A9 doubling the performance of a 1.6GHz Atom N270 in the EEMBC CoreMark benchmark.

Those numbers are impressive, but before we talk performance, let's talk price.

Physicalization and Intel's margins

What Hamilton is essentially endorsing is "physicalization," an approach to server design that packs multiple, cheap, low-power systems into a single rack space. The name is a play on "virtualization," because instead of having one large, expensive system running multiple virtual machines, you use a fistful of small, cheap physical machines. The end effect of both is multiple OS instances packed into one rack space.

If you're thinking that physicalization is an odd use of Moore's Law, you're right. The only thing that makes the technique feasible from a performance per dollar perspective is the fact that Intel charges a fat premium for its higher-end server chips. Avoiding that premium is the sole reason that anyone would even consider using board-level integration (i.e., multiple chips and physicalization) instead of die-level integration (i.e., one Xeon and multiple VMs).

But given Intel's markup, and given a robust ARM ecosystem that keeps ARM prices relatively low, physicalization with something like a 2GHz A9 could well deliver more Linux OS instances per dollar than a regular Xeon-based server.
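To make the instances-per-dollar argument concrete, here is a minimal back-of-the-envelope sketch. Every number in it (node prices, node counts, instances per node) is a made-up placeholder chosen only to show the arithmetic, not a real price or a measured result, and the `ServerOption` class is a hypothetical helper, not anything from Hamilton's posts.

```python
from dataclasses import dataclass


@dataclass
class ServerOption:
    """One way of filling a rack space with OS instances (illustrative only)."""
    name: str
    cost_per_node_usd: float     # hypothetical price, NOT a real quote
    nodes_per_rack_space: int    # physical machines packed into one rack space
    os_instances_per_node: int   # 1 for physicalization, >1 for VMs

    def instances_per_rack_space(self) -> int:
        return self.nodes_per_rack_space * self.os_instances_per_node

    def instances_per_dollar(self) -> float:
        total_cost = self.cost_per_node_usd * self.nodes_per_rack_space
        return self.instances_per_rack_space() / total_cost


# Placeholder figures: eight cheap boards vs. one premium box running eight VMs.
physicalized = ServerOption("fistful of small ARM boards", 100.0, 8, 1)
virtualized = ServerOption("one Xeon box with VMs", 2000.0, 1, 8)

for option in (physicalized, virtualized):
    print(option.name,
          option.instances_per_rack_space(),
          option.instances_per_dollar())
```

Both options end up with the same number of OS instances per rack space; the comparison turns entirely on the price gap you plug in, which is why the approach only makes sense while the premium on high-end server chips stays large.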

For performance and performance/watt, there's more than just the core

If we grant Hamilton that ARM may turn out to be a cheaper way to pack OS instances of acceptable performance into a datacenter (at least until Intel lowers prices in response), it still doesn't necessarily follow (contra comments I see at Hamilton's blog and elsewhere) that ARM could just pack four or eight A9 cores onto a die, crank up the clockspeed, and slay Intel's Nehalem or AMD's Shanghai in performance/watt at a given absolute performance level. This is because performance/watt numbers are much higher for low-performance processors than they are for high-performance processors (GPUs excepted).

David Kanter at RealWorldTech has recently posted a great article comparing a number of CPUs and GPUs in performance/watt and performance/mm2 (die area). Atom is literally off the charts in performance/watt, besting Nehalem by some 3X. This isn't because Intel sprinkles magical performance/watt pixie dust on Atom—it's because high performance/watt ratios for individual chips are much easier to achieve at Atom-scale than at Xeon-scale, owing to the much larger amount of system-related complexity and overhead that goes with the Xeon's much higher level of integration and performance. As is the case with everything from dinosaurs to automobiles, it just costs more to be bigger and badder, and one of those costs is net energy efficiency.
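A toy model shows why the ratio tends to slide as chips get bigger. This is only a sketch under loud assumptions: the function `perf_per_watt` and all of its inputs (per-core performance, per-core watts, the "uncore" power for interconnect, cache, and I/O, and the multicore scaling factor) are hypothetical placeholders, not figures from Kanter's article or from any datasheet.

```python
def perf_per_watt(per_core_perf: float,
                  n_cores: int,
                  core_watts: float,
                  uncore_watts: float,
                  scaling_efficiency: float = 1.0) -> float:
    """Chip-level performance/watt in arbitrary units (toy model).

    uncore_watts stands in for the system-level overhead of a bigger chip:
    interconnect, shared cache, memory controller, high-bandwidth I/O.
    scaling_efficiency (0..1) models imperfect multicore scaling.
    """
    throughput = per_core_perf * n_cores * scaling_efficiency
    power = core_watts * n_cores + uncore_watts
    return throughput / power


# Placeholder numbers only: a small single-core part vs. a big quad-core part.
small = perf_per_watt(per_core_perf=1.0, n_cores=1,
                      core_watts=2.0, uncore_watts=0.5)
large = perf_per_watt(per_core_perf=3.0, n_cores=4,
                      core_watts=15.0, uncore_watts=40.0,
                      scaling_efficiency=0.9)

print(small, large)
```

In this sketch the big chip delivers far more absolute throughput, but its faster cores and its overhead power grow faster than its performance does, so its performance/watt comes out lower than the small chip's.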

It's also the case that for raw performance, interconnects and system architecture issues matter a great deal, and they matter more the more cores and other types of resources (like high-bandwidth I/O interfaces) you try to cram onto one die. The minute you put four or eight of any kind of core onto a single die and try to wire it all together with the best cache hierarchy and the optimal mix of I/O and memory bandwidth, then all of a sudden you're trying to solve a much harder problem than you are with a simple dual- or quad-core embedded chip. You're also playing a high-stakes game where one or two design mistakes could blow the whole configuration, and you're playing it on Intel's and AMD's home turf.

In the end, the era of cache-coherent multicore is fundamentally different from the single-core era that preceded it, because in that earlier, simpler era core-specific factors like microarchitecture and clockspeed were all that mattered. But nowadays, system design and microarchitecture relate to midrange and high-end multicore processor performance somewhat as oxygen and fuel relate to a flame's heat output: you need both of these elements tuned to give the desired result.

My ultimate point is that any four-core ARM desktop or server processor that shoots for the same absolute performance target as a four-core Nehalem processor will either look pretty much like a four-core Nehalem, or it won't hit the target. It will also have relatively similar performance/watt characteristics, and will end up competing with Intel and AMD on fab muscle.
