Two decades ago, the US high-end microprocessor industry was a lively, diverse market where about five different instruction set architectures battled it out across the workstation and server fields. You had choices like DEC’s Alpha – the speed leader; MIPS – the heart of Silicon Graphics; SPARC from Sun Microsystems; IBM POWER; HP PA-RISC; the nascent X86; and a few custom architectures for massively parallel processing (MPP). The rest of the world pretty much had nothing – the British Transputer and German Hyperstone platforms died out due to lack of funding, while ARM was still keeping to the low-end embedded arena after the end of the Acorn RISC Machine, the follow-up to the BBC Micro home desktop. The Far East had nothing save NEC / Fujitsu / Hitachi custom vector processors for niche high-end machines.

Come back to today: US server market diversity is long-gone history, right up there in the forgotten mists of lost antediluvian civilizations and such. Basically, for the past decade or so, all you have had is X86, specifically Intel Xeons, at whatever development and pricing points Intel decides to offer them. IBM POWER and Oracle SPARC do have some token presence, while ARM, the weakest RISC architecture compute-wise, has some nascent but still immature entries in the server field, mostly far behind Xeons in compute power or bandwidth.

Cross the Big Pond to the Middle Kingdom, which was nowhere on the radar two decades ago… What a change: you get about everything the USA had in the Golden ‘90s – there’s Alpha development, right at its rightful No.1 TOP500 place through ShenWei Alpha. There’s LoongSon MIPS in appliance servers, backdoor-free (at least no Western backdoors, that is). There’s PowerCore CP1 (full POWER8) and the coming CP2 (POWER9 with China-specific mods). Then, there are SPARC (FeiTeng) and ARM (Phytium Mars) efforts, both linked to some of China’s leading supercomputers. And to leave this Chinese sugar for the end, Tianjin now also has a company designing AMD Zen derivatives, coming soon and likely a target for China’s own X86 supercomputer in due course.

The Mars family are, of course, just the most powerful among new ARM-related server processors coming from China: Qualcomm and Huawei are also in the game at the mainstream segment with their more general purpose entries in the coming year. The Phytium Mars CPU is the subject of this story.

Background

Phytium, a company with bases in the southern metropolis of Guangzhou and the northern port city of Tianjin, is a result of work that mostly comes from China’s indigenous high-end CPU development teams. These teams have worked on implementations of various RISC architectures, the best known among them being the FeiTeng SPARC-based I/O interconnect processors with up to 128 threads, used in local supercomputer I/O interconnects to offload the MPI traffic from compute CPUs.

Over time, just like Fujitsu, this team also decided to move from SPARC to ARM for the next generation of their HPC CPUs. The effort had to be drastically accelerated once it became clear last year that the US government would block Intel from selling its high-end CPUs and accelerators to the key China HPC sites – a futile attempt, since China was already almost on a par with its own designs by then.

The First Phytium CPU

Phytium’s first ARM design is the “Mars” CPU, the world’s first 64-core ARM processor. It was first described at HotChips 2015; this year’s HotChips 2016, where Phytium was a Platinum Sponsor, saw the production system demo of what is right now the fastest ARM CPU, exceeding the 54-core Cavium design and any other server ARM platform unveiled or known right now, including the AMD one. Also, this year Phytium engineers had no problems obtaining U.S. visas; such problems were one of the big scandals of last year.

First, let’s address a few quick points for naysayers who do not believe in Chinese technology: for a start, this is a certified, fully compliant ARMv8 processor. Whatever runs on any other ARMv8 (save custom extensions) will run on Mars. Secondly, these are not lightweight cores such as those on Xeon Phi – we’re talking about full 4 ops/cycle heavy cores with 128-bit NEON-style SIMD floating point units. Finally, other key server-class features like full ECC paths and very high memory and I/O bandwidth are there as well – all that even though the 28nm process technology is still two generations behind what is used by Intel, for instance.

The initial FT2000/64 SKU is a big 640 mm² die in a 3000-pin FC-BGA package. The 2 GHz 64-core model has a declared 120 W TDP and a typical power consumption (akin to AMD’s ACP) of around 100 W. This figure is actually very good for a CPU that either matches or exceeds an 18-core Intel Broadwell E5v4 Xeon in both integer and FP performance, at slightly lower power usage despite the two-generation semiconductor process disparity stated above.

Each FTC661 processor core is a 4-instruction-per-cycle out-of-order engine with 6 function units and 4 DP FP ops per cycle, plus 2×32 KB L1 cache with 4-cycle load-to-use latency, likely the fastest ARM core today by itself. According to their HotChips presentation, the next core will have even greater per-core instruction parallelism, multithreading and wider SIMD, based on the new ARM V8A extension standard that allows beyond 512-bit SIMD.
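To put those per-core numbers in perspective, here is a quick back-of-the-envelope sketch of the theoretical double-precision peak. It uses only figures quoted in this story (64 cores, 2 GHz, 4 DP FP ops per cycle per core, 120 W TDP); the function name is my own, and real application throughput will of course land below this ceiling.

```python
# Figures quoted in the article for the FT2000/64.
CORES = 64
CLOCK_GHZ = 2.0
DP_OPS_PER_CYCLE = 4   # "4 DP FP ops per cycle" per core
TDP_WATTS = 120.0      # declared TDP


def peak_dp_gflops(cores, clock_ghz, ops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x DP ops per cycle, in GFLOPS."""
    return cores * clock_ghz * ops_per_cycle


per_core = peak_dp_gflops(1, CLOCK_GHZ, DP_OPS_PER_CYCLE)
per_socket = peak_dp_gflops(CORES, CLOCK_GHZ, DP_OPS_PER_CYCLE)
gflops_per_watt = per_socket / TDP_WATTS

print(f"peak per core:   {per_core:.0f} GFLOPS")
print(f"peak per socket: {per_socket:.0f} GFLOPS")
print(f"peak per watt:   {gflops_per_watt:.2f} GFLOPS/W at TDP")
```

This is a paper ceiling only: it assumes every core retires its full 4 DP operations on every cycle.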

The CPU chip itself consists of 16 four-core blocks, each having a 2 MB L2 cache (total 32 MB L2 – not L3, mind you!) and directory-based hardware cache coherency able to support well beyond 64 cores per socket in the future. This also means that, even though this CPU uses a 2-D mesh instead of Intel’s ring bus, the overall latencies are competitive, as there are only 2 levels of on-chip cache for similar end capacity compared to 3 on the Xeon.

However, there is still an even larger L3 cache outside the chip: following and expanding the approach seen in some of the earlier POWER CPUs, Mars has 8 parallel proprietary external memory extension interfaces, each connecting to a controller chip with 16 MB L3 cache and two DDR3-1600 memory buses. In total, you get 128 MB L3 and 16-channel DDR3-1600 memory. While right now the STREAM memory benchmark may give you only about 90 GB/s out of a theoretical 204.8 GB/s bandwidth, keep in mind that this figure for a single-socket system – at this early stage, prior to further optimizations – is within 15% of Intel’s dual-socket E5 Haswell or Broadwell Xeon system. After all, DDR3 is a proven platform with many latency and bandwidth tuning possibilities.
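For those who like to check the arithmetic, the totals above can be reproduced with a few lines of Python. This is just a sketch built from the figures in this story; the one thing I am assuming is a standard 64-bit (8-byte) DDR3 data bus per channel, which is what makes the quoted 204.8 GB/s come out.

```python
# Figures quoted in the article.
MEM_INTERFACES = 8             # external memory extension interfaces
L3_PER_CONTROLLER_MB = 16      # L3 cache on each controller chip
DDR_BUSES_PER_CONTROLLER = 2   # DDR3-1600 buses per controller chip
MEGA_TRANSFERS_PER_SEC = 1600  # DDR3-1600
STREAM_MEASURED_GBS = 90.0     # early STREAM result

# Assumption (not stated in the article): 64-bit data bus per DDR3 channel.
BYTES_PER_TRANSFER = 8

total_l3_mb = MEM_INTERFACES * L3_PER_CONTROLLER_MB
channels = MEM_INTERFACES * DDR_BUSES_PER_CONTROLLER
per_channel_gbs = MEGA_TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1000
peak_gbs = channels * per_channel_gbs
stream_efficiency = STREAM_MEASURED_GBS / peak_gbs

print(f"external L3:       {total_l3_mb} MB")
print(f"DDR3 channels:     {channels}")
print(f"peak per channel:  {per_channel_gbs:.1f} GB/s")
print(f"peak per socket:   {peak_gbs:.1f} GB/s")
print(f"STREAM efficiency: {stream_efficiency:.1%}")
```

In other words, the early STREAM number uses a bit under half of the paper bandwidth, which is where the tuning headroom comes from.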

More importantly, as memory gets tuned, it is realistic to expect 75% efficiency, or about 153 GB/s – on the level of upcoming 32-core AMD “Naples” processors with their 8-channel DDR4-3200 memory controller.
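A small sketch of that projection, for the record. The 204.8 GB/s peak, the 90 GB/s STREAM result and the 75% target are from this story; the 8-byte channel width used for the “Naples” paper peak is my assumption, and no sustained Naples figure is claimed here.

```python
MARS_PEAK_GBS = 204.8      # theoretical peak quoted in the article
MARS_STREAM_NOW_GBS = 90.0 # early STREAM result quoted in the article
TARGET_EFFICIENCY = 0.75   # tuning target quoted in the article


def sustained_gbs(peak_gbs, efficiency):
    """Sustained bandwidth at a given fraction of theoretical peak."""
    return peak_gbs * efficiency


def ddr_peak_gbs(channels, mega_transfers, bytes_per_transfer=8):
    """Paper peak of a DDR memory system; 8-byte channels are an assumption."""
    return channels * mega_transfers * bytes_per_transfer / 1000


mars_tuned_gbs = sustained_gbs(MARS_PEAK_GBS, TARGET_EFFICIENCY)
uplift = mars_tuned_gbs / MARS_STREAM_NOW_GBS
naples_peak_gbs = ddr_peak_gbs(channels=8, mega_transfers=3200)

print(f"Mars today:        {MARS_STREAM_NOW_GBS:.1f} GB/s")
print(f"Mars at 75%:       {mars_tuned_gbs:.1f} GB/s ({uplift:.2f}x today)")
print(f"Naples paper peak: {naples_peak_gbs:.1f} GB/s (8 x DDR4-3200)")
```

Note that 16 channels of DDR3-1600 and 8 channels of DDR4-3200 come to the same paper peak, so the comparison really hinges on how much of it each platform sustains.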

Importantly, with current devices, the CPU supports beyond 1 TB memory per socket – closely matching the current Intel Xeon E5 processors’ memory capacity. And there’s something else: this extension interface can, if so desired, support direct FPGA or ASIC attachment where that FPGA, as with NVLink or QPI, would have shared memory with the CPU, or even be integrated together with the memory controller. That could open doors to some incredibly scalable accelerator solutions for AI / deep learning or other needs.

The interface itself differs from others of that type like HyperTransport, QPI or NVLink in that it is asymmetrical: each read channel is 64-bit while the write channel is 32-bit. This was chosen due to the usual 2:1 read-to-write access proportion in most applications. The total bandwidth per channel is 24 GB/s in the first version, planned to go faster in the next revision, together with further main memory bandwidth improvements.
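Why the 2:1 width split makes sense can be shown with a short sketch. The 64-bit read path, 32-bit write path, 24 GB/s per channel and 8 interfaces are from this story; the assumption here, which Phytium has not confirmed, is that the 24 GB/s divides between the two directions in proportion to their widths. The function name is hypothetical.

```python
from fractions import Fraction

# Figures quoted in the article.
READ_BITS = 64
WRITE_BITS = 32
TOTAL_GBS_PER_CHANNEL = 24
INTERFACES = 8

# Assumption: bandwidth splits in proportion to path width (same signalling rate).
READ_GBS = Fraction(TOTAL_GBS_PER_CHANNEL * READ_BITS, READ_BITS + WRITE_BITS)
WRITE_GBS = Fraction(TOTAL_GBS_PER_CHANNEL * WRITE_BITS, READ_BITS + WRITE_BITS)


def max_traffic_gbs(read_fraction):
    """Highest total traffic one channel can carry for a given read share."""
    limits = []
    if read_fraction > 0:
        limits.append(READ_GBS / read_fraction)         # read path saturates
    if read_fraction < 1:
        limits.append(WRITE_GBS / (1 - read_fraction))  # write path saturates
    return min(limits)


aggregate_gbs = INTERFACES * TOTAL_GBS_PER_CHANNEL

print(f"read path:  {float(READ_GBS):.0f} GB/s, write path: {float(WRITE_GBS):.0f} GB/s")
for share in (Fraction(1, 2), Fraction(2, 3), Fraction(1)):
    print(f"read share {float(share):.2f}: up to {float(max_traffic_gbs(share)):.0f} GB/s")
print(f"all {INTERFACES} interfaces: {aggregate_gbs} GB/s")
```

Under that assumption, only a 2:1 read-to-write mix keeps both directions busy at once and reaches the full 24 GB/s; a 50/50 mix or a pure read stream tops out at 16 GB/s per channel.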

I/O-wise, there are 32 PCIe V3 lanes on the CPU – enough to attach a single GPU or their own Matrix DSP accelerator coming soon – or two of them if you have nothing else, like storage or networking, on those lanes. Hence my wish for the team to, if possible, raise the PCIe lane count in the next generation to at least 48, so that dual-GPU plus storage/interconnect configurations become practical. This first Mars CPU is a single-socket board solution, so there is no second socket to add further I/O lanes, as Intel or AMD systems may have. Of course, it is highly likely that the follow-on to Mars will enable multi-socket board designs as well.
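The lane budgeting behind that wish is simple enough to sketch. The 32-lane and 48-lane totals are from this story; the x16 link per accelerator and x8 links for a network adapter and a storage controller are illustrative assumptions of mine, as are the function names.

```python
# Illustrative link widths (assumptions, not Phytium specifications).
ACCEL_LANES = 16    # one GPU or Matrix DSP accelerator
NIC_LANES = 8       # one interconnect / network adapter
STORAGE_LANES = 8   # one storage controller


def lanes_needed(accels, nics=0, storage=0):
    """Total PCIe lanes a board configuration would consume."""
    return accels * ACCEL_LANES + nics * NIC_LANES + storage * STORAGE_LANES


def fits(available_lanes, accels, nics=0, storage=0):
    """True if the configuration fits in the CPU's lane budget."""
    return lanes_needed(accels, nics, storage) <= available_lanes


configs = {
    "1 accel + NIC + storage": (1, 1, 1),
    "2 accels only": (2, 0, 0),
    "2 accels + NIC + storage": (2, 1, 1),
}
for name, cfg in configs.items():
    print(f"{name:26s} needs {lanes_needed(*cfg):2d} lanes | "
          f"32 lanes: {fits(32, *cfg)} | 48 lanes: {fits(48, *cfg)}")
```

With these widths, the dual-accelerator plus storage and interconnect setup needs exactly 48 lanes, which is why 48 is the minimum worth asking for.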

This is important, as follow-ons to Mars are likely to fully replace Intel Xeon in some of the future near-exascale and exascale China systems. One option is that Phytium’s future successor CPUs choose a homogeneous approach with very wide ARM V8A/V9 SIMD/vector units (up to 2048-bit) on the CPU chip, providing multi-TFLOP DP per socket and eliminating the need for accelerators. However, if accelerators like the forthcoming Matrix DSP are still the choice for adding extra TFLOPs, then each CPU should have sufficient connectivity – either PCIe or its own shared-memory link – to connect at least two accelerators per socket.
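To see how wide SIMD could get a socket into multi-TFLOP territory, here is a purely speculative scaling sketch. The starting point (64 cores, 2 GHz, 4 DP ops per cycle with 128-bit SIMD) is from this story; everything else is my assumption: that per-core DP throughput scales linearly with SIMD width, and that core count and clock stay the same. A real successor may differ on all three.

```python
# Current Mars figures quoted in the article.
BASE_SIMD_BITS = 128
BASE_DP_OPS_PER_CYCLE = 4
CORES = 64
CLOCK_GHZ = 2.0


def projected_peak_tflops(simd_bits, cores=CORES, clock_ghz=CLOCK_GHZ):
    """Paper DP peak, assuming throughput scales linearly with SIMD width."""
    ops_per_cycle = BASE_DP_OPS_PER_CYCLE * simd_bits // BASE_SIMD_BITS
    return cores * clock_ghz * ops_per_cycle / 1000


for bits in (128, 512, 2048):
    print(f"{bits:4d}-bit SIMD: {projected_peak_tflops(bits):.3f} TFLOPS DP peak")
```

Under these assumptions even 512-bit units would cross 2 TFLOPS per socket on paper, though feeding such units from memory is a separate question.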

First-look summary

Before starting real-life testing, the first look shows a very impressive feature set for their first ARM server CPU iteration, way more ambitious than any current Western ARM server processor. The Japanese Post-K Fujitsu ARM HPC CPU will not come out for quite a while yet; the Mars follow-on next year, however, will likely have the same SIMD extension standard now adopted by ARM itself.

Talking about that, the actual application performance of the platform will be interesting to observe, as this is the first time that any ARM code gets to use such socket-level parallel resources – so, time for some hands-on effort with it. This is where this platform may be interesting as well – as a high-end ARM software development and testing reference.

Potential use in VR/AR

And there we come to the last point in this story. Keep in mind that Mars is a general purpose server CPU for all kinds of server applications, from cloud and Web services to big data and HPC, including some of the future Chinese supercomputers. However, nothing stops it from being used in a high-performance 3-D visualization or VR workstation, either. After all, since the majority of mobile consumer VR and AR applications will run on ARM V8-equipped mobile phones and tablets, this high-end, fully compatible ARM V8 CPU with strong integer, FP and memory capabilities may be an excellent platform on which to design these consumer VR/AR apps and stress-test them to their limit before deploying them in the market.

In the next story, I will have an under-the-hood look at the first Mars reference server system. Stay tuned!