Earlier last month we detailed Centaur’s latest CNS core, which is part of their new CHA SoC. WikiChip had the opportunity to speak with the Centaur team, including Glenn Henry and Al Loper. The CHA SoC is quite different from their earlier chips, targeting a whole different market with a different set of capabilities.

CHA

Centaur’s latest SoC, CHA, is a very different product from what they have historically focused on. Instead of the client market, CHA targets servers, edge devices, embedded applications, and other similar devices. At a high level, the new CHA SoC features eight new CNS cores. Those cores are all interconnected over a beefy ring interconnect, with each core having its own ring stop. Also attached to the same ring are the memory controller, an inter-chip link, the southbridge, and the NCORE. The most interesting part here is the NCORE, a new home-grown, clean-sheet neural processor which we will discuss further in this article.

In our prior article, we detailed the CNS cores. Those are new high-performance cores that, based on the pipeline features and sizes, seem to be at around Zen and Broadwell levels of performance. Centaur put a lot of effort into ensuring the ISA support is also on par with their competition. To that end, the new CNS cores support up to AVX-512, including some of the extensions that were introduced with Cannon Lake such as AVX-512 IFMA and VBMI (note that Intel has yet to release server chips with this level of ISA support).



For the full details of the CNS core, see WikiChip’s full article:

Centaur already has silicon back and has made a number of public demonstrations. Current silicon demos operate at 2.5 GHz. It’s worth highlighting that on CHA, everything operates at that frequency – the ring, the x86 cores, and the NCORE. This makes Centaur’s NCORE, as far as we are aware, the highest clocked NPU on the market and by a sizable margin. A lot of effort has gone into ensuring everything can sustain that frequency, including under AVX-512 workloads.

Asynchronous & Lots of Bandwidth

CHA is designed such that the NCORE and the CPU cores can work simultaneously and asynchronously. This means there is a lot of traffic going around between the cores, the caches, and the memory. The SoC integrates a ring interconnect that actually comprises two unidirectional rings. Each ring is 512 bits wide and advances one ring stop per clock. As noted earlier, the ring operates at the same frequency as the rest of the chip. At 2.5 GHz, each ring can carry 160 GB/s, giving the pair a bidirectional bandwidth of 320 GB/s (2.56 Tb/s). Prior to Skylake, Intel also utilized a ring interconnect for its server SoCs, and they still make use of the ring for their client processors. Compared to Intel’s implementation, Centaur’s ring is twice as wide. Whereas Intel’s ring implementation required two packets to transmit a single cache line, Centaur’s only requires one. The wider interconnect goes a long way toward handling the heavy on-die traffic, including from the NCORE, which requires a lot of weights and a lot of inputs.
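The ring figures quoted above follow from simple arithmetic. The short Python sketch below reproduces them, assuming each ring moves one full 512-bit transfer per clock; the function and variable names are illustrative and not taken from Centaur.

```python
def ring_bandwidth_gb_s(width_bits: int, freq_ghz: float, rings: int) -> float:
    """Peak ring bandwidth in GB/s, assuming one full-width transfer per clock per ring."""
    bytes_per_clock = width_bits / 8  # 512 bits -> 64 bytes, i.e. one x86 cache line
    return bytes_per_clock * freq_ghz * rings


one_ring = ring_bandwidth_gb_s(512, 2.5, rings=1)    # a single direction
both_rings = ring_bandwidth_gb_s(512, 2.5, rings=2)  # both unidirectional rings
both_rings_tbit = both_rings * 8 / 1000              # GB/s -> Tb/s

print(f"{one_ring:.0f} GB/s per ring, {both_rings:.0f} GB/s total "
      f"({both_rings_tbit:.2f} Tb/s)")
```

Because 512 bits is exactly one 64-byte cache line, the same calculation also shows why a line fits in a single packet here, whereas a 256-bit ring would need two.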

CHA has a dedicated ring stop for the memory controller as well. The memory controller supports up to four DDR4 channels operating at up to 3200 MT/s. By comparison, Intel’s Skylake and Cascade Lake feature six memory channels, and AMD’s Naples and Rome have eight. Nonetheless, since Centaur integrated fewer cores, the theoretical peak bandwidth per core is considerably higher. Note that the effective peak bandwidth per core numbers in the table below count the NCORE as a core.

Memory Bandwidth

| SoC | CHA | Skylake | Cascade Lake | Naples | Rome |
|---|---|---|---|---|---|
| Designer | Centaur | Intel | Intel | AMD | AMD |
| Max Cores | 8+1 | 28 | 28 | 32 | 64 |
| Max Channels | 4 | 6 | 6 | 8 | 8 |
| Rate | 3,200 MT/s | 2,666 MT/s | 2,933 MT/s | 2,666 MT/s | 3,200 MT/s |
| Peak BW | 102.4 GB/s | 128 GB/s | 140.8 GB/s | 170.6 GB/s | 204.8 GB/s |
| Eff/Core BW | 11.38 GB/s | 4.57 GB/s | 5.03 GB/s | 5.33 GB/s | 3.2 GB/s |
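The peak and per-core numbers in the table can be re-derived from the channel count, transfer rate, and core count. The sketch below does so in Python, assuming a standard 64-bit (8-byte) DDR4 channel; the names `SOCS`, `peak_bw_gb_s`, and `per_core_bw_gb_s` are illustrative.

```python
# (channels, transfer rate in MT/s, cores counted), taken from the table above.
SOCS = {
    "CHA": (4, 3200, 8 + 1),  # eight CNS cores plus the NCORE
    "Skylake": (6, 2666, 28),
    "Cascade Lake": (6, 2933, 28),
    "Naples": (8, 2666, 32),
    "Rome": (8, 3200, 64),
}

BYTES_PER_TRANSFER = 8  # assumption: 64-bit data bus per DDR4 channel


def peak_bw_gb_s(channels: int, rate_mt_s: int) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return channels * rate_mt_s * BYTES_PER_TRANSFER / 1000


def per_core_bw_gb_s(channels: int, rate_mt_s: int, cores: int) -> float:
    """Peak bandwidth divided evenly across all cores."""
    return peak_bw_gb_s(channels, rate_mt_s) / cores


for name, (channels, rate, cores) in SOCS.items():
    print(f"{name:13s} peak {peak_bw_gb_s(channels, rate):6.1f} GB/s, "
          f"per core {per_core_bw_gb_s(channels, rate, cores):5.2f} GB/s")
```

These are theoretical ceilings at the maximum core count of each part; lower-core-count SKUs of the competing chips would show a higher per-core figure.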

NCORE

The most distinctive feature of Centaur’s new SoC, and perhaps its biggest selling point, is the highly capable integrated neural processor, codenamed NCORE. This NPU is a first-class citizen. It sits on the ring interconnect with its own dedicated ring stop and can directly interface with the cores’ L3 cache slices, main memory, and I/O just like any other core. As with the ring and the x86 cores, the NCORE operates at the full SoC frequency, which Centaur demoed at 2.5 GHz. As far as we are aware, that is considerably higher than any other NPU out there.

The NCORE itself has a very interesting design. It’s a whopping 32,768-bit-wide SIMD VLIW machine. It’s a fully clean-sheet in-house design with a lot of specific enhancements for AI applications. Centaur designed a whole software stack around this processor that uses the standard frameworks most people are already familiar with, such as TensorFlow and, in the future, PyTorch. Centaur’s software compiles the model into two instruction streams – x86 code and the AI coprocessor code. The current implementation already includes significant optimizations for x86, including optimized AVX and AVX-512 code.

Logically, the NCORE behaves like any other PCIe-attached device, albeit with the advantages (bandwidth, latency) of being integrated on-die. At runtime, the software driver (Linux based) is capable of orchestrating the operations by feeding the NCORE as well as the x86 cores with the instructions and appropriate operations.

At a deeper level, the NCORE sends and receives data over the ring at its ring stop. There are two DMA channels that work asynchronously, fetching upcoming data. Data resides in a massive 2-banked 16 MiB cache. What’s interesting about this cache is that it operates at the same frequency as the rest of the NCORE and is capable of reading a full 4,096-byte line from each bank (8K in total!) each cycle. At 2.5 GHz, this works out to a whopping 20.5 TB/s of bandwidth, and while a program doesn’t have to saturate all that bandwidth, it’s there if it needs it.
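As a quick check on that figure, the following Python sketch multiplies out the numbers given above and compares the result with the ring. It assumes the 2.5 GHz demo clock and one line read per bank per cycle; the constant names are illustrative.

```python
BANKS = 2
LINE_BYTES = 4096         # one full line per bank per cycle
FREQ_HZ = 2.5e9           # demo frequency of the NCORE
RING_BYTES_PER_CLOCK = 2 * 64  # two 512-bit rings, as described earlier

bytes_per_cycle = BANKS * LINE_BYTES
peak_tb_s = bytes_per_cycle * FREQ_HZ / 1e12

# How many times more data the local RAM can deliver per clock than the ring.
ratio_vs_ring = bytes_per_cycle / RING_BYTES_PER_CLOCK

print(f"{bytes_per_cycle} B/cycle -> {peak_tb_s:.2f} TB/s, "
      f"{ratio_vs_ring:.0f}x the ring per clock")
```

The gap between the local RAM and the ring helps explain why the DMA channels prefetch upcoming data rather than streaming operands over the ring on demand.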



For the full details of the NCORE NPU, see WikiChip’s full article:

The NCORE itself has a highly modular design that could be configured and scaled as desired in the future. The NCORE is built using “slices”. Each slice is effectively a 256-byte SIMD unit plus a 1 MiB cache slice. In its current full configuration, the NCORE comprises sixteen slices that together form the full 4K-byte SIMD + 16 MiB SRAM NCORE. The entire NCORE is 34.4 mm², or roughly 17.6% of the SoC silicon.
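The slice arithmetic can be sketched in a few lines of Python. The `NcoreConfig` class below is purely illustrative, it assumes 256 bytes of SIMD and 1 MiB of SRAM per slice (4,096 bytes and 16 MiB divided by sixteen slices), and the eight-slice variant is a hypothetical configuration, not an announced product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NcoreConfig:
    """Hypothetical model of an NCORE built from identical slices."""
    slices: int
    simd_bytes_per_slice: int = 256  # assumption: 4,096 B / 16 slices
    sram_mib_per_slice: int = 1

    @property
    def simd_bytes(self) -> int:
        return self.slices * self.simd_bytes_per_slice

    @property
    def simd_bits(self) -> int:
        return self.simd_bytes * 8

    @property
    def sram_mib(self) -> int:
        return self.slices * self.sram_mib_per_slice


full = NcoreConfig(slices=16)  # the configuration described in the article
half = NcoreConfig(slices=8)   # hypothetical scaled-down variant

# SoC die area implied by "34.4 mm^2 is roughly 17.6% of the SoC".
implied_soc_area_mm2 = 34.4 / 0.176

print(full.simd_bytes, full.simd_bits, full.sram_mib)
print(half.simd_bytes, half.sram_mib)
print(f"implied SoC area: about {implied_soc_area_mm2:.0f} mm^2")
```

The implied die area is only a back-of-the-envelope estimate, since both inputs are rounded.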

The NCORE pipeline is relatively simple, incorporating three major units: a neural data unit, a neural processing unit, and an output unit. Those three units correspond to the usual pre-processing, processing, and post-processing operations required by most AI models. Instructions are fed into the microsequencer instruction memory from main memory or the caches by the device driver or the cores. From here, the NCORE runs on its own and executes the 128-bit instructions across all the slices.

As mentioned earlier, up to two 4K-byte lines are fed into the data unit each cycle. The neural data unit makes sure the processing unit has processed data ready for the next cycle. The data unit is quite powerful, capable of doing operations such as broadcasting, compression/pooling, and rotating, among other specialized functions, on the entire 4K-byte line each cycle. Data from here is passed to the neural processing unit, which does the usual computations such as the billions of MAC operations required. It also supports various other ALU operations such as shifting and min/max. Results can accumulate in a dedicated 4K-entry 32-bit accumulator that supports both floating-point and integer values and saturates on overflow. The neural processor integrates a 4 × 4K-byte vector register file which is used for both inputs and outputs as required. For example, you can merge a value from memory with a value from a register and store the result back in a register. All those 4K-byte operations, including reading and writing to the registers, can be done in a single cycle.
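To make the data flow concrete, here is a minimal Python model of one multiply-accumulate step with a saturating 32-bit accumulator, preceded by a rotate of the input line. It is a behavioral sketch only: the 4,096 one-byte lanes, the signed 8-bit inputs, the rotate direction, and all function names are assumptions for illustration, not Centaur's actual instruction semantics.

```python
LANES = 4096  # assumption: one lane per byte of a 4K-byte line
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def saturate_i32(value: int) -> int:
    """Clamp to the signed 32-bit range instead of wrapping around."""
    return max(INT32_MIN, min(INT32_MAX, value))


def rotate(line: list, n: int) -> list:
    """Rotate a line left by n lanes (direction chosen arbitrarily here)."""
    n %= len(line)
    return line[n:] + line[:n]


def mac(acc: list, data: list, weights: list) -> list:
    """One step: acc[i] += data[i] * weights[i] in every lane, saturating."""
    return [saturate_i32(a + d * w) for a, d, w in zip(acc, data, weights)]


# Signed 8-bit sample data: -128..127 repeated across the line.
data = [(i % 256) - 128 for i in range(LANES)]
weights = [2] * LANES

acc = mac([0] * LANES, data, weights)
acc_rotated = mac([0] * LANES, rotate(data, 1), weights)

# Accumulators that are already near the limit clamp rather than overflow.
near_limit = mac([INT32_MAX - 10] * LANES, [127] * LANES, [127] * LANES)

print(acc[0], acc[255], acc_rotated[0], near_limit[0] == INT32_MAX)
```

In hardware all lanes update in the same cycle; the list comprehension here only mirrors the per-lane arithmetic, not the timing.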

Following the main processing, data is fed to the output unit, which handles the usual post-processing. Here, the NCORE incorporates an activation unit that supports all the typical functions such as TanH and ReLU, among many others. This unit can also perform other operations such as compression and quantization, preparing the results for use in the next convolution.
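The post-processing stage can be illustrated with a small Python sketch that applies ReLU to 32-bit accumulator values and requantizes them to unsigned 8-bit for the next layer. The scale factor, the unsigned 8-bit output format, and the function names are assumptions chosen for the example; the article does not specify the NCORE's actual quantization scheme.

```python
U8_MIN, U8_MAX = 0, 255


def relu(x: int) -> int:
    """Rectified linear unit: negative values become zero."""
    return x if x > 0 else 0


def requantize_u8(value: float, scale: float) -> int:
    """Scale a wide accumulator value and clamp it to unsigned 8-bit."""
    return max(U8_MIN, min(U8_MAX, round(value * scale)))


def output_unit(accumulators: list, scale: float, activation=relu) -> list:
    """Apply the activation, then requantize every lane."""
    return [requantize_u8(activation(a), scale) for a in accumulators]


# Hypothetical accumulator values and a power-of-two scale (1/128).
result = output_unit([-500, 0, 1280, 100_000], scale=1 / 128)
print(result)
```

Narrowing the results back to 8 bits is what allows the output of one layer to be written back as a compact input line for the next convolution.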

MLPerf

Late last year MLPerf published the first inference benchmark results. The incredible part is that Centaur had just a month from working silicon to the submission deadline. They managed to get in results for four of the five benchmarks. Those results fall in the ‘closed’ competition under the ‘preview’ category (as opposed to the ‘available’ category) since the product is still in pre-production. So we are looking at very early results with room for improvement.

What stands out is the fast inference latency of the NCORE. For example, on MobileNet using ImageNet, Centaur scored the fastest latency of 330 μs. Likewise, on ResNet-50 v1.5 using ImageNet, Centaur scored about 1 ms. This is half the latency of the Nvidia Xavier Jetson. That’s a sizable advantage considering the NCORE is baked into the CHA SoC whereas alternative NPUs have to be acquired separately and attached as accelerator cards. Intel also submitted a number of results using the Xeon Platinum 9200. Those are very powerful chips but they are also incredibly expensive and power-hungry. Nonetheless, the results show that even with this early silicon, the NCORE can match over 23 of the latest Cascade Lake-AP cores with VNNI which run faster than the NCORE and consume a lot more power.

Overall, Centaur’s CHA is a unique product: a high-performance server-class SoC with a “free” and relatively powerful NCORE AI accelerator integrated on-die. For many applications, this can eliminate the need for an external accelerator while providing very good performance at lower power and still offering flexibility with the eight standard x86 cores. CHA should do well in edge applications where you’d want to run an existing x86-based AI software stack.

Derived WikiChip articles: CHA.