This note summarises details of some of the new silicon chips for machine intelligence. Its aim is to distil the most important implementation and architectural details (at least those that are currently available) and to highlight the main differences between them. I’m focusing on chips designed for training since they represent the frontier in performance and capability. There are many chips designed for inference, but these are typically intended for use in embedded or edge deployments.

Cerebras Wafer-Scale Engine

The Cerebras Wafer-Scale Engine (WSE) is undoubtedly the boldest and most innovative design to appear recently. Wafer-scale integration is not a new idea, but integration issues to do with yield, power delivery and thermal expansion have made it difficult to commercialise (see the 1989 Anamartic 160 MB solid state disk). Cerebras uses this approach to integrate 84 chips with high-speed interconnect, uniformly scaling the 2D-mesh network to huge proportions. This provides a machine with a large amount of memory (18 GB) distributed among a large amount of compute (3.3 Peta FLOPs peak). It is unclear how this architecture scales beyond single WSEs; the current trend in neural nets is towards larger networks with billions of weights, which will necessitate such scaling.

General details:

Announced August 2019.

46,225 mm² wafer-scale integrated system (215 mm x 215 mm) on TSMC 16 nm.

1.2T transistors.

Many individual chips: a total of 84 (12 wide by 7 tall).

18 GB total of SRAM memory, distributed among cores.

426,384 simple compute cores.

Silicon defects can be repaired by using redundant cores and links to bypass a faulty area. It appears that each column includes one redundant core, leaving 410,592 functional cores.

Speculated clock speed of ~1 GHz and 15 kW power consumption.
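These figures hang together; a back-of-envelope check (assuming the speculated ~1 GHz clock and the peak 8 operations per core per cycle) reproduces the headline peak:

```python
# Back-of-envelope check of the WSE peak-FLOPs figure.
# Assumes the speculated ~1 GHz clock; core count is after repair.
functional_cores = 410_592      # 426,384 total minus redundant repair cores
ops_per_cycle = 8               # peak FMAC datapath operations per core
clock_hz = 1e9                  # speculated ~1 GHz
peak_pflops = functional_cores * ops_per_cycle * clock_hz / 1e15
print(round(peak_pflops, 2))    # → 3.28, matching the quoted 3.3 PFLOPs peak
```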

Interconnect and IO:

Interconnections between chips, across scribe lines, with wiring added in post-processing steps after conventional wafer manufacturing.

IOs brought out on east and west edges of the wafer, limited by the pad density on each edge. It is unlikely there are any high-speed SerDes since these would need to be integrated in every chip, leaving a sizeable part of the wafer area redundant except for chips with edges on the periphery.

2D mesh-based interconnect, supports single-word messages. According to their whitepaper: “The Cerebras software configures all the cores on the WSE to support the precise communication required” indicating that the interconnect is statically configured to support a fixed communication pattern.

Zeros not transmitted on the interconnect to optimise for sparsity.

Each core:

Is ~0.1 mm² of silicon.

Has 47 kB SRAM memory.

Zeros not loaded from memory and zeros not multiplied.

Assumed FP32 precision and scalar execution (can’t filter zeros from memory with SIMD).

FMAC datapath (peak 8 operations per cycle).

Tensor control unit to feed the FMAC datapath with strided accesses from memory or inbound data from links.

Has four 8 GB/s bidirectional links to its neighbours.
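The zero-filtering behaviour described above can be sketched as a dot product that skips zero operands, so work scales with the number of non-zeros rather than the vector length (an illustrative sketch; the names are not Cerebras terminology):

```python
# Sketch of sparsity filtering: zero operands are never loaded or multiplied.
def sparse_dot(a, b):
    total = 0.0
    mults = 0
    for x, y in zip(a, b):
        if x != 0.0 and y != 0.0:   # filter zeros before the multiply
            total += x * y
            mults += 1
    return total, mults

val, mults = sparse_dot([0.0, 2.0, 0.0, 3.0], [5.0, 4.0, 1.0, 0.0])
print(val, mults)   # → 8.0 1: one multiply performed, three pairs skipped
```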

Each die:

Is 17 mm x 30 mm = 510 mm² of silicon.

Has 225 MB SRAM memory.

Has 54 x 94 = 5,076 cores (two cores per row/column possibly unused due to the repair scheme, leaving 4,888 usable cores).

Peak FP32 performance of 40 Tera FLOPs.

Google TPU v3

With few details available on the specification of the TPU v3, it is likely an incremental improvement over the TPU v2: doubling the performance, and adding HBM2 memory to double the capacity and bandwidth.

General details (per chip):

Announced May 2018.

Likely to be 16nm or 12nm.

200 W estimated TDP.

105 TFLOPs of BFloat16, likely from doubling the MXUs to four.

Each MXU has dedicated access to 8 GB of memory.

Integrated in four-chip modules, 420 TFLOPs peak performance.

IO:

32 GB HBM2 integrated memory with access bandwidth of 1200 GBps (assumed).

PCIe-3 x8 assumed at 8 GBps.

Google TPU v2

The TPU v2 is designed for training and inference. It improves over the TPU v1 with floating point arithmetic and enhanced memory capacity and bandwidth with HBM integrated memory.

General details (per chip):

Announced May 2017.

Likely to be 20nm.

200-250 W estimated TDP.

45 TFLOPs of BFloat16.

Two cores with scalar and matrix units.

Also supports FP32.

Integrated in four-chip modules, 180 TFLOPs peak performance.

Each core:

128x128x32b systolic matrix unit (MXU) with BFloat16 multiplication and FP32 accumulation.

8 GB of dedicated HBM with access bandwidth of 300 GBps.

Peak throughput of 22.5 TFLOPs of BFloat16.
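The MXU’s numeric behaviour (BFloat16 inputs, FP32 accumulation) can be sketched by truncating FP32 values to the top 16 bits, which keeps the 8-bit exponent and a 7-bit mantissa. This is a simplified model: it truncates rather than rounds-to-nearest, and the function names are illustrative, not TPU APIs.

```python
# Sketch of BFloat16-multiply, FP32-accumulate arithmetic.
import struct

def to_bfloat16(x: float) -> float:
    # Truncate an FP32 value to BFloat16 by keeping its top 16 bits.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

def mxu_dot(a, b):
    # Multiply BFloat16-truncated operands; accumulate in full FP32/float.
    return sum(to_bfloat16(x) * to_bfloat16(y) for x, y in zip(a, b))

print(to_bfloat16(1.2001953125))  # → 1.1953125 (mantissa cut to 7 bits)
print(mxu_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # → 32.0 (exact here)
```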

IO:

16 GB HBM integrated memory at 600 GBps bandwidth (assumed).

PCIe-3 x8 (8 GBps).

Google TPU v1

Google’s first generation TPU was designed for inference only and supports only integer arithmetic. It provides acceleration to a host CPU by being sent instructions across PCIe-3, to perform matrix multiplications and apply activation functions. This is a significant simplification which would have saved much time in design and verification.

General details:

Announced in 2016.

331 mm² die on a 28 nm process.

Clocked at 700 MHz with a 28-40 W TDP.

28 MB on-chip SRAM memory: 24 MB for activations and 4 MB for accumulators.

Proportions of the die area: 35% memory, 24% matrix multiply unit, 41% remaining logic.

256x256x8b systolic matrix multiply unit (64K MACs/cycle).

INT8 and INT16 arithmetic (peak 92 and 23 TOPs/s respectively).
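The peak INT8 figure follows directly from the systolic array dimensions and clock given above, counting each MAC as two operations:

```python
# Check the TPU v1 peak-TOPs figure from the matrix unit dimensions.
macs_per_cycle = 256 * 256      # systolic array: 64K INT8 MACs per cycle
clock_hz = 700e6                # 700 MHz
peak_int8_tops = macs_per_cycle * 2 * clock_hz / 1e12   # 2 ops per MAC
print(round(peak_int8_tops, 1))  # → 91.8, i.e. the quoted peak of 92 TOPs
```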

IO:

8 GB DDR3-2133 DRAM accessible via two ports at 34 GB/s.

PCIe-3 x16 (14 GBps).

Graphcore IPU

DISCLAIMER: I work at Graphcore, and all of the information given here is lifted directly from publicly available references.

The Graphcore IPU architecture is highly parallel with a large collection of simple processors with small memories, connected by a high-bandwidth all-to-all ‘exchange’ interconnect. The architecture operates under a bulk-synchronous parallel (BSP) model, whereby execution of a program proceeds as a sequence of compute and exchange phases. Synchronisation is used to ensure all processes are ready to start exchange. The BSP model is a powerful programming abstraction because it precludes concurrency hazards, and BSP execution allows the compute and exchange phases to make full use of the chip’s power resources. Larger systems of IPU chips can be built by connecting the 10 inter-IPU links.
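The BSP model described above can be sketched with ordinary threads and a barrier: each worker alternates compute phases (local memory only) and exchange phases (sending to a neighbour), with global synchronisation in between. A minimal sketch under stated assumptions; `worker`, `inboxes` and the ring pattern are illustrative, not Graphcore APIs.

```python
# Minimal BSP sketch: compute and exchange phases separated by barriers.
import threading

N = 4
barrier = threading.Barrier(N)
inboxes = [[] for _ in range(N)]  # each inbox written by exactly one worker

def worker(i, steps=2):
    local = i
    for _ in range(steps):
        local = local * 2                    # compute phase: local state only
        barrier.wait()                       # sync: all ready to exchange
        inboxes[(i + 1) % N].append(local)   # exchange phase: send to neighbour
        barrier.wait()                       # sync: exchange complete

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(inboxes)   # → [[6, 12], [0, 0], [2, 4], [4, 8]]
```

Because no worker starts the next compute phase until every exchange has finished, there are no concurrency hazards to reason about, which is the abstraction’s main appeal.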

General details:

16 nm, 23.6 bn transistors, ~800 mm² die size.

1216 processor tiles.

125 TFLOPs peak FP16 arithmetic with FP32 accumulation.

300 MB total on-chip memory, distributed among processor cores, providing an aggregate access bandwidth of 45 TBps.

All model state held on chip; there is no directly-attached DRAM.

150 W TDP (300 W PCIe card).

IO:

2x PCIe-4 host IO links.

10x inter-card ‘IPU links’.

Total of 384 GBps IO bandwidth.

Each core:

Mixed-precision floating-point stochastic arithmetic.

Runs up to six program threads.

Habana Labs Gaudi

Habana’s Gaudi AI training processor shares similarities with contemporary GPUs, particularly wide SIMD parallelism and HBM2 memory. The chip integrates ten 100G Ethernet links which support remote direct memory access (RDMA). This IO capability allows large systems to be built with commodity networking equipment, in contrast to proprietary interconnects such as Nvidia’s NVLink or OpenCAPI.

General details:

Announced June 2019.

TSMC 16 nm with CoWoS, assumed die size ~500 mm².

Heterogeneous architecture with: a GEMM operations engine; 8 Tensor Processing Cores (TPCs); a shared SRAM memory (software managed and accessible via RDMA).

200 W TDP for the PCIe card and 300 W TDP for the mezzanine card.

Unknown total on-chip memory.

Explicit memory management between chips (no coherency).

TPC core:

VLIW SIMD parallelism and a local SRAM memory.

Mixed precision: FP32, BF16, plus integer formats (INT32, INT16, INT8, UINT32, UINT8).

Random number generation.

Transcendental functions: Sigmoid, Tanh, Gaussian error linear unit (GeLU).

Tensor addressing and strided access.

Unknown local memory per TPC.

IO:

4x HBM2-2000 DRAM stacks providing 32 GB at 1 TBps.

10x 100GbE interfaces are integrated on-chip, supporting RDMA over Converged Ethernet (RoCE v2).

IOs are implemented with 20x 56 Gbps PAM4 Tx/Rx SerDes and can also be configured as 20x 50 GbE. This allows up to 64 chips to be connected with non-blocking throughput.

PCIe-4 x16 host interface.

Huawei Ascend 910

Huawei’s Ascend also bears similarities to the latest GPUs, with wide SIMD arithmetic, a 3D matrix unit comparable to Nvidia’s Tensor Cores, and an (assumed) coherent 32 MB shared on-chip L2 cache. The chip includes additional logic for 128-channel video decoding of H.264/265. In their Hot Chips presentation, Huawei described overlapping the cube and vector operations to obtain high efficiency, and the challenge of the memory hierarchy: the ratio of bandwidth to throughput drops by 10x for L1 cache (in the core), 100x for L2 cache (shared between cores), and 2000x for external DRAM.
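The chip-level figures quoted later in this section give a feel for these ratios. (Huawei’s 10x/100x/2000x drops also involve per-core L1 bandwidth, which is not listed here, so the numbers below are a coarser, chip-level version of the same point.)

```python
# FLOPs available per byte moved at each level of the hierarchy,
# from the chip-level figures in this section.
peak_fp16_flops = 256e12   # peak FP16 throughput
l2_bw_bytes = 4e12         # 4 TBps access to the shared L2
hbm_bw_bytes = 1.2e12      # 1.2 TBps HBM2 bandwidth
print(round(peak_fp16_flops / l2_bw_bytes))    # → 64 FLOPs per L2 byte
print(round(peak_fp16_flops / hbm_bw_bytes))   # → 213 FLOPs per DRAM byte
```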

General details:

Announced August 2019.

456 mm² logic die on a 7+ nm EUV process.

Copackaged with four 96 mm² HBM2 stacks and a ‘Nimbus’ IO processor chip.

32 DaVinci cores.

Peak 256 TFLOPs (32 x 4096 x 2) FP16 performance, double that for INT8.

32 MB shared on-chip SRAM (L2 cache).

350 W TDP.
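The quoted 256 TFLOPs figure (32 x 4096 x 2) implies a clock of roughly 1 GHz, which is not stated above; this back-of-envelope derivation makes the assumption explicit:

```python
# Derive the clock implied by the quoted 256 TFLOPs FP16 peak.
cores, macs_per_core, ops_per_mac = 32, 4096, 2
ops_per_cycle = cores * macs_per_core * ops_per_mac   # 262,144 FLOPs/cycle
implied_clock_ghz = 256e12 / ops_per_cycle / 1e9
print(round(implied_clock_ghz, 2))   # → 0.98, i.e. a ~1 GHz clock
```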

Interconnect and IO:

Cores interconnected in a 6 x 4 2D mesh packet-switched network, providing 128 GBps bidirectional bandwidth per core.

4 TBps access to L2 cache.

1.2 TBps HBM2 access bandwidth.

3x 30 GBps inter-chip IOs.

2x 25 GBps RoCE networking interfaces.

Each DaVinci core:

3D 16x16x16 matrix multiply unit providing 4,096 FP16 MACs and 8,192 INT8 MACs.

2,048-bit SIMD vector operations for FP32 (x64), FP16 (x128) and INT8 (x256).

Support for scalar operations.

Intel NNP-T

This chip is Intel’s second attempt at an accelerator for machine learning, following the Xeon Phi. Like the Habana Gaudi chip, it integrates a small number of wide vector cores, with HBM2 integrated memory and similar 100 Gbit IO links.

General details:

27 bn transistors.

688 mm² die on TSMC 16FF+ with CoWoS.

32 GB HBM2-2400 in four 8 GB stacks integrated on a 1200 mm² passive silicon interposer.

60 MB on-chip SRAM memory distributed among cores, ECC protected.

Up to 1.1 GHz core clock.

150-250 W TDP.

24 Tensor Processing Cluster (TPC) cores.

TPCs connected in a 2D mesh network topology. Separate networks for different types of data: control, memory and inter-chip communication. Support for multicast.

119 TOPs peak performance.

IO:

1.22 TBps HBM2 bandwidth.

64 lanes of SerDes with peak 3.58 Tbps aggregate bandwidth (28 Gbps each direction in each lane) for inter-chip IOs.

x16 PCIe-4 host interface (also supports OAM, Open Compute).

TPC core:

2x 32x32 BFloat16 multiplier arrays supporting FMAC operation with FP32 accumulation.

Vector FP32 and BFloat16 operations. Support for transcendental functions, random number generation, reductions and accumulations. Programmable FP32 lookup tables.

A separate convolution engine for non-MAC compute.

2.5 MB two-port private memory with 1.4 TBps read/write bandwidth.

Memory supports tensor transpose operation.

Communication interface supporting dynamic packet routing on the mesh (virtual channel, reliable transmission).

Scaling:

Up to 1024 chips with direct interconnections, providing the same distributed-memory programming model (explicit memory management, synchronisation primitives, message passing).

Scaling demonstrated up to 32 chips connected in a ring topology.

Nvidia Volta

Volta introduces Tensor Cores, HBM2 and NVLink 2.0, advancing over the Pascal architecture.

General details:

Announced May 2017.

815 mm² on TSMC 12 nm FFN, 21.1 bn transistors.

300 W TDP (SXM2 form factor).

6 MB L2 cache.

84 SMs, each containing: 64 FP32 CUDA cores, 32 FP64 CUDA cores and 8 Tensor Cores (5376 FP32 cores, 2688 FP64 cores, 672 TCs).

Tensor Cores perform a 4x4 matrix FMA, achieving 64 FMA ops/cycle (128 FLOPs).

128 KB L1 data cache/shared memory and four 16K 32-bit register files per SM.
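The 64 FMA ops/cycle figure follows from the three nested length-4 loops of a 4x4 matrix multiply-accumulate (D = A x B + C); a small sketch to count them:

```python
# Count the multiply-adds in the 4x4 matrix FMA a Tensor Core performs.
def matrix_fma(A, B, C):
    n, macs = 4, 0
    D = [row[:] for row in C]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                D[i][j] += A[i][k] * B[k][j]   # one fused multiply-add
                macs += 1
    return D, macs

I = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
Z = [[0] * 4 for _ in range(4)]
D, macs = matrix_fma(I, I, Z)                  # identity x identity + zeros
print(macs, 2 * macs)   # → 64 128: 64 FMAs per cycle, i.e. 128 FLOPs
```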

IO:

32 GB HBM2 DRAM at 900 GBps bandwidth.

NVLink 2.0 at 300 GBps.

Nvidia Turing

Turing is an architectural revision of Volta, manufactured on the same 12 nm FFN process, but with fewer CUDA and Tensor Cores. It consequently has a smaller die size and lower power envelope. Apart from ML tasks, it is designed to perform real-time ray tracing, for which it also uses the Tensor Cores.

General details:

Announced September 2018.

TSMC 12 nm FFN, 754 mm² die, 18.6 bn transistors.

260 W TDP.

72 SMs, each containing: 64 FP32 cores, 64 INT32 cores and 8 Tensor Cores (4608 FP32 cores, 4608 INT32 cores and 576 TCs).

Peak performance with boost clock: 16.3 TFLOPs FP32, 130.5 TFLOPs FP16, 261 TOPs INT8, 522 TOPs INT4.

24.5 MB on-chip memory across the 6 MB L2 cache and 256 KB per-SM register files.

1455 MHz base clock.
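The peak figures above are mutually consistent: deriving the boost clock from the FP32 number and the core count reproduces the Tensor Core FP16 number (a back-of-envelope check, counting 2 FLOPs per core per cycle and 128 FLOPs per Tensor Core per cycle):

```python
# Cross-check Turing's quoted peak figures against each other.
boost_hz = 16.3e12 / (4608 * 2)       # 2 FLOPs/cycle per FP32 core
fp16_tflops = 576 * 64 * 2 * boost_hz / 1e12   # 64 FMAs = 128 FLOPs per TC
print(round(boost_hz / 1e6))          # → 1769, i.e. ~1770 MHz boost clock
print(round(fp16_tflops, 1))          # → 130.4, matching the quoted 130.5
```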

IO:

12x 32-bit GDDR6 memory providing 672 GBps aggregate bandwidth.

2x NVLink x8 links, each providing up to 26 GBps bidirectional.

Further reading

See this thread on Hacker News for discussion of this note.