SAN JOSE, Calif. – Startup Cerebras will describe at Hot Chips the world's largest semiconductor device, a 16nm wafer-sized processor array that aims to unseat the dominance of Nvidia's GPUs in training neural networks. The whopping 46,225-mm2 die consumes 15 kW, packs 400,000 cores, and is running in a handful of systems with at least one unnamed customer.

Also at this week's event, Huawei, Intel and startup Habana will detail their chips for training neural networks. All aim to attack Nvidia, which last year sold about $3 billion in GPUs for the performance-hungry application.

Intel's 1.1-GHz Spring Crest aims to stand out from the pack by ganging its 64 28G serdes into 16 112-Gbit/second lanes linking up to 1,024 chips. The proprietary interconnect is a direct, protocol-less link that does not need to pass through external HBM2 memory, enabling a relatively fast way to spread large neural networks across multiple processors and chassis.
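The serdes and lane figures above are consistent with each other; a quick back-of-the-envelope check (the four-serdes-per-lane grouping is arithmetic inferred from the article's numbers, not a published Intel spec):

```python
# Sanity-check Spring Crest's reported interconnect figures.
SERDES_COUNT = 64       # 28G serdes per chip
SERDES_RATE_GBPS = 28   # Gbit/s per serdes
LANE_COUNT = 16         # 112G lanes per chip
LANE_RATE_GBPS = 112    # Gbit/s per lane

serdes_per_lane = SERDES_COUNT // LANE_COUNT      # 64 / 16 = 4 serdes per lane
lane_rate = serdes_per_lane * SERDES_RATE_GBPS    # 4 * 28 = 112 Gbit/s, as reported
aggregate_gbps = LANE_COUNT * LANE_RATE_GBPS      # 16 * 112 = 1,792 Gbit/s per chip

print(serdes_per_lane, lane_rate, aggregate_gbps)  # → 4 112 1792
```

In other words, each 112G lane is plausibly four 28G serdes bonded together, for roughly 1.8 Tbit/s of raw off-chip bandwidth per device.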

By putting all its cores, memories and interconnects on one wafer, the Cerebras approach will be even faster and fit in one box.

The startup has raised more than $200 million from veteran investors to be the first to commercialize wafer-scale integration, pioneering new techniques in packaging and wafer handling. It's betting the AI training market will expand from seven hyperscale data centers to hundreds of companies in everything from pharma to fintech that want to keep their data sets to themselves.

How it works

The Cerebras device packs 84 tiles in a 7×12 array. Each tile includes about 4,800 cores geared for AI's sparse linear algebra, and each core carries 48 Kbytes of SRAM as its sole memory.

The single-level memory hierarchy speeds processing, enabled by training workloads' limited need for memory sharing across cores. The chip's total of 18 Gbytes of SRAM is huge compared to a single Nvidia GPU, but small compared to the multi-GPU systems Cerebras aims to compete with.
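The headline numbers hang together: 84 tiles of ~4,800 cores gives the "400,000 cores" figure, and 48 Kbytes per core gives the 18-Gbyte SRAM total (treating Gbytes as GiB):

```python
# Check the Cerebras totals from the per-tile figures in the article.
TILES = 7 * 12              # 84 tiles in the 7x12 array
CORES_PER_TILE = 4_800      # approximate cores per tile
SRAM_PER_CORE_KB = 48       # Kbytes of SRAM per core, its sole memory

total_cores = TILES * CORES_PER_TILE                        # 403,200 ≈ "400,000 cores"
total_sram_gib = total_cores * SRAM_PER_CORE_KB / (1024 * 1024)

print(total_cores)                  # → 403200
print(round(total_sram_gib, 1))    # → 18.5, i.e. the reported "18 Gbytes"
```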

The company will not comment on the frequency of the device, which is likely low to help manage its power and thermal demands. The startup's veteran engineers have "done 2-3 GHz chips before, but that's not the goal here; the returns to cranking the clock are less than adding cores," said Andrew Feldman, chief executive and a founder of Cerebras.

Feldman wouldn't comment on the cost, design or roadmap for the rack system Cerebras plans to sell. But he said the box will deliver the performance of a farm of a thousand Nvidia GPUs that can take months to assemble, while requiring just 2-3% of the farm's space and power.

The Cerebras device is vastly larger than an Nvidia GPU and any other competing chip for AI training. (Images: Cerebras)

The company aims to describe the system, its performance and benchmarks at the Supercomputing conference in November. Attendees there will appreciate its historic significance, given that the last similar effort was a 3.5-inch wafer from Trilogy, Gene Amdahl's 1980s supercomputer startup.

The Cerebras compiler will ingest a TensorFlow or PyTorch model, convert it to machine language and use microcode libraries to map neural network layers to regions of the giant chip. It does that in part by programming instructions on the cores and configuring the mesh network that links the tiles.
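To make the mapping idea concrete, here is a minimal sketch of the kind of layer-to-region placement such a flow might perform. The function name, the per-layer FLOP inputs and the proportional-allocation heuristic are all illustrative assumptions, not Cerebras's actual algorithm:

```python
# Hypothetical sketch: give each layer a contiguous run of tiles sized in
# proportion to its compute, so the whole network stays resident on-chip.
def map_layers_to_tiles(layer_flops, total_tiles=84):
    """Assign each layer a share of the 84 tiles proportional to its FLOPs."""
    total = sum(layer_flops.values())
    placement, next_tile = {}, 0
    for name, flops in layer_flops.items():
        n = max(1, round(total_tiles * flops / total))   # at least one tile per layer
        placement[name] = list(range(next_tile, min(next_tile + n, total_tiles)))
        next_tile += n
    return placement

# Toy three-layer model with made-up FLOP counts.
model = {"conv1": 2e9, "conv2": 4e9, "fc": 1e9}
plan = map_layers_to_tiles(model)
print({k: len(v) for k, v in plan.items()})  # → {'conv1': 24, 'conv2': 48, 'fc': 12}
```

A real compiler would also place weights in each region's SRAM and configure the mesh links between neighboring regions; this sketch only shows the proportional-sizing step.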

“We will keep the whole network on the chip. Everyone else nibbles at the network and spends time going back and forth” over slower external interconnects often through memory, he said.