New 120-core Machine Learning Processor Shows Early Promise

by Jim Turley

“Nothing matters very much, and few things matter at all.” — Arthur Balfour

One thing that all machine learning developers agree upon: ML requires lots and lots of data. New ML processors have dozens – sometimes hundreds – of processor cores, huge caches, wide buses, and enormous appetites for bandwidth. The secret of ML is not that it’s so radically different from normal computing; it’s that it processes so much more data.

Like the real estate agents’ mantra, the most important features are bandwidth, bandwidth, and bandwidth.

But what if… just suppose… you could reduce the amount of data, and therefore the awesome bandwidth, required to process ML data sets? Instead of making behemoth processors, what if we reduced the workload instead? What if we worked smarter, not harder?

That’s the unique selling proposition of Tenstorrent, a 60-person ML startup that aims to make the smartest – as well as the fastest – ML processors you’ve ever seen. Plus, Tenstorrent’s chips will scale from small IoT devices “at the edge” to massive cloud datacenter systems, all with the same architecture. It’s a daunting task, but then, so is everything in the machine learning world.

Ljubisa Bajic, the CEO and founder of Tenstorrent, is an ex-AMD processor architect, as are a lot of his technical staff. He says the big problem with ML inference is the size of the models. Data sets are just too $^@# big for hardware to reasonably handle. Models like Google’s BERT need about 680 MB; Microsoft’s Turing NLG requires 34 GB, a 50-fold increase over several months. And that’s just the start. Bajic says 1 TB models are on the horizon, with no end to the inflation in sight. It’s gonna be tough for any chip, network, or memory technology to consume, process, and transport that much data.

So, what’s the alternative? Eliminate the white space.

The best way to reduce a workload is to actually reduce the workload. That is, remove the unnecessary, and compress or compact what remains. Just as JPEG image compression eliminates runs of redundant pixels, Tenstorrent says there are massive gains to be had simply by removing unnecessary data. Then you can apply vast hardware resources to what remains.

He gives the example of natural language processing, a common ML task. The BERT algorithm for this is reasonably well understood, and it makes a good benchmark for emerging ML players. But, says Bajic, the problem is that BERT converts words and sentences into fixed-length tokens which it stores in fixed-length structures. This makes for tidy data handling, but it also wastes space – and computing resources. An informal poll of BERT users (it’s open-sourced) revealed that as much as 80% of BERT’s dataset was simply padding. “We see fairly massive amounts of inefficiency,” says Bajic.
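
To see how fixed-length structures can end up mostly empty, here is a minimal Python sketch. The 128-token slot length and the sentence lengths are made-up values for illustration, not figures from Tenstorrent or from BERT itself.

```python
# Illustrative only: MAX_LEN and the sentence lengths below are assumptions,
# not numbers taken from BERT or from Tenstorrent.
MAX_LEN = 128  # assumed fixed sequence length per input


def padding_fraction(token_counts, max_len=MAX_LEN):
    """Return the share of slots in a fixed-length batch that hold padding."""
    total_slots = len(token_counts) * max_len
    # Inputs longer than max_len would be truncated, so cap them.
    used_slots = sum(min(n, max_len) for n in token_counts)
    return 1 - used_slots / total_slots


# Hypothetical token counts for eight short sentences.
lengths = [12, 30, 25, 9, 40, 18, 22, 36]
print(f"{padding_fraction(lengths):.0%} of the batch is padding")
```

With short sentences and a generous slot size, most of the batch is padding, and hardware that processes every slot spends most of its time on nothing.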

One way to fix this (which is not unique to Tenstorrent) is to implement an “early exit” from the processing loop once confidence in the result has passed some threshold. One early exit will typically decrease processing time with virtually no change in accuracy. Adding multiple early exits, as Tenstorrent does, promises even better gains. It’s notable that none of this requires any changes to the model itself or to the dataset. It’s purely an implementation feature that’s invisible to software, like branch prediction or caching.
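
The general idea of multiple early exits can be sketched in a few lines of Python. This is a conceptual toy, not Tenstorrent's mechanism: the per-layer "exit heads," the 0.9 threshold, and the doubling layers are all assumptions made for the example, and the article says the real feature is invisible to software.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_with_early_exit(layers, exit_heads, x, threshold=0.9):
    """Run layers in order and stop once an exit head is confident enough.

    Returns (predicted_class, layers_executed). Both lists must be non-empty
    and of equal length. The threshold is a hypothetical tuning knob.
    """
    probs, depth = None, 0
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        x = layer(x)
        probs = softmax(head(x))
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    return probs.index(max(probs)), depth


# Toy model: each "layer" doubles the activations, each "head" is the identity.
layers = [lambda v: [2 * e for e in v]] * 4
heads = [lambda v: v] * 4
print(run_with_early_exit(layers, heads, [1.0, 0.0]))
```

In this toy, confidence crosses the threshold after the second of four layers, so half the work is skipped. Raising the threshold trades speed back for certainty.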

Like many AI startups, Tenstorrent draws a comparison between its architecture and the human brain. “The brain only consumes about 20 watts,” says Bajic, “about the same as a low-power light bulb.” That’s because most of it’s not being used, and neurons – “a simple processor with branch support and a good network” – are activated only when they’re being used. Furthermore, the brain is good at drawing quick conclusions to avoid unnecessary additional processing.

Tenstorrent’s hardware, then, combines a cluster of small processors with mesh networks that emulate neurons and synapses. Initial chips have 120 processing nodes and two orthogonal toroidal networks that can be extended off-chip to create large chip-to-chip clusters. Ultimately, Tenstorrent envisions rack-sized clusters with hundreds of its chips containing thousands of processors.

What the system doesn’t have is any shared memory, because that’s antithetical to Tenstorrent’s architecture. Shared memory doesn’t scale (at least, not forever) and it requires some form of coherence, a big complication that Tenstorrent wanted to avoid. Instead, processing nodes communicate over the networks.

Each processing node – which the company calls a Tensix – contains five single-issue RISC cores, one “compute engine,” and a network packet processor for communication over the on-chip networks. Each node also has 1 MB of local SRAM. The initial 120-node device therefore has 600 RISC cores, 120 compute engines, and 120 MB of SRAM. The chip also has eight global LPDDR4 memory interfaces.
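
The chip-level totals follow directly from the per-node figures. A short Python check, using only the numbers quoted above (the function name is ours):

```python
def chip_totals(nodes, risc_per_node=5, engines_per_node=1, sram_mb_per_node=1):
    """Scale the per-node resources quoted in the article up to a whole chip."""
    return {
        "risc_cores": nodes * risc_per_node,
        "compute_engines": nodes * engines_per_node,
        "sram_mb": nodes * sram_mb_per_node,
    }


print(chip_totals(120))
```

The same function gives a rough feel for the rack-sized clusters the company envisions: pass a larger node count and the totals scale linearly, since there is no shared memory to account for.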

The five RISC cores within each node are identical, but they don’t all necessarily run the same code at the same time. They’re capable of basic arithmetic and logic operations, and they manage flow control. They also compete for the more-advanced hardware resources of the shared compute engine, which is where matrix, convolution, and vector/SIMD operations are carried out.

Tenstorrent’s chip (codenamed Grayskull) takes a divide-and-conquer approach to inference. Tensors are broken into fixed-length packets, much like Ethernet packets, and distributed among several Tensix processor nodes, one packet per node. Each packet then lands in the node’s incoming SRAM buffer, where it’s unpacked and fed to the compute engine under the direction of the RISC cores. From there, the results are packetized and stored in an outgoing SRAM buffer ready for transmission to the next Tensix node downstream.
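
As a rough illustration of the divide-and-conquer step, the Python sketch below chops a flat tensor into fixed-length packets and hands one packet to each node. The packet length, the zero-filling of the last packet, and the node names are assumptions; the article does not describe Grayskull's actual packet format.

```python
def packetize(values, packet_len, pad=0.0):
    """Split a flat sequence of tensor values into fixed-length packets.

    The final packet is filled out with `pad`; how the real hardware treats
    a partial packet is not stated in the article, so this is an assumption.
    """
    packets = []
    for start in range(0, len(values), packet_len):
        chunk = list(values[start:start + packet_len])
        chunk += [pad] * (packet_len - len(chunk))
        packets.append(chunk)
    return packets


def distribute(packets, node_ids):
    """Assign one packet per node, as the article describes."""
    if len(packets) > len(node_ids):
        raise ValueError("not enough nodes for one packet per node")
    return dict(zip(node_ids, packets))


tensor = [float(i) for i in range(10)]
assignment = distribute(packetize(tensor, packet_len=4), ["n0", "n1", "n2", "n3"])
for node, packet in assignment.items():
    print(node, packet)
```

Ten values in packets of four occupy three nodes, leaving the fourth free for other work.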

Each node is connected to its immediate neighbors via two different network rings, one running north/south and the other east/west. Tenstorrent says that each node can send/receive up to 384 GB/sec, and that this number doesn’t degrade with traffic. A multicast feature allows one node to send its results to multiple recipients, if necessary.
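
The wrap-around wiring of a torus is easy to express with modular arithmetic. The Python sketch below assumes a 10 x 12 arrangement of the 120 nodes; the article gives only the total, so the grid shape here is a guess for illustration.

```python
ROWS, COLS = 10, 12  # assumed layout; the article states only 120 nodes


def torus_neighbors(row, col, rows=ROWS, cols=COLS):
    """Return the four immediate neighbors of a node on a 2D torus.

    One ring runs north/south and the other east/west; the modulo makes
    edge nodes wrap around to the opposite side of the grid.
    """
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }


# A corner node still has four neighbors because both rings wrap around.
print(torus_neighbors(0, 0))
```

Because every node has exactly four neighbors, there are no special edge cases in the fabric, which is part of what lets the same rings extend off-chip.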

Any number of Tensix nodes may participate in any given job, determined by the nature of the task and by Tenstorrent’s compiler. Hardware allocation is determined statically at compile time. Assuming the task doesn’t require all 120 nodes, Grayskull can handle two or more unrelated inference tasks at once. The nearest-neighbor network requires that cooperating nodes be adjacent, however, so physical layout of the task becomes important. This, too, is the compiler’s job.

Not surprisingly, the compiler is also Tenstorrent’s own design. It accepts the semi-standard ML input formats PyTorch and ONNX, which it optimizes, converts to Tensix primitives, and maps to physical on-chip nodes. The optimization is broken into stages, which can be interrupted and manually tweaked if desired. This allows third-party tools to help with parallelization or other tasks, for example.

The output of the compiler isn’t Tensix binary code, per se, but is more akin to pseudocode or bytecode. Each of the five RISC cores within each processing node runs firmware to translate the output of the compiler to actual hardware operations. This layer of abstraction allows Tenstorrent to tweak its hardware without breaking software compatibility. It also, not incidentally, allows developers to run code on non-Tenstorrent hardware. An Intel x86 processor with the appropriate firmware layer, for instance, can potentially run Tenstorrent binaries for testing and profiling.
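
The bytecode-plus-firmware arrangement resembles a small interpreter with swappable backends. The Python sketch below is purely hypothetical: the opcode names, the program format, and the host backend are invented for illustration and are not Tenstorrent's instruction set or tooling.

```python
def matmul(a, b):
    """Plain-Python matrix multiply on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(a):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in a]


# A "firmware layer" maps abstract opcodes to whatever the hardware offers.
# This one targets the host CPU; another could target an accelerator.
HOST_BACKEND = {"MATMUL": matmul, "RELU": relu}


def run(program, backend, tensors):
    """Interpret (opcode, dest, *sources) tuples against a given backend."""
    tensors = dict(tensors)  # leave the caller's inputs untouched
    for opcode, dest, *sources in program:
        tensors[dest] = backend[opcode](*(tensors[s] for s in sources))
    return tensors


program = [("MATMUL", "t0", "x", "w"), ("RELU", "y", "t0")]
inputs = {"x": [[1.0, 2.0]], "w": [[1.0, -1.0], [0.5, 0.0]]}
print(run(program, HOST_BACKEND, inputs)["y"])
```

The program never names a specific machine, so changing the backend table is enough to retarget it. That is the property that lets the hardware change underneath without breaking compiled code.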

Tenstorrent has working silicon of its 120-node Grayskull chip, fabricated in GlobalFoundries’ 12nm process. They’ve found “no fatal bugs so far,” according to Bajic, and are currently characterizing its power, performance, and abilities. Right now, it looks like it cranks out about 368 TOPS (trillions of integer operations per second) in a bus-powered 75W PCIe card. It can also do floating-point operations at one-quarter the throughput (i.e., nearly 100 teraFLOPS). That compares favorably to Nvidia’s Tesla V100 chips, Tenstorrent’s most obvious competitor. If production silicon is as good as the early Grayskull samples, Tenstorrent could become the new masters of the ML universe.
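
For a back-of-the-envelope look at those figures, the Python snippet below works out the floating-point rate and a rough efficiency number. It treats the full 75 W card budget as the power actually drawn, which is an assumption; Tenstorrent was still characterizing power at the time.

```python
INT_TOPS = 368         # integer throughput quoted in the article
CARD_WATTS = 75        # bus-power budget of the PCIe card, assumed fully used
FP_RATIO = 1 / 4       # floating point runs at one-quarter the integer rate

fp_tflops = INT_TOPS * FP_RATIO
tops_per_watt = INT_TOPS / CARD_WATTS

print(f"floating point: {fp_tflops:.0f} TFLOPS")
print(f"efficiency: {tops_per_watt:.1f} TOPS per watt")
```

A quarter of 368 is 92, which is the basis for the "nearly 100 teraFLOPS" figure. If the card draws less than its 75 W budget, the efficiency would be higher than this estimate.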