Written by James Orme, Wed 8 May 2019

Google has assembled thousands of Tensor Processing Units (TPUs) into giant programmable supercomputers and made them available on Google Cloud.

Google first announced its machine learning TPUs at Google I/O in 2016, by which point they had already been running in its own data centres for a year. Every year since, Google has released incremental “v” updates and made the chips available to rent on GCP.

Three years later, the company has used the conference to announce that the custom hardware accelerators have been merged into multi-rack ML supercomputers called Cloud TPU Pods, which are also available to rent as a service.

To be precise, Google has used a “two-dimensional toroidal” mesh network to make multiple racks of TPUs programmable as one colossal AI supercomputer. The company says more than 1,000 TPU chips can be connected by the network.
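The appeal of a two-dimensional toroidal topology is that the grid wraps around at the edges, so every chip has the same number of neighbours and no chip sits on a boundary. A minimal sketch of that wrap-around connectivity (purely illustrative; this is not Google's actual interconnect code, and the grid dimensions are made up):

```python
def torus_neighbours(x, y, width, height):
    """Return the four wrap-around neighbours of the chip at grid (x, y).

    In a 2D torus, moving off one edge wraps to the opposite edge,
    which the modulo arithmetic below captures.
    """
    return [
        ((x - 1) % width, y),   # left (wraps to the right edge)
        ((x + 1) % width, y),   # right
        (x, (y - 1) % height),  # up (wraps to the bottom edge)
        (x, (y + 1) % height),  # down
    ]

# Even a "corner" chip on a hypothetical 32 x 32 grid has four neighbours:
print(torus_neighbours(0, 0, 32, 32))  # [(31, 0), (1, 0), (0, 31), (0, 1)]
```

The uniform neighbour count is what lets collective operations such as all-reduce run at the same cost from every chip's point of view.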

Google claims each TPU v3 pod can deliver more than 100 petaFLOPS of computing power, which puts them amongst the world’s top five supercomputers in terms of raw mathematical operations per second. Google added the caveat, however, that the pods operate at a lower numerical precision, making them more appropriate for workloads such as superfast speech recognition or image classification, which do not need high levels of precision.
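The "lower numerical precision" trade-off can be seen in the bfloat16 format that TPUs use for matrix maths: it keeps float32's 8-bit exponent (and therefore its range) but only 7 mantissa bits, so fine detail in a value is discarded. A rough sketch of that truncation using only the standard library (an illustration of the format, not how TPU hardware rounds):

```python
import struct

def to_bfloat16(value):
    """Truncate a Python float to bfloat16 precision, returned as a float.

    bfloat16 is simply the top 16 bits of a float32, so zeroing the low
    16 mantissa bits simulates the precision loss.
    """
    bits = struct.unpack('>I', struct.pack('>f', value))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

# Pi survives with only 2-3 decimal digits of precision:
print(to_bfloat16(3.141592653589793))  # 3.140625
```

For training neural networks that coarseness is usually tolerable, which is why the pods can trade precision for raw throughput.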

Google says the pods can reduce ML workload completion times from weeks to hours. For instance, Recursion Pharmaceuticals, which iteratively tests the viability of synthesized molecules to treat rare illnesses, reduced training times from 24 hours on its on-premises GPU cluster to 15 minutes on a Cloud TPU Pod.

Google recommends ML teams consider the pods if they require:

- Shorter time to insights: iterate faster while training large ML models
- Higher accuracy: train more accurate models using larger datasets (millions of labelled examples; terabytes or petabytes of data)
- Frequent model updates: retrain a model daily or weekly as new data comes in
- Rapid prototyping: start quickly with Google’s optimized, open-source reference models in image segmentation, object detection, language processing, and other major application domains

Cloud TPU v2 Pods and Cloud TPU v3 Pods are now available in public beta, either as full pods or as smaller “slices”.