Four years ago, Google started to see the real potential for deploying neural networks to support a large number of new services. During that time it was also clear that, given the existing hardware, if people did voice searches for three minutes per day or dictated to their phone for short periods, Google would have to double the number of datacenters just to run machine learning models.

The need for a new architectural approach was clear, Google distinguished hardware engineer, Norman Jouppi, tells The Next Platform, but it required some radical thinking. As it turns out, that’s exactly what he is known for. One of the chief architects of the MIPS processor, Jouppi has pioneered new technologies in memory systems and is one of the most recognized names in microprocessor design. When he joined Google over three years ago, there were several options on the table for an inference chip to churn services out from models trained on Google’s CPU and GPU hybrid machines for deep learning but ultimately Jouppi says he never excepted to return back to what is essentially a CISC device.

We are, of course, talking about Google’s Tensor Processing Unit (TPU), which has not been described in much detail or benchmarked thoroughly until this week. Today, Google released an exhaustive comparison of the TPU’s performance and efficiencies compared with Haswell CPUs and Nvidia Tesla K80 GPUs. We will cover that in more detail in a separate article so we can devote time to an in-depth exploration of just what’s inside the Google TPU to give it such a leg up on other hardware for deep learning inference. You can take a look at the full paper, which was just released, and read on for what we were able to glean from Jouppi that the paper doesn’t reveal.

Jouppi says that the hardware engineering team did look to FPGAs to solve the problem of cheap, efficient, and high performance inference early on before shifting to a custom ASIC. “The fundamental bargain people make with FPGAs is that they want something that is easy to change but because of the programmability and other hurdles, compared with an ASIC, there ends up being a big difference in performance and performance per watt,” he explains. “The TPU is programmable like a CPU or GPU. It isn’t designed for just one neural network model; it executes CISC instructions on many networks (convolutional, LSTM models, and large, fully connected models). So it is still programmable, but uses a matrix as a primitive instead of a vector or scalar.”

The TPU is not necessarily a complex piece of hardware and looks far more like a signal processing engine for radar applications than a standard X86-derived architecture. It is also “closer in spirit to a floating point unit coprocessor than a GPU,” despite its multitude of matrix multiplication units, Jouppi says, noting that the TPU does not have any stored program; it simply executes instructions sent from the host.

The DRAM on the TPU is operated as one unit in parallel because of the need to fetch so many weights to feed to the matrix multiplication unit (on the order of 64,000 for a sense of throughput). We asked Jouppi how the TPU scales and while it’s not clear what the upper limit is, he says anytime you’re working with an accelerator that has host software there’s going to be a bottleneck. We are still not sure how these are lashed together and to what ceiling, but we can imagine, given the need to have consistent, cheap hardware to back inference at scale, it is probably not an exotic RDMA/NVlink approach or anything in that ballpark.

There are two memories for the TPU; an external DRAM that is used for parameters in the model. Those come in, are loaded into the matrix multiply unit from the top. And at the same time, it is possible to load activations (or output from the “neurons”) from the left. Those go into the matrix unit in a systolic manner to generate the matrix multiplies—and it can do 64,000 of these accumulates per cycle.”

It might be easy to say that Google could have deployed some new tricks and technologies to speed the performance and efficiency of the TPU. One obvious choice would be using something like High Bandwidth Memory or Hybrid Memory Cube. However, the problem at Google’s scale is keeping the distributed hardware consistent. “It needs to be distributed—if you do a voice search on your phone in Singapore, it needs to happen in that datacenter—we need something cheap and low power. Going to something like HBM for an inference chip might be a little extreme, but is a different story for training.”

Of course, the true meat of this paper is in the comparisons.

While Google has been thorough in its testing, pitting its TPU against both CPUs and GPUs, given that most of the machine learning customer base (with the notable exception of Facebook) uses CPUs for processing inferences, the comparisons to the Intel “Haswell” Xeon E5 v3 processors is no doubt the one that is most appropriate and this is one that shows without a doubt that the TPU blows away Xeon chips on multiple dimensions when it comes to inference workloads. And that explains why Google went to the trouble of creating its own TPU chip when it would have by far preferred to keep running machine learning inference on its vast X86 server farms.

In Google’s tests, a Haswell Xeon E5-2699 v3 processor with 18 cores running at 2.3 GHz using 64-bit floating point math units was able to handle 1.3 Tera Operations Per Second (TOPS) and delivered 51 GB/sec of memory bandwidth; the Haswell chip consumed 145 watts and its system (which had 256 GB of memory) consumed 455 watts when it was busy.

The TPU, by comparison, used 8-bit integer math and access to 256 GB of host memory plus 32 GB of its own memory was able to deliver 34 GB/sec of memory bandwidth on the card and process 92 TOPS – a factor of 71X more throughput on inferences, and in a 384 watt thermal envelope for the server that hosted the TPU.

Google also examined the throughput in inferences per second that the CPUs, GPUs, and TPUs could handle with different inference batch sizes and with a 7 milliseconds.

With a small batch size of 16 and a response time near 7 milliseconds for the 99th percentile of transactions, the Haswell CPU delivered 5,482 inferences per second (IPS), which was 42 percent of its maximum throughout (13,194 IPS at a batch size of 64) when response times were allowed to go as long as they wanted and expanded to 21.3 millisecond of the 99th percentile of transactions.The TPU, by contrast, could do batch sizes of 200 and still meet the 7 millisecond ceiling and delivered 225,000 IPS running the inference benchmark, which was 80 percent of its peak performance with a batch size of 250 and the 99th percentile response coming in at 10 milliseconds.

We realize that Google tested with a relatively old Haswell Xeon, and that with Broadwell Xeon E5 v4 processors, the instructions per clock (IPC) of the core went up by around 5 percent due to architectural changes and with the future “Skylake” Xeon E5 v5 processors due this summer, the IPC might rise by 9 percent or even 10 percent if history is any guide. Increase the core count from the 18 with the Haswell to the 28 in the Skylake, and the aggregate throughput of a Xeon on inference might rise by 80 percent. But that does not come close to closing inference gap with the TPU.

Recall again. This is an inference chip, hence the GPU compares we see in the paper are not the knocks against GPUs in general it may seem–they still excel at training models at Google alongside CPUs. The real test for CPU makers (okay, Intel) is to provide something with great inference performance at cost and efficiency figures that can at least come close to a custom ASIC–something most shops can’t do.

“A lot of the architectures were developed a long time ago. Even if you look at AVX extensions for X86—as Moore’s Law scales these have gotten wider but they are still one-dimensional. So when we were looking at FPGAs as one idea and saw that wouldn’t work we started thinking about the amount of silicon we could get out of 28nm—that allows for things that were unheard of back in the day when the vector was a high end machine. Now, we can take it to the next level with a tensor processing unit with the matrices at the heart—it is exciting because we don’t get to change often in this industry…From scalars to vectors, and now to two-dimensional tensors.”

“We did a very fast chip design, it was really quite remarkable. We started shipping the first silicon with no bug fixes or mask changes. Considering we were hiring the team as we were building the chip, then hiring RTL people and rushing to hire design verification people, it was hectic,” Jouppi says.

The paper is being presented by another pioneer in computing–David Patterson, the inventor of RAID disk protection and “father of RISC”… quite a stellar (and large) team on this.