Gentlemen (and women), start your inference engines.

One of the world’s largest buyers of systems is entering evaluation mode for deep learning accelerators to speed services based on trained models. We now have a pretty decent sense of what they are looking for and, surprise, surprise, it is wrapped in an incredibly tiny power envelope.

We had the opportunity today to talk with Facebook’s VP of Infrastructure, Jason Taylor, about how the hyperscaler is moving beyond its CPU-only approach to AI inference and looking to the growing plethora of choices, from chipmaking giants like Intel and Nvidia to smaller offerings from startups and those toiling at the embedded edge.

The takeaway is that at scale, CPUs are not up to the task, a tough thing to admit for Facebook, which takes great pride in its limited set of SKUs across all of its workloads. As we wrote back in March 2016, there are six types of servers for specific workloads in their datacenters, which means that once the company picks an architecture, it builds around it for the long haul. Taylor says the same will be true with inference. For now, they are experimenting with many new devices, trying to understand the right balance of memory and compute and how that balance translates into efficiency. But in four to five years, he says, they will have picked one, perhaps two, architectures that fit the bill at scale and narrowed that down to a single SKU, adding perhaps a seventh server type to the lineup.

Facebook has been running deep learning workloads for years now, but as Taylor tells The Next Platform, “We’ve done all inference on just traditional CPUs and I think what we expect is that the amount of inference we’ll be doing in the future is going to be such that we will want more hardware acceleration. It’s been a persistently growing demand on our infrastructure and now is the right time for Facebook to start preparing for having hardware accelerators in our datacenters.”

The biggest question is what architecture will best fit the bill. The answer starts with what types of deep learning frameworks and workloads are most important. Taylor says that for now, convolutional and recurrent neural networks are critical but that sparse neural networks are growing in importance. The first two are already well understood and architected for in training and inference hardware, but the latter means choosing an architecture that can handle dense matrix multiplication in an ultra-power-efficient package.

“The idea early on was that there had to be a lot of specialization in a device for RNNs or sparse neural networks and that this required specialized matrix multiply operations. But where the industry is at the moment, just being able to do general matrix multiply—and most problems we have can be mapped to this pretty well—is what we need.” This is the case for convolutions and recurrent nets, but sparse networks mean making certain decisions about how much memory (read: heat and power) is required, and how to balance that with optimizations in software and in the networks themselves. We will get to the compiler side of this story in a bit, but Taylor says that power is the top focus of their evaluations.
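To make Taylor’s point concrete, here is one standard way a convolution can be mapped onto a general matrix multiply: the so-called im2col lowering, where input patches are unrolled into rows of a matrix so the whole convolution becomes a single GEMM. This is a textbook illustration of the mapping, not a claim about how any particular accelerator implements it.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll every kh x kw patch of a 2D input into a row of a matrix."""
    h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_as_gemm(x, kernel):
    """2D 'valid' convolution (cross-correlation) expressed as one matmul."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return (im2col(x, kh, kw) @ kernel.ravel()).reshape(out_h, out_w)

x = np.arange(16.0).reshape(4, 4)
k = np.ones((2, 2))
print(conv2d_as_gemm(x, k))  # each output is the sum of a 2x2 window
```

The trade-off is extra memory for the unrolled patch matrix, which is exactly the kind of memory-versus-compute balance Taylor describes.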

“The most surprising thing about our focus on neural networks and hardware inference acceleration is that we are very aware of power requirements. We like to see chips that are smaller and fit into very small power footprints. Having a 10 watt chip is way more attractive to us than having a 100 watt chip for inference. We think it works well to have many of these chips on a system rather than one big chip. Each of these chips is more powerful than a CPU in doing matrix math.”

Now, that is interesting, given that we have seen ultra-low power MAC-dense inference chips as well as others like the Turing-based part Nvidia announced just yesterday. In other words, less is more, especially if it is efficient for Facebook to sling many of these chips together and strike efficiency balances that way. But an inference chip at 75 watts? We’ll be writing more about this later in the week.

“The hardware industry is no stranger to seeing periods of fragmentation, then a recognition of the most important workloads for customers. The tough thing now is that the inference workload is so unlike where most CPUs are targeted that even if you do see those resources on chip, it is more likely to be a system on chip approach, perhaps like a SIMD architecture that accelerates part of a workload,” Taylor says. “But we’ll see how it works out and how much demand there is. At least for the next five years, maybe longer, we’ll be seeing these processors as separate units. When you package a chip, it’s hard to get everything right and once you specialize for one group of customers or another, you start taking the volumes out and cost becomes an issue. My guess is that inference will be separate chips for quite a while.”

With this idea that Facebook sees separate inference chips in mind, those many deep learning architectures that promise both training and inference on the same device might find themselves left out of the hyperscale deep learning game. It all comes back to balance. These companies require powerful training but low power inference. Putting everything on the same chip for both comes with a lot of extra inference overhead power-wise.

“There are many challenges with machine learning training. Training is a much wider systems problem; inference, I expect, will be a low power chip that will be a resource we can make available next to our primary compute servers. They will be on completely different sides of the datacenter,” Taylor confirms.

The ability to work with a number of new architectures for inference means bringing hardware partners into the fold and pulling their devices and stacks into the way Facebook designs its operations. The key to this is the Glow compiler, which Facebook described in detail today. The low-level details can be found in this detailed paper, but generally speaking, Taylor says having an open source point to collect around will allow them to work more objectively with new devices without getting mired in the complexity of different compilers for different hardware devices.

Among the hardware partners are Qualcomm, Marvell, Intel, and Esperanto, as well as Cadence. Notice who is missing from this list? Here’s a hint—they make GPUs and their CEO really likes leather jackets. A lot.

The Glow compiler effort is focused on machine learning inference, a workload that is garnering a wide array of new devices from established and new companies alike, all of whom are pushing efficiency and looking for ways to carve down the memory and heat overhead.

For now, the company says that as they “look to proliferate deep learning frameworks such as PyTorch, they need compilers that provide a variety of optimizations to accelerate inference performance on a range of hardware platforms to support a growing number of AI and machine learning industry needs.” The key to this effort is the Glow (short for Graph-Lowering) compiler, an open source compiler that they say allows hardware designers to quickly ramp new machine learning chips by removing the weight of one of the most difficult elements, the compiler itself.

Hardware accelerators are specialized to solve the task of machine learning execution. They typically contain a large number of execution units, on-chip memory banks, and application-specific circuits that make the execution of ML workloads very efficient. To execute machine learning programs on specialized hardware, compilers are used to orchestrate the different parts and make them work together. Machine learning frameworks such as PyTorch rely on compilers to enable the efficient use of acceleration hardware.

As the Glow development team notes, “Each model has a different memory and processor configuration. Glow is designed to target a wide range of hardware accelerators. The hardware independent parts of the compiler focus on math-related optimizations that are not tied to a specific hardware model. In addition to the target-independent optimizations, Glow contains a number of utilities and building blocks that can be configured to support multiple hardware targets.”
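To give a flavor of the target-independent, math-level rewriting the team describes, here is a toy sketch of algebraic simplification over a tiny expression graph. This is a conceptual illustration only; it is not Glow’s actual IR, pass pipeline, or API, and the node representation is invented for the example.

```python
# Toy graph IR: each node is a tuple (op, *args), where args are nested
# nodes, numeric constants, or symbolic names. A graph-lowering compiler
# runs hardware-independent rewrites like these before targeting a device.

def simplify(node):
    """Recursively apply simple algebraic rewrites to an expression graph."""
    op, *args = node
    args = [simplify(a) if isinstance(a, tuple) else a for a in args]
    if op == "add" and all(isinstance(a, (int, float)) for a in args):
        return sum(args)                      # constant folding: 2 + 3 -> 5
    if op == "mul" and 0 in args:             # x * 0 -> 0
        return 0
    if op == "mul" and 1 in args:             # x * 1 -> x
        rest = [a for a in args if a != 1]
        return rest[0] if len(rest) == 1 else ("mul", *rest)
    return (op, *args)

expr = ("mul", ("add", 2, 3), 1)   # (2 + 3) * 1
print(simplify(expr))
```

A real compiler applies many such rewrites (plus quantization, operator fusion, and layout transforms) over a far richer IR, but the shape of the problem is the same: the rewrites depend on the math, not on any one chip.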

For example, the compiler’s memory allocator is used to generate efficient code for a variety of hardware accelerators, each with a different memory configuration. The building blocks also include a powerful linear algebra optimizer, an extensive test suite, a CPU-based reference implementation for testing the accuracy of the hardware, an instruction scheduler, and more.
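To illustrate what a compiler’s memory allocator does for an accelerator with a small, fixed on-chip memory, here is a minimal greedy first-fit allocator over tensor lifetimes. It is a hypothetical sketch, not Glow’s actual allocator: the point is only that buffers whose live ranges do not overlap can share the same addresses.

```python
def allocate(tensors):
    """Greedy first-fit static allocation.

    tensors: list of (name, size, start, end) where [start, end] is the
    tensor's live interval in instruction order. Returns name -> offset,
    reusing memory for buffers whose lifetimes do not overlap.
    """
    placed = []   # (offset, size, start, end) of already-placed buffers
    offsets = {}
    for name, size, start, end in sorted(tensors, key=lambda t: t[2]):
        offset = 0
        # Scan placed buffers in address order; slide past any buffer that
        # overlaps both in lifetime and in the candidate address range.
        for o, s, ps, pe in sorted(placed):
            if ps <= end and start <= pe:                 # lifetimes overlap
                if o < offset + size and offset < o + s:  # addresses overlap
                    offset = o + s
        placed.append((offset, size, start, end))
        offsets[name] = offset
    return offsets

# "a" and "b" are live at the same time, so they get distinct offsets;
# "c" starts after "a" dies and can reuse "a"'s space.
tensors = [("a", 64, 0, 2), ("b", 32, 1, 3), ("c", 64, 3, 4)]
print(allocate(tensors))
```

Here the three buffers fit in 96 bytes instead of the 160 a naive layout would need, which is exactly the kind of reuse that matters when on-chip SRAM, not compute, is the scarce resource.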

We have been pre-briefed on several inference chips coming out next week around AI hardware events and will do the homework, comparing power consumption and capability across the new entrants and the existing inference options with all of this in mind.

So, rev up, inference hardware makers. There is a compiler to build around now, and a better sense of what Facebook needs at scale to serve its base. Exciting times ahead for hardware—and we’re always happy about that.