On Monday night I described AWS Graviton, the general-purpose, AWS-developed 64-bit Arm server processor that powers the EC2 A1 instance family. The five members of the A1 instance family target scale-out workloads such as web servers, caching fleets, and development workloads. This is the first general-purpose processor that has been designed, developed, and deployed by AWS. Today we’ll look at another AWS-developed processor, the AWS Inferentia Application Specific Integrated Circuit (ASIC). Rather than a general-purpose processor like Graviton, this part focuses on machine learning inference. Why develop a workload-specialized processor at all when, by definition, general-purpose processors can support a wider class of workloads?

It’s true that general-purpose processors have dominated workloads for many years because they have very high volume and, consequently, much lower costs. In the past, these lower costs tended to swallow the advantage of hardware specialization. When you have only a handful of servers running a specific workload, it’s hard to economically justify hardware optimization.

Outside the world of compute, we have seen hardware specialization win by a wide margin in networking. Network packet processing is highly specialized and networking protocols change infrequently. The volumes are high enough that specialization is highly economic and, as a consequence, most network packet processing is done using specialized Application Specific Integrated Circuits (ASICs). Most routers, regardless of source, are built upon specialized ASICs. AWS uses network-specialized ASICs in all routers, and every server includes at least one, and often more. AWS custom ASICs power the Nitro System, handling network virtualization, packet processing, and some storage operations, as well as supporting specialized security features.

Hardware specialization can improve latency, price/performance, and power/performance by as much as ten times and yet, over the years, most compute workloads have stubbornly stayed on general-purpose processors. Typically each customer has only a handful of a given server type, so hardware specialization usually made little sense. But the cloud changes all that. In a successful and broadly used cloud, even “rare” workloads can number in the thousands to tens of thousands.

Whereas in the past it was nearly impossible for an enterprise to financially justify hardware specialization in all but fairly exotic workloads, in the cloud there are thousands to possibly tens of thousands of even fairly rare workloads. Suddenly, not only is it possible to use hardware optimized for a specific workload type, but it would be crazy not to. In many cases it can deliver an order of magnitude of cost savings, consume as little as 1/10th the power, and these specialized solutions can allow you to give your customers better service at lower latency. Hardware specialization is the future.

Believing that hardware specialization is going to be a big part of server-side computing going forward, Amazon has had a custom ASIC team focused on AWS since early 2015 and, prior to that, we worked with partners to build specialized solutions. Two years ago at re:Invent 2016, I showed the AWS custom ASIC that has been installed in all AWS servers for many years (Tuesday Night Live with James Hamilton). Even though this is a very specialized ASIC, we install more than a million of these ASICs annually and that number continues to accelerate. In the server world, it’s actually a fairly high-volume ASIC.

I’ve long predicted that machine learning workloads will require more server resources than all current forms of server computing combined. The customer value of machine learning applies to just about every domain and the potential gains are massive. There are many frequently discussed workloads like autonomous driving that require vast resources, but machine learning has immediate applicability in just about every business. It applies to customer service, insurance, finance, heating and cooling, and manufacturing. It’s rare that a technology can have such broad application and, when the gains are this big, it’s a race for most businesses. Those that apply machine learning first and more deeply can serve customers more effectively and more economically. The pace of innovation is incredibly fast, and research scientist and developer skills are in high demand. At AWS, we’re focused on making it easy to deploy machine learning quickly while at the same time driving down costs, making it more economical for more workloads to use machine learning.

Clearly these are still early days, but it does appear to be working. More machine learning is happening on AWS than on any other cloud computing platform, with over 10,000 active machine learning developers and twice the customer references of our nearest competitor. We support the frameworks customers are using, and have AWS hardware-optimized versions of the important ML frameworks including MXNet, PyTorch, and TensorFlow. In fact, 85% of cloud-hosted TensorFlow workloads are running on AWS.

Machine learning training can be very resource intensive and, of course, you can’t deploy a machine learning system until the model is trained. Consequently, training gets a lot of focus and it has to be a big part of every machine learning discussion. But inference is where the work actually gets done. This is where speech is recognized, text is translated, object recognition in video occurs, manufacturing defects are found, and cars get driven. Inference is where the value of ML is delivered. For example, inference is what powers dozens of AWS ML services like Amazon Rekognition Image and Video, Lex, Polly, Comprehend, Translate, Transcribe, and Amazon SageMaker Hosting.

The AWS P3 instance family, delivering up to 8 NVIDIA Tesla V100 GPUs, has been remarkably popular with customers in many domains, but especially in high performance computing and machine learning. As popular as it is, it’s easy to forget that most machine learning is still supported on general-purpose processors. It’s likely that most, if not all, of the 175 different EC2 instances offered by AWS are being used by customers right now to deliver machine learning inference results. Today we expand on what is already the industry’s broadest instance offering by announcing Inferentia.

Inferentia offers scalable performance from 32 TOPS to 512 TOPS at INT8, and our focus is on the high-scale deployments common in machine learning inference, where costs really matter. We support near-linear scale-out, and build on high-volume technologies like DRAM rather than HBM. We offer INT8 for best performance, but also support mixed-precision FP16 and bfloat16 for compatibility, so that customers are not forced to go through the effort of quantizing neural networks that have been trained in FP16, FP32, or bfloat16, and don’t need to compromise on which data type to use for each specific workload. Working with the wider Amazon and AWS machine learning service teams like Amazon Go, Alexa, Rekognition, and SageMaker helps our Inferentia hardware and software meet a wide spectrum of inference use cases and state-of-the-art neural networks. We support ONNX (https://onnx.ai/) as the industry’s most commonly used exchange format for neural networks, and interface natively with the three popular frameworks: MXNet, PyTorch, and TensorFlow, providing customers with choices.
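
To make the quantization effort mentioned above a little more concrete, here is a minimal Python sketch of symmetric per-tensor INT8 quantization, one common textbook scheme. It is an illustration only and is not how Inferentia or any AWS toolchain quantizes models; the function names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (illustrative assumption, not an AWS API).

    Maps each float to an integer in [-127, 127] using a single shared scale.
    """
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, as they might come out of FP32 training.
weights = [0.02, -1.27, 0.5, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# The rounding error per value is bounded by roughly half of one scale step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The sketch shows why INT8 is attractive and why it takes effort: each value shrinks to a single byte, but the model owner has to choose a scale and accept a small rounding error on every weight. Hardware that also accepts FP16 and bfloat16 lets a model skip this step.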

The largest and most used machine learning platform in the cloud just got another important tool to reduce the cost of deploying machine learning inference at scale.