Our customers are taking to Machine Learning in a big way. They are running many different types of workloads, including object detection, speech recognition, natural language processing, personalization, and fraud detection. When running on large-scale production workloads, it is essential that they can perform inferencing as quickly and as cost-effectively as possible. According to what they have told us, inferencing can account for up to 90% of the cost of their machine learning work.

New Inf1 Instances

Today we are launching Inf1 instances in four sizes. These instances are powered by AWS Inferentia chips, and are designed to provide you with fast, low-latency inferencing.

AWS Inferentia chips are designed to accelerate the inferencing process. Each chip can deliver the following performance:

64 teraOPS on 16-bit floating point ( FP16 and BF16 ) and mixed-precision data.

and ) and mixed-precision data. 128 teraOPS on 8-bit integer (INT8) data.

The chips also include a high-speed interconnect, and lots of memory. With 16 chips on the largest instance, your new and existing TensorFlow, PyTorch, and MxNet inferencing workloads can benefit from over 2 petaOPS of inferencing power. When compared to the G4 instances, the Inf1 instances offer up to 3x the inferencing throughput, and up to 40% lower cost per inference.

Here are the sizes and specs:

Instance Name

Inferentia Chips

vCPUs RAM EBS Bandwidth Network Bandwidth inf1.xlarge 1 4 8 GiB Up to 3.5 Gbps Up to 25 Gbps inf1.2xlarge 1 8 16 GiB Up to 3.5 Gbps Up to 25 Gbps inf1.6xlarge 4 24 48 GiB 3.5 Gbps 25 Gbps inf1.24xlarge 16 96 192 GiB 14 Gbps 100 Gbps

The instances make use of custom Second Generation Intel® Xeon® Scalable (Cascade Lake) processors, and are available in On-Demand, Spot, and Reserved Instance form, or as part of a Savings Plan in the US East (N. Virginia) and US West (Oregon) Regions. You can launch the instances directly, and they will also be available soon through Amazon SageMaker and Amazon ECS, and Amazon Elastic Kubernetes Service.

Using Inf1 Instances

Amazon Deep Learning AMIs have been updated and contain versions of TensorFlow and MxNet that have been optimized for use in Inf1 instances, with PyTorch coming very soon. The AMIs contain the new AWS Neuron SDK, which contains commands to compile, optimize, and execute your ML models on the Inferentia chip. You can also include the SDK in your own AMIs and images.

You can build and train your model on a GPU instance such as a P3 or P3dn, and then move it to an Inf1 instance for production use. You can use a model natively trained in FP16, or you can use models that have been trained to 32 bits of precision and have AWS Neuron automatically convert them to BF16 form. Large models, such as those for language translation or natural language processing, can be split across multiple Inferentia chips in order to reduce latency.

The AWS Neuron SDK also allows you to assign models to Neuron Compute Groups, and to run them in parallel. This allows you to maximize hardware utilization and to use multiple models as part of Neuron Core Pipeline mode, taking advantage of the large on-chip cache on each Inferentia chip. Be sure to read the AWS Neuron SDK Tutorials to learn more!

— Jeff;