Do you remember the movie “Ex Machina”? The most captivating part of the movie is when Ava, an AI robot, turns out to be more self-aware and deceptive than her creator ever imagined. Movies aside, we can’t say when we are going to witness that in reality, but we are witnessing the unfolding of a golden era: the era of AI (artificial intelligence). And it is all thanks to ML (machine learning).

Debates such as “Will AI take over from humans?” set the backdrop for this golden era. Big names in the tech industry such as Microsoft, Google, and Facebook have already started rolling out AI-enabled products, which, if adopted, stand to bring more value to the table.

Compute Power is the key for ML

ML-powered products typically depend on the compute power at their disposal. This has become a rule of thumb in the ML and AI domain: the more cutting-edge compute resources are available, the easier it is to work on ML and AI projects. An ML practitioner may wait hours, days, or even months for a model to train, and much of that variation comes down to the compute power available.

So one thing is clear: the future of advanced self-learning technologies such as ML and AI depends on the focused development of dedicated, purpose-built hardware chips capable of delivering the compute power such models require. Notably, Nvidia and Intel manufacture chips for AI-powered products, and the tech giants are their marquee customers.

But something unexpected happened in November 2018: Amazon announced that it would build its own machine learning chip, called Amazon Inferentia.

What makes Amazon Inferentia chips so important?

ML engineers, AI scientists, and cloud evangelists are all asking a ton of questions about Amazon Inferentia. To put everything into perspective, we need to step into the machine learning space.

Typically, any machine learning project that turns into a product or service involves two phases: training and inference.

Training Phase

Training, as the name suggests, is the process of feeding a machine the required data so that it learns the patterns in that data set. It is largely a one-time process that makes the machine smarter as it fits a model, built on mathematical functions, to the data. The training phase is comparable to a classroom scenario: a professor teaching a particular topic to his or her students. The professor is key at this stage.

Inference Phase

Once trained, the machine is ready for the inference phase. How good an ML system is can only be judged by how the “trained” system responds in the inference phase. Unlike the training phase, it is not a one-time process; in fact, millions of people could be using the trained model at the same time. To extend the analogy, the inference phase is like a student applying what they have learned to real-world situations. The students are key at this stage.
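To make the split between the two phases concrete, here is a minimal sketch in plain Python. It is a toy, not how production models are built: the data set (points on the line y = 3x + 1), the function names `train` and `infer`, and the learning rate are all assumptions made up for illustration. The point is only that `train` runs once and is expensive, while `infer` is cheap and gets called over and over.

```python
# Toy illustration: train once, then run inference many times.

def train(samples, epochs=5000, lr=0.01):
    """Fit y = w*x + b by gradient descent (the expensive, infrequent phase)."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def infer(model, x):
    """Apply the frozen parameters to a new input (the cheap, frequent phase)."""
    w, b = model
    return w * x + b


# Hypothetical training data drawn from y = 3x + 1.
training_data = [(x, 3 * x + 1) for x in range(10)]

model = train(training_data)                             # done once
predictions = [infer(model, x) for x in (10, 20, 30)]    # done for every request
print(predictions)
```

A chip like Inferentia targets the second function: the parameters are already fixed, and what matters is how quickly and cheaply each call returns.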

Amazon has always focused on owning the entire product, even if that means building from scratch. For a long time, Amazon Web Services (AWS) had been using chips manufactured by Nvidia and Intel. At re:Invent in November 2018, AWS announced a new chip dedicated to the inference phase: Amazon Inferentia. The Inferentia-powered Amazon EC2 Inf1 instances followed at re:Invent 2019.

Deep dive into Amazon Inferentia

The end of the last decade saw massive demand for deep learning acceleration across a wide range of applications. Dynamic pricing, image search, personalized search recommendations, automated customer support, and many other applications use ML, and their number will inevitably increase in the coming years. The challenge with ML is that it is complex and expensive, and that infrastructure optimized to execute ML algorithms is lacking.

In addition, Amazon keeps a close eye on its arch-rivals. Google announced its first custom machine learning chip, the Tensor Processing Unit (TPU), in 2016, and currently offers third-generation TPUs as a cloud service. So building a custom chip looks like an obvious move for Amazon, given the resources and technology at the company’s disposal.

Meet the creator of Amazon Inferentia

Amazon acquired Annapurna Labs, an Israeli start-up, in 2015. Engineers from Amazon and Annapurna Labs built the Arm-based Graviton processor and the Amazon Inferentia chip.

Technical Specifications

Each Amazon Inferentia chip consists of 4 Neuron Cores. Each Neuron Core implements a “high-performance systolic array matrix multiply engine” (fancy words for interconnected hardware units that perform a specific operation with low response time).

As per the technical definition, “In parallel computer architectures, a systolic array is a homogeneous network of tightly coupled data processing units (DPUs) called cells or nodes. Each node or DPU independently computes a partial result as a function of the data received from its upstream neighbors, stores the result within itself and passes it downstream.”
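The definition above is easier to follow with a small simulation. The sketch below models a generic, textbook-style "output-stationary" systolic array for matrix multiplication in plain Python. It is an assumption-laden illustration, not a description of how a Neuron Core is actually wired: the function name `systolic_matmul` and the register layout are invented here. Each cell receives one value from its left neighbor and one from the neighbor above, multiplies them, adds the product to its own stored partial result, and passes both values downstream on the next cycle.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of cells."""
    n, k, m = len(A), len(A[0]), len(B[0])
    acc = [[0] * m for _ in range(n)]    # partial result stored inside each cell
    a_reg = [[0] * m for _ in range(n)]  # value each cell received from the left
    b_reg = [[0] * m for _ in range(n)]  # value each cell received from above

    # The last operands reach the bottom-right cell after n + m + k - 2 cycles.
    for t in range(n + m + k - 2):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    # Left edge: row i of A is fed in, delayed by i cycles.
                    s = t - i
                    new_a[i][j] = A[i][s] if 0 <= s < k else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]  # passed on by the left neighbor
                if i == 0:
                    # Top edge: column j of B is fed in, delayed by j cycles.
                    s = t - j
                    new_b[i][j] = B[s][j] if 0 <= s < k else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]  # passed on by the upper neighbor
        a_reg, b_reg = new_a, new_b

        # Every cell does one multiply-accumulate per cycle, all in parallel.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Notice that no cell ever reads from a shared memory during the computation. Operands only hop between neighbors, which is why this layout suits the matrix multiplications that dominate neural network inference.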

Benefits of AWS Inferentia

High performance

Each chip, with its 4 Neuron Cores, can perform up to 128 TOPS (trillions of operations per second). It supports the BF16, INT8, and FP16 data types. Interestingly, AWS Inferentia can take a model trained in 32-bit precision and run it at the speed of a 16-bit model using BFloat16.
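Why can a 32-bit model be moved to BFloat16 so easily? BF16 keeps the sign bit and all 8 exponent bits of a 32-bit float and only shortens the mantissa from 23 bits to 7, so a value keeps its range and loses only precision. The sketch below shows that conversion using Python's standard `struct` module. The helper names are made up for this example, and the round-to-nearest-even step is a common convention assumed here, not a statement about what Inferentia or the Neuron compiler does internally. NaN inputs are not handled.

```python
import struct


def float32_to_bf16_bits(x):
    """Return the 16-bit BF16 pattern for x (round to nearest even, NaN not handled)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit float pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)          # ties go to the even value
    return ((bits + rounding_bias) >> 16) & 0xFFFF       # keep the top 16 bits


def bf16_bits_to_float32(b):
    """Expand a BF16 pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]


for value in (1.0, 3.14159, 1e38):
    bf16 = float32_to_bf16_bits(value)
    print(value, hex(bf16), bf16_bits_to_float32(bf16))
```

Running it shows that 1.0 survives unchanged, 3.14159 comes back as 3.140625, and 1e38 stays finite and within about half a percent of the original. Values that would overflow FP16 still fit in BF16, which is part of why a model trained in 32 bits can often be converted without retraining.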

Low latency for real-time output

You may have heard at re:Invent 2019 that Inferentia provides lower latency. Here’s how.

As ML gets more sophisticated, models grow, and moving a model in and out of memory becomes the dominant task, when the focus was supposed to be on improving the model algorithm. This adds latency and magnifies the compute problem. The Amazon Inferentia chip is designed to address these latency issues to a large extent.

The chips are interconnected, and this helps with latency: a model can be partitioned across multiple cores and held entirely in on-chip cache, so data streams at full speed through the pipeline of cores without the latency caused by external memory access.
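The partitioning idea can be sketched in a few lines of Python. This is a conceptual toy under loud assumptions: the "layers" are simple arithmetic functions, each "core" is just a list of layers it keeps resident, and the names `partition_layers` and `run_pipeline` are invented for the example. It says nothing about how the Neuron SDK actually places a model. What it does show is the streaming pattern: each stage keeps its own slice of the model, and only the intermediate results move from one stage to the next.

```python
def partition_layers(layers, num_cores):
    """Split layers into contiguous, near-equal stages, one stage per core."""
    base, extra = divmod(len(layers), num_cores)
    stages, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        stages.append(layers[start:start + size])
        start += size
    return stages


def run_pipeline(stages, inputs):
    """Stream inputs through the stages; each item advances one stage per tick."""
    slots = [None] * len(stages)  # slots[s] holds the item waiting at stage s
    pending = list(inputs)
    outputs = []
    while pending or any(slot is not None for slot in slots):
        # Walk the stages back to front so the next slot is free before we fill it.
        for s in reversed(range(len(stages))):
            if slots[s] is None:
                continue
            x = slots[s]
            for layer in stages[s]:  # the stage's layers stay resident on its core
                x = layer(x)
            slots[s] = None
            if s + 1 < len(stages):
                slots[s + 1] = x     # hand the activation to the next core
            else:
                outputs.append(x)
        if pending:
            slots[0] = pending.pop(0)
    return outputs


# Hypothetical 8-layer model spread over 4 cores.
layers = [(lambda x, k=k: x * 2 + k) for k in range(8)]
stages = partition_layers(layers, 4)
inputs = [1, 2, 3]
outputs = run_pipeline(stages, inputs)
print([len(stage) for stage in stages], outputs)
```

Once the pipeline is full, every stage is busy on a different input at the same time, and no stage ever has to reload its weights.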

Supports the popular frameworks

ML practitioners work with a wide variety of frameworks, and AWS makes it easy to target AWS Inferentia from the popular ones, such as TensorFlow, PyTorch, and Apache MXNet. To run on Inferentia, a model needs to be compiled to a hardware-optimized representation. This might seem like expert-level work, but it is not: the compilation can be done with the command-line tools in the AWS Neuron SDK or via framework APIs.

Democratizing access to the hardware required for ML

Running ML models for hours, weeks, or sometimes months is an expensive affair. Organizations building and running ML applications may not be able to bear the full cost of owning, operating, and maintaining hardware with high compute power.

AWS has not yet released any Inferentia pricing beyond that of Amazon EC2 Inf1 instances (instances powered by Inferentia chips). But customers’ need to reduce inference costs almost certainly paved the way for Amazon Inferentia.

What’s next in Machine Learning for AWS?

AWS made more than a dozen announcements of services and products that enhance ML. We can’t ignore the Amazon SageMaker announcements, which came as a gift from AWS to the organizations and individuals who champion ML.

AWS will look to bring Inferentia chips to more of its offerings beyond the EC2 Inf1 instances. This will add more depth to AWS’s compute portfolio. Amazon’s strategy of adding custom-built, best-in-class chips can pay off quickly, but only if it can deliver these hardware services at lightning speed.