Cortex v0.13: deploy machine learning models in production

An open source machine learning platform for developers

Two years ago, my colleague and I came across TensorFlow, and despite having no idea what backpropagation or hidden layers were, we decided it would be cool to build machine learning applications. We found an abundance of resources for learning the basics of training machine learning models, but less information about deploying models as scalable web services. There wasn’t a clear path from Jupyter notebook to production.

In retrospect, this isn’t surprising because working with TensorFlow, PyTorch, or other machine learning frameworks requires a very different skillset than dealing with Docker, Kubernetes, NVIDIA drivers, and various AWS services.

Without the right infrastructure, models can take weeks instead of minutes to go from laptop to cloud, request latencies can be too high to provide an acceptable user experience, and production workloads can incur massive compute costs.

We realized that big tech companies like Uber, Netflix, and Spotify have in-house machine learning infrastructure teams to empower their data scientists and machine learning engineers to deploy models in production.

We’re building Cortex for everyone else.

Scaling real-time inference is hard

Our goal is not to reinvent the wheel but to use as much existing technology as possible to solve the problem. There are, however, some unique challenges in scaling machine learning inference that require Cortex to work differently than other deployment platforms.

Let’s assume you want to deploy OpenAI’s 1.5B parameter GPT-2 as a web service to add text generation functionality to your app:

GPT-2 is compute hungry: It may utilize a CPU at nearly 100% for several minutes to return one paragraph of text. Users frequently abandon websites with seconds of latency, let alone minutes.

GPT-2 is memory hungry: Besides CPU, GPT-2 needs a lot of RAM for a single inference. If your underlying web server can’t provide enough RAM, latency gets even worse or the API may crash.

GPT-2 is >5GB: Just loading it into memory takes a while, so naive approaches to updating a live web service could result in minutes of downtime on every update.

GPT-2 in production is expensive: Because a single request can occupy a server for minutes, you may need to deploy more servers than you have concurrent users if each user is making several requests per minute.
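
To make the cost point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes steady traffic and uses Little's law (average requests in flight = arrival rate × time per request). The function name, its parameters, and the numbers in the example are all hypothetical and chosen only for illustration.

```python
import math


def servers_needed(concurrent_users: int,
                   requests_per_user_per_minute: float,
                   seconds_per_inference: float,
                   concurrent_inferences_per_server: int = 1) -> int:
    """Estimate how many servers are needed to keep up with steady traffic."""
    # Requests arriving per second across all users.
    arrival_rate = concurrent_users * requests_per_user_per_minute / 60.0
    # Little's law: average number of requests being processed at any moment.
    in_flight = arrival_rate * seconds_per_inference
    return math.ceil(in_flight / concurrent_inferences_per_server)


# Hypothetical workload: 10 users, 3 requests per minute each,
# 120 seconds of compute per request, one inference per server at a time.
print(servers_needed(10, 3, 120))  # 60
```

With these made-up numbers, 10 concurrent users would need 60 servers, which is why latency per inference matters so much for cost.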

Cortex makes scaling real-time inference easy

Cortex is a platform for deploying machine learning models as production web services. It is designed for running real-time inference at scale. Autoscaling, CPU and GPU support, and spot instance support allow you to run large inference workloads without racking up huge AWS bills. Rolling updates, log streaming, and prediction monitoring enable rapid iteration while minimizing downtime. Multi-framework support and minimal configuration make Cortex clusters and deployments easy to launch and maintain.

Autoscaling

Production workloads aren’t always predictable, so Cortex automatically scales your prediction APIs up to handle peak traffic without high latency, and scales them down when traffic drops to reduce your AWS bill.
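
As an illustration of the general idea, a minimal sketch of a scaling decision might look like the following. This is not Cortex's actual autoscaling logic; the function name, the "target requests per replica" parameter, and the default limits are assumptions made for the example.

```python
def desired_replicas(in_flight_requests: int,
                     target_per_replica: int,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Pick a replica count that keeps load per replica near a target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # Integer ceiling division: enough replicas so none exceeds the target.
    wanted = -(-in_flight_requests // target_per_replica)
    # Clamp to the configured bounds so the API never scales to zero
    # and never exceeds the budgeted maximum.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(17, 4))    # 5: scale up under load
print(desired_replicas(0, 4))     # 1: scale down to the minimum when idle
print(desired_replicas(1000, 4))  # 10: capped at the maximum
```

The lower bound keeps the API responsive when traffic returns, and the upper bound caps the compute bill during spikes.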

CPU and GPU support

Cortex web services can seamlessly run on CPUs, GPUs, or both. While CPUs get the job done for simple models, GPUs are necessary to run large deep learning models fast enough to provide API responses in real time without compromising the end user experience.

Spot instances

Inference can get expensive fast because it is so compute intensive. Spot instances can unlock significant discounts, with the caveat that AWS can reclaim an instance at any time. Cortex has built-in fault tolerance to handle these interruptions, so you don’t have to worry.

Rolling updates

Suppose you have hundreds of GPU instances serving requests to your users, and now you’ve figured out a way to train a more accurate model. Cortex makes it easy to transition your web service to the new model without affecting its availability or latency.
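
The general technique can be sketched as a small simulation: bring up one replica of the new model, wait until it is ready, and only then retire one replica of the old model. This is a simplified illustration of a rolling update, not Cortex's implementation; the function and version names are hypothetical.

```python
from typing import List


def rolling_update(old: str, new: str, replica_count: int) -> List[List[str]]:
    """Return the ready replicas after each step of a surge-by-one rollout."""
    ready = [old] * replica_count
    history = [list(ready)]
    for _ in range(replica_count):
        # Start one replica of the new version. It joins the pool only
        # after its model has finished loading.
        ready.append(new)
        history.append(list(ready))
        # Only then retire one replica of the old version.
        ready.remove(old)
        history.append(list(ready))
    return history


steps = rolling_update("v1", "v2", replica_count=3)
# Serving capacity never drops below the original replica count.
assert min(len(step) for step in steps) == 3
print(steps[-1])  # ['v2', 'v2', 'v2']
```

Because an old replica is removed only after its replacement is ready, the slow model load happens off the serving path instead of causing downtime.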

Log streaming

Debugging machine learning models is hard, but seeing the logs in real time can help streamline the process. For example, real-time logs can be monitored to ensure that request payloads are transformed correctly to match the model’s input schema.
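
For instance, a request handler could log the transformed input so that a mismatch shows up in the log stream immediately. The sketch below uses only Python's standard logging module; the feature names and the function are hypothetical and not part of any Cortex API.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("predictor")

# Hypothetical input schema: the model expects these four numeric features,
# in this order.
FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def to_model_input(payload: dict) -> list:
    """Turn a JSON request payload into the feature row the model expects."""
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        logger.error("payload is missing fields: %s", missing)
        raise ValueError(f"missing fields: {missing}")
    row = [float(payload[name]) for name in FEATURES]
    # Logging the transformed row makes schema mismatches visible in the logs.
    logger.info("model input: %s", row)
    return row


row = to_model_input({"sepal_length": "5.1", "sepal_width": 3.5,
                      "petal_length": 1.4, "petal_width": 0.2})
print(row)  # [5.1, 3.5, 1.4, 0.2]
```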

Prediction monitoring

Production web services need to be monitored. For machine learning APIs, it’s especially important to track predictions to ensure that models are performing as expected.
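
One simple form of prediction tracking, for a classifier, is to count how often each label is predicted and watch for the distribution drifting over time. The class below is a minimal sketch of that idea under the assumption that predictions are string labels; it is not how Cortex implements monitoring, and all names are hypothetical.

```python
from collections import Counter
from typing import Dict


class PredictionMonitor:
    """Track how often each label is predicted."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, label: str) -> None:
        self.counts[label] += 1

    def distribution(self) -> Dict[str, float]:
        """Return the share of predictions per label (empty if none recorded)."""
        total = sum(self.counts.values())
        if total == 0:
            return {}
        return {label: n / total for label, n in self.counts.items()}


monitor = PredictionMonitor()
for label in ["cat", "cat", "dog", "cat"]:
    monitor.record(label)
print(monitor.distribution())  # {'cat': 0.75, 'dog': 0.25}
```

A sudden shift in this distribution, such as a model that starts predicting one label almost exclusively, is often the first visible sign that inputs or the model itself have changed.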

Multi-framework

Cortex supports all the Python machine learning frameworks: TensorFlow, Keras, PyTorch, scikit-learn, etc. Data scientists and machine learning engineers have different preferences when it comes to the tools they use to build models, and their deployment infrastructure should accommodate all frameworks with a deployment API that’s as close to uniform as possible.

Minimal configuration

Configuration should be simple, flexible, and reproducible. cluster.yaml files create predictable clusters and cortex.yaml files create predictable model deployments with minimal verbosity.

Design decisions

Our high-level philosophy is that shipping production machine learning web services requires both machine learning and distributed systems expertise. It’s rare to find people who have experience with both. We decided to be opinionated when it comes to infrastructure decisions, and leave all the data science decisions to our users.

Cloud native

Cortex is for production use cases. Some of our users run clusters with hundreds of GPUs to handle their production traffic. That’s hard to do on a laptop.

AWS

Cortex can be deployed in your AWS account. That means that you’ll have access to all the instances, autoscaling groups, security groups, and other resources that get provisioned for you when you launch a Cortex cluster. It also means that your machine learning infrastructure spending is fully visible in your AWS billing dashboard.

We get a lot of questions about GCP, Azure, and private cloud support. Right now, we’re focused on AWS because we’re a small team and we really want to get the experience right without spreading ourselves too thin.

EKS

Choosing the right AWS compute service wasn’t obvious. We wanted a managed service that didn’t charge a premium as a function of EC2 costs, which eliminated Fargate and SageMaker.

We also wanted to be able to run arbitrary containers with potentially large compute and memory needs, which eliminated Lambda. That left ECS and EKS, and we ultimately chose EKS because building on the Kubernetes APIs opens the door to supporting other cloud providers more easily.

Open source

We believe that our job is to build a product that developers love. We’re early in our journey, but building Cortex with our community is already one of the most rewarding things we’ve ever done.

Get started