Why we built our ML platform on AWS, and why that may have been a mistake

Omer Spillinger · Jan 20

When we started building Cortex, a platform for deploying machine learning models in production, we knew we wanted it to run as a self-hosted service on any cloud account. We also knew that as a small engineering team, we should focus on a single cloud provider, so we chose AWS.

Our thought process was simple: start with the cloud provider with the most users. We assumed that the most interesting machine learning use cases would be at larger companies, which predominantly use AWS. We also felt that our decision was validated by other infrastructure companies like Elastic, Databricks, and Cockroach Labs, which prioritized AWS over other cloud providers for their products.

After about a year as an open source project, we’ve started seriously questioning our decision. The early innings of cloud infrastructure were dominated by AWS, but the consensus is starting to change.

Our #1 feature request is GCP support

While social media can hint at which way the wind is blowing, community feedback is our most important source of information. We’ve gotten questions about deploying on Azure, on-premises Kubernetes clusters, and local machines, but by far the most common question is about GCP support.

The pain of turning away GCP users has gotten us to question whether starting with AWS was the right call — though, to be fair, we don’t know how many users we’d have if we decided to support GCP first.

The question is, why do so many people want to use GCP over other clouds?

GCP is attractive for machine learning engineers

GCP has several offerings that are really valuable for the machine learning community:

Google Colab is free and seems to be everywhere (we use it extensively ourselves).

GCP allows users to run deep learning workloads on TPUs.

The Google Kubernetes Engine (GKE) control plane is free, whereas the control plane for Amazon’s Elastic Kubernetes Service (EKS) costs $0.20 an hour.

This isn’t to say that AWS doesn’t deliver its own advantages. EKS is a valuable component of our stack that abstracts away many of the challenges of managing Kubernetes ourselves. Still, the low cost of Google’s services (especially once their $300 in free credits is taken into account) makes GCP appealing to cost-sensitive users.

This isn’t a major issue for the well-funded machine learning teams with whom we iterated on the early design of Cortex. However, after open sourcing the project, we were pleasantly surprised to see students and individual developers adopting the platform. Unfortunately, they were also struggling with their AWS bill.
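To put the control plane pricing in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the hourly rates quoted above; the 30-day billing period and the assumption of a single cluster running around the clock are ours, and the function name is purely illustrative. Worker nodes, storage, and data transfer are not included.

```python
HOURS_PER_DAY = 24


def control_plane_cost(hourly_rate_usd: float, days: int) -> float:
    """Cost in USD of keeping one managed control plane running continuously."""
    return round(hourly_rate_usd * HOURS_PER_DAY * days, 2)


# Hourly rates as quoted in this post.
EKS_HOURLY_USD = 0.20
GKE_HOURLY_USD = 0.00

# Assumption: a 30-day period with the cluster never shut down.
eks_30_days = control_plane_cost(EKS_HOURLY_USD, days=30)
gke_30_days = control_plane_cost(GKE_HOURLY_USD, days=30)

print(f"EKS control plane, 30 days: ${eks_30_days:.2f}")
print(f"GKE control plane, 30 days: ${gke_30_days:.2f}")
```

Under these assumptions, the EKS control plane alone comes to $144 over 30 days before any compute is provisioned, which is a meaningful amount for a student or hobbyist.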

Becoming cloud agnostic

Google’s cloud offerings, as well as their internal experience with running machine learning at scale, are driving many machine learning practitioners to adopt GCP over AWS.

At this point, our best guess is that AWS is still the predominant platform for production machine learning, given that it is the most popular cloud worldwide. Being #1 overall, however, doesn’t necessarily mean AWS is the most popular choice among machine learning practitioners.

Our community’s interest in GCP is the push we needed to start working towards cloud agnosticism. We aim to learn from this experience and find out whether we can support multiple clouds without compromising the simplicity and stability of our platform. If it goes well, Azure will be next.