Too big to deploy: How GPT-2 is breaking servers

A look at the bottleneck around deploying massive models to production

The most optimistic of us envision a future in which machine learning is capable of human-level tasks—driving our cars, answering our calls, booking our appointments, responding to our emails.

Reality, of course, is different. Modern production machine learning has only effectively tackled very tightly scoped problems—recommending your next show on Netflix or calculating your ETA.

When OpenAI released GPT-2, however, it felt like that gap began to close.

By simply dialing up the size of their model (GPT-2 has a whopping 1.5 billion parameters, more than 10x the 93.6 million of ELMo, a previous state-of-the-art language model), OpenAI built a general language model that—while sometimes imperfect—could be convincingly human.

And GPT-2 wasn’t an outlier. Shortly after its release, Salesforce published CTRL, a 1.6 billion parameter language model. NVIDIA built Megatron, an 8 billion parameter transformer model. Just this week, Google released Meena, a state-of-the-art conversational model with 2.6 billion parameters.

Even in computer vision, the path to better performance seems to run through bigger models. In the summer of 2018, just months before GPT-2 was first published, Google released NASNet, a record-breaking image classification model with 88.9 million parameters—bigger than any other major image classification model—capable of identifying objects in images.

The trend is clear. To reach our rosy vision of a machine learning powered future, these “super models” are going to get bigger and bigger. There is just one problem:

They’re too big to serve in production.

What’s so challenging about serving super models?

As models continue to balloon in size, deploying them to production gets trickier and trickier. Take GPT-2 as an example:

GPT-2 is over 5 GB. Locally embedding the model into an application—the way mobile software often uses machine learning—isn’t an option at that size.

GPT-2 is compute hungry. In order to serve a single prediction, GPT-2 can occupy a CPU at 100% utilization for several minutes. Even with a GPU, a single prediction can still take seconds. Compare this to a web app, which can serve hundreds of concurrent users with one CPU.

GPT-2 is memory hungry. Beyond its considerable disk space and compute requirements, GPT-2 also needs large amounts of memory to run without crashing.
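If you want a feel for these numbers on your own hardware, here is a minimal sketch (assuming PyTorch and HuggingFace's transformers library are installed) that times a single generation from the full 1.5 billion parameter model, published on the HuggingFace hub as gpt2-xl:

```python
# Rough timing of a single GPT-2 prediction.
# Requires: pip install torch transformers
# Note: gpt2-xl is the full 1.5B parameter checkpoint, and downloading
# it pulls several gigabytes of weights.
import time

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

input_ids = tokenizer.encode("Machine learning is", return_tensors="pt").to(device)

start = time.time()
with torch.no_grad():
    output = model.generate(input_ids, max_length=50)
print(f"Generated ~50 tokens in {time.time() - start:.1f}s on {device}")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

On a CPU, expect the generation to take on the order of seconds to minutes; on a GPU it is much faster, but still nowhere near typical web-app latencies.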

In other words, GPT-2 is big, resource intensive, and slow. Putting it into production at all is a challenge, and scaling it is even harder.

These problems aren’t unique to GPT-2. They apply to every super model, and they will only get worse as models get bigger. Fortunately, there are projects in the machine learning ecosystem chipping away at this obstacle.

How we’re solving the super model problem

It’s still early days, but there are three general efforts to solve the super model problem:

1. Make the models smaller

One of the more obvious places to start. If models are getting too big, why don’t we compress them?

One way to do this is through knowledge distillation. At a very high level, the idea is that a small model—called the student—can emulate the performance of a large model—called the teacher—by studying it.

Training GPT-2 itself required feeding it 40 GB of text, equivalent to a text file of about 27,118,520 pages. Training a distilled GPT-2 model, however, only requires feeding it the outputs of GPT-2.
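To make that concrete, here is a minimal sketch of the core distillation loss in PyTorch. The function name and training-loop shape are illustrative, not HuggingFace's actual recipe (which combines this soft-label loss with other terms):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Penalize the student for diverging from the teacher's softened outputs."""
    # Flatten leading dimensions so each row is one token's distribution.
    s = student_logits.reshape(-1, student_logits.size(-1))
    t = teacher_logits.reshape(-1, teacher_logits.size(-1))
    student_log_probs = F.log_softmax(s / temperature, dim=-1)
    teacher_probs = F.softmax(t / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Inside a training loop (illustrative names):
#   with torch.no_grad():
#       teacher_logits = teacher(input_ids).logits  # the frozen, full-size GPT-2
#   student_logits = student(input_ids).logits      # the small model being trained
#   loss = distillation_loss(student_logits, teacher_logits)
```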

HuggingFace, the company behind the famous Transformers NLP library, did just this to create DistilGPT2. While DistilGPT2 scores a few points lower on some quality benchmarks than the full GPT-2 model, it is 33% smaller and twice as fast.

A 2x increase in speed is a huge deal. For a self-driving car, it’s the difference between a safe stop and a fender bender. For a conversational agent, it’s the difference between a natural conversation and an infuriating robocall.

You can actually compare the performance of DistilGPT2 and GPT-2 with HuggingFace’s interactive Write With Transformers editor.
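You can also run a rough comparison locally. This sketch generates from both checkpoints with the transformers pipeline API (distilgpt2 was distilled from the smallest GPT-2 checkpoint, gpt2, so that is the fair pairing):

```python
from transformers import pipeline, set_seed

set_seed(42)  # make the comparison repeatable
prompt = "In a shocking finding, scientists discovered"

for model_name in ["gpt2", "distilgpt2"]:
    generator = pipeline("text-generation", model=model_name)
    result = generator(prompt, max_length=40, num_return_sequences=1)
    print(f"--- {model_name} ---")
    print(result[0]["generated_text"])
```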

2. Deploy the models to the cloud

Even with distillation, however, models are still pretty big. A 33% reduction on a model that is over 25 GB (NVIDIA’s Megatron is 5.6x the size of GPT-2) still leaves you with a behemoth of a model.

At this size, the devices we use to consume ML-generated content—our phones, TVs, even our computers—can’t be responsible for hosting the models. They simply won’t fit.

One solution is to deploy the models to the cloud as microservices, which our devices can query as needed. This is referred to as realtime inference, and it is the standard way to deploy large models in production.
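At its simplest, such a microservice is just a web server with the model loaded into memory once at startup. Here is a minimal sketch using Flask; the /predict endpoint and request shape are illustrative, and DistilGPT2 stands in for whatever model you're serving:

```python
# Requires: pip install flask transformers torch
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

# Load the model once, at startup -- not per request.
generator = pipeline("text-generation", model="distilgpt2")

@app.route("/predict", methods=["POST"])
def predict():
    prompt = request.get_json()["prompt"]
    result = generator(prompt, max_length=50)
    return jsonify({"text": result[0]["generated_text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A phone app or browser POSTs a prompt and gets generated text back; the multi-gigabyte model never leaves the server.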

Deploying in the cloud, however, has its own problems—particularly at scale.

As an example, let’s look at AI Dungeon, the popular choose-your-own-adventure game built on GPT-2.

Because of GPT-2’s size and compute requirements, AI Dungeon can only serve a couple of users from a single deployed model. To handle increases in traffic, AI Dungeon needs to automatically scale up.

Horizontally scaling GPT-2 deployments is tricky. It requires you to:

Ensure each deployment is identical, e.g. by containerizing your model with Docker and orchestrating your containers with Kubernetes.

Auto scale your deployments, e.g. by configuring your cloud vendor’s auto scaler to automatically spin instances up and down depending on traffic (see the sketch after this list).

Optimize your resources, which means finding the cheapest instance type and resource allocation you can run without sacrificing performance.
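To make that second step concrete, here is a minimal sketch using the official Kubernetes Python client to change a deployment's replica count. The deployment name gpt2-api is hypothetical, and in a real cluster a Horizontal Pod Autoscaler (or your cloud vendor's autoscaler) would issue this kind of change automatically based on traffic:

```python
# Requires: pip install kubernetes, plus a kubeconfig for your cluster.
from kubernetes import client, config

config.load_kube_config()  # authenticate using your local kubeconfig
apps = client.AppsV1Api()

def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    # Patch only the replica count of the deployment's scale subresource.
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Spin up more copies of the (hypothetical) GPT-2 API ahead of a traffic spike.
scale_deployment("gpt2-api", "default", replicas=4)
```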

Done wrong, you could easily rack up a giant cloud bill—deploying 200 g4dn.2xlarge instances costs $150.40 an hour—or find yourself with a prediction serving API that constantly crashes.

In other words, to serve your big models, you currently need to know quite a bit about devops—and most data scientists are not simultaneously infrastructure engineers.

Fortunately, there are projects working to remove this bottleneck.

Open source projects like Cortex, the project behind AI Dungeon’s infrastructure, have gained traction as tools designed to automate the devops work required to deploy large models.

Full disclosure: I am a Cortex contributor.

3. Accelerate the model serving hardware

The last bucket of efforts to make it easier to serve big models doesn’t have anything to do with the models themselves. Instead, it has to do with improving the hardware.

Bigger models behave very differently depending on the hardware they run on. In fact, as I explained in my piece on why GPUs matter for model serving, the only way to serve GPT-2 with low enough latency for a feature like autocomplete is with GPUs:

The average person types 40 words per minute. The average English word has roughly 5 characters. An average person, as a result, types 200 characters per minute, or 3.33 characters per second. Taken one step further, this means there is roughly 300 ms between each character an average person types. If you’re running on CPUs, taking 925 ms per request, you’re way too slow for Gmail’s Smart Compose. By the time you process one of a user’s characters, they’re roughly 3 characters ahead — even more if they’re a fast typist. With GPUs, however, you’re well ahead of them. At 199 ms per request, you’ll be able to predict the rest of their message with about 100 ms to spare — which is useful when you consider their browser still needs to render your prediction.
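That budget works out in a few lines of Python, using the numbers from the excerpt above:

```python
# The per-keystroke latency budget for an autocomplete-style feature.
words_per_minute = 40
chars_per_word = 5

chars_per_second = words_per_minute * chars_per_word / 60  # ~3.33 chars/sec
budget_ms = 1000 / chars_per_second                        # ~300 ms per keystroke

for hardware, latency_ms in [("CPU", 925), ("GPU", 199)]:
    slack = budget_ms - latency_ms
    verdict = "fits the budget" if slack > 0 else "blows the budget"
    print(f"{hardware}: {latency_ms} ms/request, {slack:+.0f} ms slack -> {verdict}")
```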

As models get bigger, however, we need even more processing power.

Some efforts to solve this problem involve building entirely new hardware. Google, for example, has released TPUs, ASICs designed specifically for neural network workloads and tightly integrated with TensorFlow. Google’s newest TPUs have recently broken records for scalability and performance on model serving benchmarks. AWS has similarly announced its own specialized inference chip, Inferentia.

Other efforts involve accelerating and optimizing existing hardware. NVIDIA, for example, has released TensorRT, an SDK for optimizing NVIDIA GPU utilization in inference serving. NVIDIA has already documented a 40x performance increase over CPU-only inference using TensorRT on GPUs.
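TensorRT's full pipeline (layer fusion, kernel auto-tuning, and more) is beyond a short snippet, but one of the optimizations it applies, reduced numerical precision, is easy to approximate in plain PyTorch on an NVIDIA GPU:

```python
# Compare FP32 and FP16 generation latency on a CUDA GPU. This only
# approximates one of TensorRT's optimizations; it is not TensorRT itself.
import time

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_ids = tokenizer.encode("The future of machine learning", return_tensors="pt").to("cuda")

for precision, cast in [("FP32", lambda m: m), ("FP16", lambda m: m.half())]:
    model = cast(GPT2LMHeadModel.from_pretrained("gpt2")).to("cuda").eval()
    torch.cuda.synchronize()  # a real benchmark would also warm up first
    start = time.time()
    with torch.no_grad():
        model.generate(input_ids, max_length=50)
    torch.cuda.synchronize()
    print(f"{precision}: {time.time() - start:.2f}s")
```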

Machine learning will become commonplace

In many respects, the machine learning space still feels like the wild west.

Super models like GPT-2 have only just begun to emerge, machine learning is only now becoming broadly accessible to individual engineers rather than just major companies, and it seems like a new breakthrough in model architecture is always just around the corner.

However, we’re already seeing machine learning functionality crop up in nearly every vertical, from media to finance to retail. In the surprisingly near future, there will scarcely be a product that doesn’t involve machine learning in some way.

As machine learning becomes a standard part of software, the challenges of deploying huge models in production will similarly become commonplace.