For the better part of a year, OpenAI’s GPT-2 has been one of the hottest topics in machine learning — and for good reason. The text-generating model, which was initially dubbed “too dangerous” to be released in full, is capable of producing uncanny outputs. If you haven’t seen any examples, I recommend looking at OpenAI’s official samples — they’re incredible.

Due in part to the machine learning community’s excitement about GPT-2, there are a ton of tools available to help you implement GPT-2 in different use cases:

Want to play with GPT-2? OpenAI has released pre-trained models.

Want to train GPT-2 with different text? Use Max Woolf’s gpt-2-simple.

Need a faster, compressed GPT-2? Use Hugging Face’s DistilGPT-2.

With all of these tools, it’s fairly trivial to get GPT-2 running locally. It is still difficult, however, to deploy GPT-2 in production.

In order to build real software with GPT-2 — from chatbots to Magic: The Gathering card generators — you’ll need to deploy your model in production. The most common way to do this is to deploy your model as a web API, queryable by your application.

In this tutorial, we’re going to deploy Hugging Face’s DistilGPT-2 as a web API on AWS. Our API is going to be built on infrastructure that handles autoscaling, monitoring, updating, and logging automatically.

Let’s get started.

1. Load Hugging Face’s DistilGPT-2

To start, we’re going to create a Python script to load our model and process responses. For the sake of this tutorial, we’ll call it predictor.py.

As you can see, Hugging Face’s Transformers library makes it possible to load DistilGPT-2 in just a few lines of code:
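A minimal sketch, using the standard Transformers API (“distilgpt2” is the checkpoint name Hugging Face publishes):

```python
# predictor.py -- load DistilGPT-2 and its tokenizer
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "distilgpt2" is Hugging Face's published DistilGPT-2 checkpoint
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")
model.eval()  # inference only; we won't be training
```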

And now you have an initialized DistilGPT-2 model. As a side note, Hugging Face’s Transformers library makes it this easy to initialize nearly any state-of-the-art NLP model, not just DistilGPT-2.

Within your predictor.py script, you’ll also need a function to serve predictions, which we’ll call predict() . When passed input, predict() should tokenize the input, run it through the model, decode the output, and respond with the generated text. In this scenario, our predict() function can be as simple as 6 lines of code:
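A hedged sketch: the full predictor.py relies on a couple of helper functions for sampling, and this version substitutes Transformers’ built-in generate() method to the same effect (the sampling parameters here are illustrative choices, not the post’s exact values):

```python
import torch

def predict(payload):
    # Tokenize the request text into model-ready token IDs
    # (tokenizer and model are the objects initialized above)
    input_ids = tokenizer.encode(payload["text"], return_tensors="pt")
    # Sample a continuation; max_length and do_sample are illustrative
    with torch.no_grad():
        output = model.generate(input_ids, max_length=100, do_sample=True)
    # Decode the generated token IDs back into text
    return tokenizer.decode(output[0], skip_special_tokens=True)
```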

The full predictor.py calls a few more functions that we won’t go into detail on here, but you can see/copy the entirety of predictor.py here.

With our prediction-related code written, we can deploy our model.

2. Deploy DistilGPT-2 as an API

This is typically a major pain point in machine learning infrastructure. Responsibly deploying a model means implementing autoscaling, structuring updates so that they don’t break your API, monitoring your model’s performance, and handling logging.

Instead of doing all of the above by hand, we’re going to abstract it away with Cortex. You can read more about Cortex here, but essentially, it is a tool that takes a simple configuration file and uses it to automate your model deployment on AWS.

First, install Cortex. Once Cortex is installed, you can create your deployment configuration file, which should be called cortex.yaml. The file can be this light:
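Cortex’s configuration schema varies between versions, so treat the following as an illustrative sketch rather than a canonical file; the deployment and API names match the text/generator endpoint queried later in this tutorial:

```yaml
# cortex.yaml -- illustrative sketch; check the Cortex docs for the
# exact schema of your Cortex version
- kind: deployment
  name: text

- kind: api
  name: generator
  predictor:
    path: predictor.py
```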

Once your configuration is saved, you can simply run cortex deploy from your command line. This will take the declarative configuration from cortex.yaml and create it on the cluster:

$ cortex deploy
deployment started

Behind the scenes, Cortex containerizes our implementation, makes it servable using Flask, exposes the endpoint with a load balancer, and orchestrates the workload on Kubernetes.

Now, we can query our API.

3. Query your DistilGPT-2 API in real time

At any point, you can retrieve the URL of your API endpoint by running cortex get generator. When you have your endpoint, you can use curl to test your deployment:

$ cortex get generator
url: http://***.amazonaws.com/text/generator

$ curl http://***.amazonaws.com/text/generator \
    -X POST -H "Content-Type: application/json" \
    -d '{"text": "Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence."}'

"Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. The iFrame top was inspired by several prominent advances in machine learning – vision, machine learning, machine learning and machine learning – that were widely accepted. One well-known example was intuition – used by many computer scientists to predict which"

You can now access your endpoint from any service capable of querying it, just as you would consume any other web API.
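For example, here’s a minimal sketch of querying the endpoint from Python with the requests library (substitute the URL that cortex get generator reports for the masked placeholder):

```python
import requests

# Masked placeholder -- use the URL reported by `cortex get generator`
endpoint = "http://***.amazonaws.com/text/generator"

# The API expects a JSON body with a "text" field, as in the curl example above
payload = {"text": "Machine learning is"}

response = requests.post(endpoint, json=payload)
response.raise_for_status()

print(response.text)  # the generated continuation
```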

Voila! You have DistilGPT-2 deployed as a scalable web API, and all it took was a simple configuration file.

Taking things a step further

There are dozens of ways to incorporate a DistilGPT-2-powered API into a software project. Want to build an autocomplete feature? How about a Chrome extension that suggests email responses? Or, more practically, a chatbot for your site?

You could even dabble with other pre-trained models, thanks to how simple Hugging Face has made it to implement them with Transformers. With Cortex, you can deploy any of them by following the above template.