AWS SageMaker is a cloud machine learning SDK designed for speed of iteration, and it’s one of the fastest-growing tools in the Amazon AWS ecosystem. Since launching in late 2017, SageMaker’s growth has been remarkable: at last year’s AWS re:Invent, Amazon stated that over 10,000 companies now use SageMaker to standardize their machine learning processes. SageMaker allows you to use a Jupyter notebook interface to launch and tear down machine learning processes in a handful of lines of Python code, something that makes data scientists happy because it abstracts away many of the messy infrastructural details of training. The thesis: standing up your own machine learning algorithm should always be this easy!

SageMaker has two APIs: a high-level API for working with a variety of pre-optimized machine learning libraries (like MXNet, TensorFlow, and scikit-learn), and a low-level API that allows running completely custom jobs where anything goes. Any library and any API you can fit into a Docker image can be used with SageMaker, and this approach has some notable advantages:

Access to algorithms and libraries that don’t come pre-installed on Amazon.

More flexibility than Amazon’s pre-builds. Pre-built algorithms are configured using a JSON specification, a strategy which has limitations.

Less vendor lock-in. JSON configs are unlikely to be common across cloud providers.

Compatibility with familiar local tools. You can train a model locally using your local machine or on a local Docker image, then tweak that image to work in SageMaker with minimal effort.

Even if you’re completely comfortable sticking with the pre-built SageMaker algorithms, building and deploying your own algorithm is nevertheless valuable because it shows you what SageMaker is doing under the hood, which will come in handy whenever it comes time to debug a model run.

This article shows you how.

We will build a simple, fully custom image classifier, using a project structure that you can easily lift and reuse for your own projects. Whether you’re totally new to the platform or have played with SageMaker before, you should walk away from this article equipped with the knowledge necessary to run your own custom machine learning jobs, and custom machine learning job infrastructure, on top of AWS.

I recommend following along both here and by looking at the demo code on GitHub. If you are completely new to AWS SageMaker, a good way to get familiar with its feature-set is by reading the launch announcement. For a deeper evaluation I also highly recommend “Digging into AWS SageMaker — First Look” by philarmour.

Note that to run the code samples in this article for yourself you will need to change the paths in the demo to point to an Amazon S3 bucket you have permissions to — s3://quilt-example just happens to be a bucket we use for testing internally.

First let’s understand how SageMaker works

Like many other advanced Amazon SDKs, SageMaker builds on top of other AWS services you may already be familiar with: Amazon S3 (blob storage), Amazon ECR (Docker registry), and Amazon EC2 (compute).

The rest of the article will cover deploying the custom model in four steps:

1. Wrapping a model training pipeline in a SageMaker-compatible Docker image.
2. Uploading that image to Amazon ECR.
3. Using the Python API to schedule that pipeline as a job on an EC2 cluster, saving the model artifacts produced by that run to S3.
4. Consuming that freshly trained estimator as either a service or a batch job.

Without further ado let’s get started.

Step 1: Writing the image

Since SageMaker machine learning training jobs are managed using Docker images, the first step to running a job is building the container.

If you are unfamiliar with Docker you should pause here and read “A Beginner Friendly Introduction to Containers, VMs, and Docker”.

When SageMaker launches a training image it injects a handful of files and environment variables from the estimator definition. The full list of resources injected is provided in the documentation. AWS uses these context clues to configure pre-built algorithm runs, but it doesn’t require that custom training jobs do the same, so you can ignore these until you actually need them.

SageMaker in turn expects the image to write its outputs to specific places inside the container:

/opt/ml/output/failure . If the training job fails, AWS SageMaker recommends writing the reason why to this file; however, this is completely optional.

/opt/ml/model . This directory is expected to contain the model artifacts created by the training job. AWS SageMaker will automatically harvest the files in this folder at the end of the training run, tar them, and upload them to S3.

With that in mind, let’s examine an example Docker image that’s SageMaker compatible. Starting with the Dockerfile :
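A minimal sketch of what such a Dockerfile might look like (the base image, package list, and file names here are assumptions; the real Dockerfile lives in the demo repository):

```dockerfile
# Hedged sketch of a SageMaker-compatible training image.
FROM python:3.6
# Assumed model-build dependencies: jupyter for nbconvert, the scientific
# stack for the CNN, and flask for serving.
RUN pip install jupyter keras tensorflow pandas flask
COPY build.ipynb app.py run.sh /
RUN chmod +x /run.sh
# Exec-form ENTRYPOINT, as SageMaker requires (see the note that follows).
ENTRYPOINT ["/run.sh"]
```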

The SageMaker job runner requires that your container image define an ENTRYPOINT using the exec syntax (e.g. ENTRYPOINT ["some", "commands"] ), not the shell syntax (e.g. ENTRYPOINT some command ), as it needs to be able to send SIGTERM and SIGKILL signals to the container.

Additionally, when executing the container the SageMaker job runner will pass a run-time argument. If it is running the container in training mode this will be train ; if the image is being deployed to an endpoint this will be deploy . If you use the same image both for training your model and for deploying it, you will need to parse this argument to check which mode the container is executing in.

That’s it — that’s the full list of restrictions SageMaker places on your image configuration!

Here’s the entry-point for this demo Dockerfile , run.sh . Ignore the else clause for now, we will come back to that later.
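A sketch of its dispatch logic (the actual commands are shown as comments so the sketch stays self-contained; the real run.sh is in the demo repository):

```shell
#!/usr/bin/env bash
# Sketch of run.sh, the container ENTRYPOINT. SageMaker passes "train" or
# "deploy" as the container's single run-time argument.

run () {
    if [ "$1" = "train" ]; then
        # Re-run the model build notebook in place. timeout=-1 disables the
        # per-cell timeout so a long training run is not interrupted:
        #   jupyter nbconvert --to notebook --inplace --execute \
        #       --ExecutePreprocessor.timeout=-1 build.ipynb
        echo "training"
    else
        # Stand up the prediction web server (Step 4a):
        #   exec python app.py
        echo "serving"
    fi
}

run "$@"
```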

Personally, when experimenting with a new model I usually find it most convenient to do my model training in a Jupyter notebook. Here, training a new version of a model means re-running the build.ipynb notebook in place, with nbconvert doing all of the heavy lifting. Note that I pass it the --ExecutePreprocessor.timeout=-1 argument to disable the process timeout, which might otherwise interrupt our model build before it’s finished.

I will omit the example model-building notebook itself from this post as it is somewhat beside the point (but you can leaf through it on GitHub). Basically, the notebook loads some input data (using the Quilt T4 API, but you may also use raw boto3 ) and then trains and validates a simple convolutional neural network (CNN) on that data. The notebook ends with the following code cell:

clf.save('clf.h5')
!cp clf.h5 /opt/ml/model/clf.h5

This creates a model artifact (using the save utility function in the keras library) and then copies that artifact to the /opt/ml/model path for export to S3.

Step 2: Upload the image to Amazon ECR

Now that we’ve written a SageMaker-compatible image definition the next step is registering the image with Amazon’s container registry service, Amazon ECR, so that SageMaker has access to it. Here’s how we do it (using the contents of the quilt-sagemaker-demo folder, which you get when you clone the repository):
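The flow can be sketched as follows (the helper name ecr_fullname is illustrative, and the aws/docker commands are shown as comments so the sketch stays self-contained):

```shell
#!/usr/bin/env bash
# Sketch of the ECR build-and-push flow. The helper below constructs the
# fully qualified ECR image name from account, region, and image name:tag.

ecr_fullname () {
    local account="$1" region="$2" image="$3"
    echo "${account}.dkr.ecr.${region}.amazonaws.com/${image}"
}

# account=$(aws sts get-caller-identity --query Account --output text)
# region=$(aws configure get region)
# fullname=$(ecr_fullname "$account" "$region" "quilt-sagemaker-demo:latest")
# $(aws ecr get-login --region "$region" --no-include-email)   # log in to ECR
# docker build -t quilt-sagemaker-demo .
# docker tag quilt-sagemaker-demo "$fullname"
# docker push "$fullname"
```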

If you’re not familiar with the aws CLI this script is admittedly a little overwhelming. Here’s what it does, step-by-step:

1. Gets your region and account number to construct the home ECR registry for your account: "${account}.dkr.ecr.${region}.amazonaws.com" .
2. Generates a fullname for your image, which consists of your ECR home registry string plus the name and tag of the image you are about to push.
3. Logs you into ECR.
4. Builds the image locally, then pushes it to ECR.

When running this script yourself, make sure that the environment you are running it in has upload access to ECR. The default SageMaker cloud notebook environment for example does not have this.

Step 3: Train the model

It’s now time to train the model.

The best way to interact with SageMaker jobs programmatically is using the sagemaker Python API. You can install it from PyPi by running the usual pip install sagemaker command. Then you run the following to actually fit the model:
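A sketch of that call, assuming hypothetical bucket, role, and image names (v1 sagemaker SDK; running it for real requires the sagemaker package and AWS credentials):

```python
# Sketch of scheduling the training job. The role ARN, image URI, and
# bucket below are hypothetical placeholders.
def fit_model():
    import sagemaker
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical

    clf = Estimator(
        image_name="123456789012.dkr.ecr.us-east-1.amazonaws.com/quilt-sagemaker-demo:latest",
        role=role,
        sagemaker_session=session,
        train_instance_count=1,            # >1 enables distributed training
        train_instance_type="ml.c4.2xlarge",
        output_path="s3://your-bucket/models/",  # where the model tarball lands
    )
    # Spins up the instance(s), pulls the image, and runs it with the
    # "train" argument; blocks until the job finishes.
    clf.fit()
    return clf
```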

Training flows through the sagemaker.estimator.Estimator object, which is parameterized with the image URI, the AWS role and session information used to authorize the run, and the number (>1 == distributed training) and type of EC2 instances to be used for the training job. Running the fit method will spin up new EC2 instance(s), load the training image, and then launch a docker run ${image} train job on the image.

Notice the output_path parameter. Recall from earlier that after completing a container run, SageMaker will harvest any model files written to the /opt/ml/model path in the image, tar them, and write them to S3. Use output_path to specify where that artifact should go.


Step 4a: Deploy the model as a web service

Now that the model is defined it’s time to deploy it. AWS SageMaker currently supports two kinds of deployment: deploying your model as a web endpoint, and using your model to perform a batch job. We’ll cover deploying a web endpoint in this section, then look at how batch jobs work in the next one.

Web service deployment jobs are managed the same way training jobs are: using a Docker image of your choice, this time with the deploy argument passed at runtime. When AWS SageMaker receives a deployment request, it starts by spinning up the EC2 instance (or instances) and loading in the container. It then uncompresses and injects the model artifact you generated earlier to the /opt/ml/model path in the container.

Again using the sagemaker Python API:
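A sketch of the deployment call (v1 sagemaker SDK; the S3 path, image URI, and role are hypothetical placeholders):

```python
# Sketch of deploying the trained artifact as a web endpoint.
def deploy_model(role, session):
    from sagemaker.model import Model
    from sagemaker.predictor import RealTimePredictor

    model = Model(
        model_data="s3://your-bucket/models/model.tar.gz",  # fit() output
        image="123456789012.dkr.ecr.us-east-1.amazonaws.com/quilt-sagemaker-demo:latest",
        role=role,
        predictor_cls=RealTimePredictor,  # so deploy() returns a predictor
        sagemaker_session=session,
    )
    # Spins up the instance, runs the image with the "deploy" argument,
    # and injects the untarred model artifact at /opt/ml/model.
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.c4.2xlarge",
    )
    return predictor
```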

This code instantiates a new Model using a model artifact, a Docker image, and some auth configuration. Recall that the demo image we built handles both the train and deploy arguments, which is why we can reuse it here. For more complex models you will likely want separate train and deploy images.

The image that you deploy is expected to respond to input on port 8080 on two paths: /invocations and /ping .

The /invocations path is what does the actual work of servicing requests. It should accept POST requests from users, run your model on the payload (presumably by uncorking the model artifact stored at ${PATH} ), then respond with predictions and a status code of 200 ( OK ). There is no restriction on your input and output types.

/invocations has a timeout, which is 60 seconds by default. In other words, if your model takes longer than 60 seconds to respond, your request will time out with an error! You may configure the timeout to be higher than 60 seconds if you’d like, but for long-running jobs you should use the batch prediction feature instead (covered in the next section).

The other path is /ping , which should accept GET requests and act as a health check for your service. At startup time, the EC2 job runner uses /ping to determine when the container is ready to service requests; your service will not become available until /ping starts to return status 200. The response body may be empty, and you may implement whatever logic you’d like, including simply always returning status 200, so long as it executes within 2 seconds. The container must pass a /ping check within 30 seconds of container startup, otherwise SageMaker will cancel the deployment and declare it a failure. This means that all of your pre-serving configuration must complete within 30 seconds (unless you raise this limit in config).

Once the container is running and serving, the SageMaker job runner will continue to run /ping every five seconds. Failing a /ping check will temporarily mark your service unavailable.

So, without further ado, here’s a simple app.py using flask that implements all of the necessary logic:
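A minimal sketch of such an app.py (the model-loading and prediction steps are stubbed out as comments; the demo version unpacks the Keras artifact from /opt/ml/model and runs it on the payload):

```python
# Sketch of app.py: a flask service exposing /ping and /invocations on the
# contract SageMaker expects.
from flask import Flask, Response, request

app = Flask(__name__)

# At startup, load the artifact SageMaker injected into the container, e.g.:
#   clf = keras.models.load_model('/opt/ml/model/clf.h5')

@app.route('/ping', methods=['GET'])
def ping():
    # Health check: SageMaker polls this until it returns status 200.
    return Response(status=200)

@app.route('/invocations', methods=['POST'])
def invocations():
    payload = request.get_data().decode('utf-8')
    # Parse the payload and run the model here; this stub echoes the
    # payload back in place of clf.predict(...).
    predictions = payload
    return Response(response=predictions, status=200)

# In the container, serve on port 8080 as SageMaker requires:
#   app.run(host='0.0.0.0', port=8080)
```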

In our example image this app is ultimately what is run whenever we deploy . Once deployment succeeds you can use the SageMaker API to make requests (provided they have the right authorization, of course) using predictor.predict(your_data) . For example:

```python
import pandas as pd

X_test = (
    pd.read_csv("./fashion-mnist_train.csv")
    .head()
    .iloc[:, 1:]
    .values
)

input = "\n".join(
    [",".join(l) for l in X_test.astype('str').tolist()]
)

predictor.predict(input)
# b'4,\n9,\n4,\n0,\n3'
```

Finally, there will come a time when you need to tear the endpoint down. To do so, run sess.delete_endpoint(predictor.endpoint) .

Step 4b: Deploy the model as a batch job

The alternative to deploying the model you’ve built as a web endpoint is deploying it as a batch job. You can use the same image for both; the difference is that instead of using the image to stand up a web endpoint, SageMaker will use it to run the model over your input.

Batch predictions allow for passing large amounts of data through your model, as they do not have the 60-second timeout present on web endpoint-based prediction jobs. Prediction jobs start by passing an S3 object you specify to your algorithm, and they finish by writing the combined predictions to another file (the path to which is dependent on your model). Thus Amazon S3 acts as both the read source and the write sink for batch prediction.
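A batch job can be scheduled with the sagemaker SDK’s Transformer class. A sketch, assuming a model already registered with SageMaker (for example by the deployment step above) under a hypothetical name:

```python
# Sketch of a batch transform job (v1 sagemaker SDK). The model name and
# S3 paths below are hypothetical placeholders.
def run_batch_job():
    from sagemaker.transformer import Transformer

    transformer = Transformer(
        model_name="quilt-sagemaker-demo",            # hypothetical model name
        instance_count=1,
        instance_type="ml.c4.2xlarge",
        output_path="s3://your-bucket/predictions/",  # where results land
    )
    # Reads the input from S3, runs it through the container, and writes
    # the predictions back to output_path; wait() blocks until done.
    transformer.transform(data="s3://your-bucket/batch-input.csv")
    transformer.wait()
```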

Conclusion

Over the course of this article we saw how SageMaker custom builds allow us to train a flexible machine learning model, then run that model online as either a web endpoint or a batch prediction job. You are now hopefully well-equipped to run your own machine learning model builds on AWS!

As I alluded to in the introduction, the major practical benefit of AWS SageMaker is that it handles your compute for you, allowing you to dash off jobs and stand up experimental services with ease. SageMaker can even be used in production, though that’s a whole other, much more complicated can of worms that we won’t cover here.

Feel free to use the simple image we built in this article as a template for your own custom SageMaker machine learning models!

Addendum

There is one more thing worth noting. Revisiting run.sh , the container ENTRYPOINT and the script that actually gets executed when we train or deploy our model:

Recall that SageMaker injects our data into the image at runtime. If you try to run the image locally, those files will not auto-magically appear, and the container won’t work. Even if you only ever intend to deploy on AWS, the inability to usefully run your image locally is a huge pain.