When you have a team of data scientists and engineers working on a model, things can get messy. Data is constantly being copied across machines, tweaks are being made on an ad hoc basis, and eventually, you end up with a model you cannot explain or reproduce.

Data Version Control (DVC) solves this problem. Using a version control workflow that will be immediately familiar to anyone who has used Git, DVC stores your model weights and training data in a centralized location, allowing collaborators to get started easily, while also tracking changes and ensuring an accurate version history.

In this tutorial, we’re going to use DVC to create a model capable of analyzing StackOverflow posts and recognizing which ones are about Python. We are then going to deploy our model as a web API, ready to form the backend of a piece of production software.

While DVC makes your machine learning experiments reproducible, to deploy your model as a production backend, we’ll need another tool. As a final step in this tutorial, we’re going to integrate DVC with another open source tool—Cortex—that allows us to deploy DVC-generated models as web APIs, ready for production.

Let’s start with setting up our DVC project.

Step 1. Set up your DVC project

To begin, fork and clone this completed DVC project into your project directory. It’s important that you fork the repo, because you’re going to be pushing changes to it later.

Next, we’ll install DVC (with S3 support) and pull in our data, which DVC stores in its remote cache:

$ pip install "dvc[s3]"

$ dvc pull

In your cloned project directory, you’ll notice that you have many .dvc files. DVC creates one for each stage in your pipeline. Each .dvc file contains a reference to the code and data used at that stage. To better visualize the stages of your pipeline, you can run dvc pipeline show evaluate.dvc:

$ dvc pipeline show evaluate.dvc

data/data.xml.dvc
prepare.dvc
featurize.dvc
train.dvc
evaluate.dvc

In the next step, we’re going to use these .dvc files to export our model to a remote storage platform.

Step 2. Export and upload your model

By running dvc repro STAGENAME.dvc, you can reproduce any experiment up to that stage. This means we don’t have to go step-by-step through the stages of this pipeline. Instead, we’ll just reproduce the entire experiment by running dvc repro on the last stage.

$ dvc repro evaluate.dvc

DVC will run every stage up to and including evaluate.dvc, and export new model.pkl and pipeline.pkl files.

Now, to deploy our model as an API, we need to upload our model.pkl and pipeline.pkl files to remote storage (we’ll be using S3). Fortunately, DVC has this functionality built in.

As with Git, all you need to do to upload your DVC-tracked data to a remote platform is define a remote URL and push to it:

$ dvc remote add -d aws s3://your-bucket/

$ dvc push

DVC should immediately begin uploading your files to your S3 bucket. If you don’t want to run through this step yourself, you can use a bucket we’ve set up at s3://cortex-examples/dvc.

DVC also records where your S3 bucket is located in a hidden config file called .dvc/config, so for the rest of the workflow (including our deployment) to find the new remote, we need to commit that change to our Git repository. To do this, simply run:

$ git add --all

$ git commit

$ git push

Step 3. Deploy your model with Cortex

With the model stored in S3 and our git repo updated, we can finally deploy our model as an API.

There are many ways to deploy a model, but for this tutorial, we’re going to use Cortex because of its ease of implementation. We’ll just need to run one command from the terminal, and most of the infrastructure work needed to deploy, scale, and monitor our API will be done for us.

First, you’ll have to install Cortex if you haven’t already. Once Cortex is installed and configured, you’re ready to deploy. For this deployment, Cortex requires three files:

A predictor.py file that loads your model and serves predictions.

A requirements.txt file that lists predictor.py’s dependencies.

A cortex.yaml file that configures your deployment.

All of those files are included in your cloned repo, but we’re going to briefly touch on some key points.

Let’s start with predictor.py. This file has pretty basic logic: all you need to do is initialize a model and define a predict() method that takes a query and returns a response.

DVC stores our files in S3 under content-addressed paths, and the .dvc metafiles in the repo record which paths belong to which outputs. Inside predictor.py, we use the DVC library to resolve those references and load the pickled model and pipeline.
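The full file ships with your cloned repo; as a rough sketch of what it does, here is a minimal version that uses DVC’s Python API (dvc.api.read) to fetch the two pickles from the remote and serve predictions through the PythonPredictor interface that Cortex expects (check your Cortex version’s docs for the exact signature). The dvc_repo config key, the payload’s text field, and the scikit-learn-style transform/predict calls are assumptions for illustration, so treat this as an approximation rather than the repo’s exact code:

# predictor.py (sketch): load DVC-tracked artifacts and serve predictions.
# Assumes model.pkl and pipeline.pkl are DVC outputs at the repo root and that
# both are scikit-learn objects; adjust paths and keys to match your project.
import pickle

import dvc.api


class PythonPredictor:
    def __init__(self, config):
        # config comes from cortex.yaml; "dvc_repo" is assumed to hold the Git repo URL.
        repo = config["dvc_repo"]

        # dvc.api.read fetches a DVC-tracked file's contents from the remote cache.
        self.pipeline = pickle.loads(dvc.api.read("pipeline.pkl", repo=repo, mode="rb"))
        self.model = pickle.loads(dvc.api.read("model.pkl", repo=repo, mode="rb"))

    def predict(self, payload):
        # Featurize the incoming post text, then return the predicted tag.
        features = self.pipeline.transform([payload["text"]])
        is_python = self.model.predict(features)[0]
        return "python" if is_python else "other"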

Nothing too crazy: we just need the DVC library to resolve the paths we need out of the DVC metafiles. You can see the entirety of predictor.py in your cloned repo.

The requirements.txt file is a straightforward list of our dependencies, so we won’t go in depth here, but cortex.yaml deserves some attention too.

cortex.yaml is Cortex’s standard config file. It serves as the blueprint for your deployment: it names the API, points Cortex at predictor.py, and passes configuration, including the DVC repo URL, through to the predictor.

One edit to make: replace the dvc_repo URL with the URL of your own fork.

With those files in place, all you have to do is run cortex deploy from your command line, and your API should spin up:

$ cortex deploy

creating tagger

You can check on your deployment at any time by running cortex get:

$ cortex get

api      status   up-to-date   available   requested
tagger   live     1            1           1

Assuming your API is live, you can hit it from any technology capable of making HTTP requests. We’ll just test it with curl for now, and we’ll hit our endpoint with the sample.json file included in your repo, which contains a sample StackOverflow post.

And our curl request will look like this:

$ cortex get tagger

url: http://***.us-west-2.elb.amazonaws.com/tagger/stackoverflow

$ curl http://***.us-west-2.elb.amazonaws.com/tagger/stackoverflow \
    -X POST -H "Content-Type: application/json" \
    -d @sample.json

"python"
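If you’d rather call the endpoint from Python than from curl, a minimal client might look like the sketch below. It simply reads the repo’s sample.json and posts it to the URL printed by cortex get tagger (substitute your own endpoint); the requests library and the placeholder URL are assumptions for illustration:

# query_api.py (sketch): post the repo's sample.json to the deployed tagger API.
# Replace API_URL with the endpoint printed by `cortex get tagger`.
import json

import requests

API_URL = "http://***.us-west-2.elb.amazonaws.com/tagger/stackoverflow"

with open("sample.json") as f:
    payload = json.load(f)

response = requests.post(API_URL, json=payload)
response.raise_for_status()
print(response.text)  # e.g. "python" for a Python-related post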

And voila—you have a functioning StackOverflow tagging API.

Free and open source, from experiment to production

With the way your pipeline is now set up, any time you reproduce your experiment and end up with a new model, you can update your remote data with dvc push. Then, all you have to do is run cortex deploy again, and Cortex will swap in your new model using rolling updates, meaning no downtime for your API.

In essence, your entire pipeline from experimentation to deployment is now abstracted away into a few simple commands, using entirely free and open source software.