Diagram giving an overview of the CI/CD deployment process with Kubernetes. Taken from a KubeCon slideshow

Recently, academic and industry researchers have conducted a lot of exciting and ground-breaking research in the field of deep learning, producing many incredibly powerful new models. However, much of this work (outside of the few tech giants) remains just that: research, not part of a production application. Despite the constant flood of new papers, it remains incredibly difficult to actually use any of these models in production, even when the papers provide code. Deploying machine learning models remains a significant challenge. In this article I will provide an overview of various ways to “productionize” machine learning models and weigh their respective pros and cons without going into too much detail. In subsequent articles I will explore these approaches with actual code and examples.

The Challenge

We will assume that you have already trained the model yourself or have the trained weights available from the internet. In order to use your model in an application you will have to:

1. Load your model with its weights
2. Preprocess your data
3. Perform the actual prediction
4. Handle the prediction response data

Sounds simple enough? Well, in practice this process can actually be quite complicated.
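To make the four steps concrete, here is a minimal sketch assuming a scikit-learn classifier saved with joblib; the file name, feature names, and response format are placeholders rather than anything from a real project:

```python
# Minimal sketch of the four steps above, assuming a scikit-learn
# classifier saved with joblib. File name, feature order, and response
# fields are placeholders.
import joblib
import numpy as np

# 1. Load the model with its weights
model = joblib.load("model.joblib")  # hypothetical path

def predict(raw_input: dict) -> dict:
    # 2. Preprocess the data into the shape the model expects
    features = np.array([[raw_input["feature_a"], raw_input["feature_b"]]])
    # 3. Perform the actual prediction
    probs = model.predict_proba(features)[0]
    # 4. Handle the prediction response data
    return {"label": int(probs.argmax()), "confidence": float(probs.max())}
```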

As with many things, there is no single clear-cut answer about the best way to deploy a machine learning model to a production environment. The questions you should ask yourself are:

- What are my requirements? (i.e. how many requests are you expecting per second, what latency is required, etc.)
- How will I evaluate the model’s performance in production? (and how will I collect and store the additional data from interactions)
- How frequently do I plan on re-training my model?
- What are the data preprocessing needs? Will the format of the production input data differ drastically from the model training data? Will it come in batches or as a stream?
- Does the model need to be able to run offline?

These are the basic questions that you should ask before attempting to deploy your model.

Loading model directly into application

This option essentially treats the model as part of the overall application and loads it within the application itself. This approach is easier in some circumstances than in others.

For instance, if the core application itself is written in Python, the process can be smooth. Altogether, it usually requires adding the model’s dependencies to your setup/config files and modifying your predict function so it is called through the appropriate user interactions. The model is loaded as part of the application, and all of its dependencies must be included in the application.
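In a Python application this can be as simple as loading the model once at start-up and calling it from whatever function handles the user interaction. A rough sketch, assuming a PyTorch model class and a weights file that ship with the application (all names here are placeholders):

```python
# Sketch of loading a model as part of the application itself, assuming a
# PyTorch model class MyNet defined elsewhere in the code base and a
# weights file shipped with the package. All names are placeholders.
import torch

from myapp.models import MyNet  # hypothetical application module

# Load once at application start-up and reuse for every interaction
_model = MyNet()
_model.load_state_dict(torch.load("weights.pt", map_location="cpu"))
_model.eval()

def handle_user_action(tensor_input: torch.Tensor) -> int:
    """Called from the rest of the application whenever a prediction is needed."""
    with torch.no_grad():
        logits = _model(tensor_input)
    return int(logits.argmax(dim=1).item())
```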

This process becomes more difficult if your application is not written in Python. For instance, there is no good way to load PyTorch or Caffe models into Java programs. Even TensorFlow, which has a Java library, requires writing a lot of additional code to fully integrate into the application. Finally, this approach does not explicitly address the problem of scalability.

However, as stated previously, this route remains beneficial when you want to quickly deploy an application written in Python. It also remains one of the better options for devices without an internet connection.

Calling an API

The second option involves making an API and calling it from your application. This can be done in a number of different ways; I have detailed the most common ones here.

Kubernetes

Docker in many respects seems like a natural choice for deploying machine learning models. A model and all of its dependencies can be neatly packaged in one container, and the service can automatically scale up by adding more containers when needed. Kubernetes is one of the best ways to manage Docker containers at scale, and is therefore a good fit for serving machine learning models.

Recently, the Kubernetes community unveiled Kubeflow, which aims to bring machine learning to the Kubernetes framework. Kubeflow attempts to make it easy to train, test, and deploy your model, as well as collect evaluation metrics. I plan on covering Kubeflow in a later blog post of its own, as it can be quite complicated. For now, just understand that it is a full-scale package aimed at making it easy to develop and deploy machine learning microservices at scale.

Custom REST API with Flask/Django

Another option is to create your own REST API from scratch (this option could also be combined with Docker), depending on how familiar you are with making APIs. This can be done relatively easily using Flask. I will not go into more detail on how to do this, as there are a number of existing tutorials that cover exactly this topic. Depending on the number of requests, you can usually scale Flask without too much trouble.
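To give a rough idea of the shape this takes, here is a bare-bones sketch of a Flask prediction endpoint, again assuming a scikit-learn model saved with joblib; the route, file path, and payload format are illustrative only:

```python
# Minimal sketch of a Flask prediction endpoint. Paths, feature format,
# and the route name are illustrative placeholders.
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # load once, at start-up

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    features = np.array([payload["features"]])
    probs = model.predict_proba(features)[0]
    return jsonify(label=int(probs.argmax()), confidence=float(probs.max()))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```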

However, even this can get quite messy due to the differences between ML frameworks, their load times, and model-specific preprocessing requirements. Currently, I’m working on a model-agnostic wrapper class for use with Flask/Django to make it easier to use models and to provide a standardized template for working with them. So instead of having to remember and implement different functions for different models, you can call model.preprocess() and model.predict() regardless of the backend (more on this in another article as well).
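To give a sense of what such a standardized template could look like, here is an illustrative sketch; this is not the actual class I’m building, just one way the interface might be structured:

```python
# Illustrative sketch of a backend-agnostic model interface. All class and
# method names are hypothetical.
from abc import ABC, abstractmethod
from typing import Any

import joblib
import numpy as np

class BaseModel(ABC):
    """Common interface so application code never touches framework details."""

    @abstractmethod
    def preprocess(self, raw_input: Any) -> Any:
        """Convert raw request data into the framework-specific input format."""

    @abstractmethod
    def predict(self, processed_input: Any) -> Any:
        """Run inference and return a framework-agnostic result (e.g. a dict)."""

class SklearnModel(BaseModel):
    """Example backend: a scikit-learn model loaded from disk with joblib."""

    def __init__(self, path: str):
        self._model = joblib.load(path)

    def preprocess(self, raw_input: dict) -> np.ndarray:
        return np.array([raw_input["features"]])

    def predict(self, processed_input: np.ndarray) -> dict:
        return {"label": int(self._model.predict(processed_input)[0])}
```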

AWS Lambda/Serverless

AWS Lambda is another possible route. You can read AWS’s documentation on how to set this up. There is a good article on using AWS Lambda with Caffe2 in “Machine Learnings.”
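A minimal Lambda handler for inference might look roughly like the following, assuming the model file is bundled in the deployment package, is small enough to load within Lambda’s limits, and the function sits behind an API Gateway proxy integration; names and paths are placeholders:

```python
# Sketch of an AWS Lambda inference handler, assuming the model file is
# bundled with the deployment package. Names and paths are placeholders.
import json

import joblib
import numpy as np

model = joblib.load("model.joblib")  # loaded once per warm container

def lambda_handler(event, context):
    payload = json.loads(event["body"])
    features = np.array([payload["features"]])
    prediction = int(model.predict(features)[0])
    return {"statusCode": 200, "body": json.dumps({"label": prediction})}
```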

Other approaches

Apache Beam: I don’t know too much about this approach, but it seems to involve using Beam to do the model preprocessing and then TensorFlow (the only framework supported for now) to do the actual prediction. See these slides for more info.

Spark/Flink

Several of the major data processing frameworks, such as Spark and Flink, are in the process of developing packages to aid in the deployment of machine learning models. Flink has Flink TensorFlow, which aims at integrating TensorFlow models into Flink streaming pipelines. With Spark, there are a number of different packages and attempts at integrating deep learning. Additionally, Yahoo recently released its own library that allows both distributed training and model serving.

Conclusion

In future articles I will go into more detail about building an API with Kubeflow to serve models, as well as discuss how to load deep learning models into Java programs and Flink pipelines.