by Kirill Dubovikov

Machine Learning as a Service with TensorFlow

Imagine this: you’ve gotten aboard the AI Hype Train and decided to develop an app which will analyze the effectiveness of different chopstick types. To monetize this mind-blowing AI application and impress VC’s, we’ll need to open it to the world. And it’d better be scalable as everyone will want to use it.

As a starting point, we will use this dataset, which contains measurements of food pinching efficiency of various individuals with chopsticks of different length.

Architecture

As we are not only data scientists but also responsible software engineers, we’ll first draft out our architecture. First, we’ll need to decide on how we will access our deployed model to make predictions. For TensorFlow, a naive choice would be to use TensorFlow Serving. This framework allows you to deploy trained TensorFlow models, supports model versioning, and uses gRPC under the hood.

The main caveat about gRPC is that it’s not very public-friendly compared to, for example, REST services. Anyone with minimal tooling can call a REST service and quickly get a result back. But when you are using gRPC, you must first generate client code from proto files using special utilities and then write the client in your favorite programming language.

TensorFlow Serving simplifies a lot of things in this pipeline, but still, it’s not the easiest framework for consuming API on the client side. Consider TF Serving if you need lightning-fast, reliable, strictly typed API that will be used inside your application (for example as a backend service for a web or mobile app).

We will also need to satisfy non-functional requirements for our system. If lots of users might want to know their chopstick effectiveness level, we’ll need the system to be fault tolerant and scalable. Also, the UI team will need to deploy their chopstick’o’meter web app somewhere too. And we’ll need resources to prototype new machine learning models, possibly in a Jupyter Lab with lots of computing power behind it. One of the best answers to those questions is to use Kubernetes.

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.

With Kubernetes in place, given knowledge and some time, we can create a scalable in-house cloud PaaS solution that gives infrastructure and software for full-cycle data science project development. If you are unfamiliar with Kubernetes, I suggest you watch this:

Kubernetes runs on top of Docker technology, so if you are unfamiliar with it, it may be good to read the official tutorial first.

All in all, this is a very rich topic that deserves several books to cover completely, so here we will focus on a single part: moving machine learning models to production.

Considerations

Yes, this dataset is small. And yes, applying deep learning might not be the best idea here. Just keep in mind that we are here to learn, and this dataset is certainly fun. The modeling part of this tutorial will be lacking in quality, as the main focus is on the model deployment process.

Also, we need to impress our VC’s, so deep learning is a must! :)

Image source.

Code

All code and configuration files used in this post are available in a companion GitHub repository.

Training Deep Chopstick classifier

First, we must choose a machine learning framework to use. As this article is intended to demonstrate TensorFlow serving capabilities, we’ll choose TensorFlow.

As you may know, there are two ways we can train our classifier: using TensorFlow and using TensorFlow Estimator API. Estimator API is an attempt to present a unified interface for deep learning models in a way scikit-learn does it for a set of classical ML models. For this task, we can use tf.estimator.LinearClassifier to quickly implement Logistic Regression and export the model after training has completed.

The other way we can do it is to use plain TensorFlow to train and export a classifier:

Setting up TensorFlow Serving

Serving, huh?

So, you have an awesome deep learning model with TensorFlow and are eager to put it into production? Now it’s time to get our hands on TensorFlow Serving.

TensorFlow Serving is based on gRPC — a fast remote procedure call library which uses another Google project under the hood — Protocol Buffers.

Protocol Buffers is a serialization framework that allows you to transform objects from memory to an efficient binary format suitable for transmission over the network.

To recap, gRPC is a framework that enables remote function calls over the network. It uses Protocol Buffers to serialize and deserialize data.

The main components of TensorFlow Serving are:

Servable — this is basically a version of your trained model exported in a format suitable for TF Serving to load

— this is basically a version of your trained model exported in a format suitable for TF Serving to load Loader — TF Serving component that, by coincidence, loads servables into memory

— TF Serving component that, by coincidence, loads servables into memory Manager — implements servable’s lifecycle operations. It controls servable’s birth (loading), long living (serving), and death (unloading)

— implements servable’s lifecycle operations. It controls servable’s birth (loading), long living (serving), and death (unloading) Core — makes all components work together (the official documentation is a little vague on what the core actually is, but you can always look at the source code to get a hang of what it does)

You can read a more in-depth overview of TF Serving architecture at the official documentation.

To get a TF Serving-based service up and running, you will need to:

Export the model to a format compatible with TensorFlow Serving. In other words, create a Servable. Install or compile TensorFlow Serving Run TensorFlow Serving and load the latest version of the exported model (servable)

Setting up TernsorFlow Serving can be done in several ways:

Building from source. This requires you to install Bazel and complete a lengthy compilation process

Using a pre-built binary package. TF Serving is available as a deb package.

To automate this process and simplify subsequent installation to Kubernetes, we created a simple Dockerfile for you. Please clone the article’s repository and follow the instructions in the README.md file to build TensorFlow Serving Docker image:

➜ make build_image

This image has TensorFlow Serving and all dependencies preinstalled. By default, it loads models from the /models directory inside the docker container.

Running a prediction service

To run our service inside the freshly built and ready to use TF Serving image, be sure to first train and export the model (or if you’re using a companion repository, just run the make train_classifier command).

After the classifier is trained and exported, you can run the serving container by using the shortcut make run_server or by using the following command:

➜ docker run -p8500:8500 -d --rm -v /path/to/exported/model:/models tfserve_bin

-p maps ports from the container to the local machine

maps ports from the container to the local machine -d runs the container in daemon (background) mode

runs the container in daemon (background) mode --rm removes the container after it has stopped

removes the container after it has stopped -v maps the local directory to a directory inside the running container. This way we pass our exported models to the TF Serving instance running inside the container

Calling model services from the client side

To call our services, we will use grpc tensorflow-serving-api Python packages. Please notice that this package is currently only available for Python 2, so you should have a separate virtual environment for the TF Serving client.

To use this API with Python 3, you’ll either need to use an unofficial package from here then download and unzip the package manually, or build TensorFlow Serving from the source (see the documentation). Example clients for both Estimator API and plain TensorFlow are below:

Going into production with Kubernetes

If you have no Kubernetes cluster available, you may create one for local experiments using minikube, or you can easily deploy a real cluster using kubeadm.

We’ll go with the minikube option in this post. Once you have installed it ( brew cask install minikube on Mac) we may start a local cluster and share its Docker environment with our machine:

➜ minikube start...➜ eval $(minikube docker-env)

After that, we’ll be able to build our image and put it inside the cluster using

➜ make build_image

A more mature option would be to use the internal docker registry and push the locally built image there, but we’ll leave this out of scope to be more concise.

After having our image built and available to the Minikube instance, we need to deploy our model server. To leverage Kubernetes’ load balancing and high-availability features, we will create a Deployment that will auto-scale our model server to three instances and will also keep them monitored and running. You can read more about Kubernetes deployments here.

All Kubernetes objects can be configured in various text formats and then passed to kubectl apply -f file_name command to (meh) apply our configuration to the cluster. Here is our chopstick server deployment config:

Let’s apply this deployment using the kubectl apply -f chopstick_deployment.yml command. After a while, you’ll see all components running:

➜ kubectl get allNAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGEdeploy/chopstick-classifier 3 3 3 3 1d

NAME DESIRED CURRENT READY AGErs/chopstick-classifier-745cbdf8cd 3 3 3 1d

NAME AGEdeploy/chopstick-classifier 1d

NAME AGErs/chopstick-classifier-745cbdf8cd 1d

NAME READY STATUS RESTARTS AGEpo/chopstick-classifier-745cbdf8cd-5gx2g 1/1 Running 0 1dpo/chopstick-classifier-745cbdf8cd-dxq7g 1/1 Running 0 1dpo/chopstick-classifier-745cbdf8cd-pktzr 1/1 Running 0 1d

Notice that based on the Deployment config, Kubernetes created for us:

Deployment

Replica Set

Three pods running our chopstick-classifier image

Now we want to call our new shiny service. To make this happen, first we need to expose it to the outside world. In Kubernetes, this can be done by defining Services. Here is the Service definition for our model:

As always, we can install it using kubectl apply -f chopstick_service.yml . Kubernetes will assign an external port to our LoadBalancer, and we can see it by running

➜ kubectl get svcNAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGEchopstick-classifier LoadBalancer 10.104.213.253 <pending> 8500:32372/TCP 1dkubernetes ClusterIP 10.96.0.1 <none> 443/TCP 1d

As you can see, our chopstick-classifier is available via port 32372 in my case. It may be different in your machine, so don’t forget to check it out. A convenient way to get the IP and port for any Service when using Minikube is running the following command:

➜ minikube service chopstick-classifier --urlhttp://192.168.99.100:32372

Inference

Finally, we are able to call our service!

python tf_api/client.py 192.168.99.100:32372 1010.0Sending requestoutputs { key: "classes_prob" value { dtype: DT_FLOAT tensor_shape { dim { size: 1 } dim { size: 3 } } float_val: 3.98174306027e-11 float_val: 1.0 float_val: 1.83699980923e-18 }}

Before going to real production

As this post is meant mainly for educational purposes and has some simplifications for the sake of clarity, there are several important points to consider before going to production: