Lessons learned

We have been using TensorFlow Serving in production for approximately half a year, and our experience with it has been quite smooth. It also delivers good prediction latency. Below is a graph of the 95th percentile of prediction time in seconds for our production TensorFlow Serving instances across a week (approximately 20 milliseconds):

Nevertheless, along the journey of productionising TensorFlow Serving, we learnt a few lessons.

1. Joy and Pain of Model Versioning

We have had a few different versions of the TensorFlow model in production thus far, each with varying characteristics such as network architecture, training data, etc. Gracefully handling the different versions of the model has been a non-trivial task, because the input request passed to TensorFlow Serving often involves a number of pre-processing steps, and these pre-processing steps can vary between model versions. A mismatch between the pre-processing steps and the model version could potentially result in erroneous predictions.
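
To make the risk concrete, here is a minimal sketch (the steps and versions are hypothetical) of keeping the pre-processing keyed to the model version it was built for, so a request is never prepared with the wrong steps:

PREPROCESSORS = {
    # Hypothetical pre-processing per model version: version 2 was trained
    # on whitespace-stripped text, version 1 was not.
    1: lambda text: text.lower().split(),
    2: lambda text: text.lower().strip().split(),
}

def preprocess(text, model_version):
    # Fails loudly if there is no pre-processing registered for the version,
    # rather than silently feeding the model mismatched input.
    return PREPROCESSORS[model_version](text)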

1a. Be explicit about the version you’re after

A simple but useful way that we found to prevent erroneous predictions is to use the optional version attribute specified in the model.proto definition (which compiles to model_pb2.py). This guarantees that your request payload always matches the expected model version.

When you request a given version from the client, e.g. version 5, and the TensorFlow Serving server is not serving that particular version, it will return an error message indicating that the model is not found.
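
As a rough sketch, using the Python gRPC stubs of that era (the model name, input tensor name, host, and port are illustrative), pinning the version looks like this:

from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2
import tensorflow as tf

channel = implementations.insecure_channel("localhost", 8999)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "awesome_model"  # illustrative model name
request.model_spec.version.value = 5       # pin the exact version we expect
request.inputs["x"].CopyFrom(
    tf.contrib.util.make_tensor_proto([[1.0]], dtype=tf.float32))

result = stub.Predict(request, 5.0)  # errors if version 5 is not being served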

1b. Serving up multiple model versions

The default behavior of TensorFlow Serving is to load and serve the latest version of the model.

When we first implemented TensorFlow Serving in September 2016, it did not support serving multiple versions of a model simultaneously; only one version of the model could be served at a given time. This was not sufficient for our use case, as we wanted to serve multiple versions of the model to support A/B testing of different neural network architectures.

One option would be to run multiple TensorFlow Serving processes on different hosts or ports, such that each process serves a different model version. This setup requires either:

- the consumer applications (gRPC clients) to contain switching logic and knowledge of which TensorFlow Serving instance to call for a given version, which adds complexity to the clients and was not preferred; or

- a registry which maps each version to a different TensorFlow Serving instance.

A better solution is for TensorFlow Serving itself to serve multiple versions of the model.

I decided to use one of my lab days to extend TensorFlow Serving to serve multiple versions of a model. At Zendesk, we have the concept of a “lab day”, where we can spend one day every two weeks working on something we are interested in, be it a tool that could improve our day-to-day productivity or a new technology that we are keen to learn.

It had been more than eight years since I last worked on C++ code, but I was impressed by how readable and clean the TensorFlow Serving codebase is, which made it easy to extend. The enhancements to support multiple versions were submitted and have since been merged into the main codebase; the TensorFlow Serving maintainers are quite prompt in providing feedback on patches and enhancements. From the latest master branch, you can start up TensorFlow Serving to serve multiple model versions by passing the extra flag model_version_policy:

/work/serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=8999 --model_base_path=/work/awesome_model_directory --model_version_policy=ALL_VERSIONS

An important point to note is that serving multiple model versions comes with a trade-off: higher memory usage. Therefore, when running with the above flag, remember to remove obsolete model versions from the model base path.
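
With all versions served by a single process, the A/B split can then live entirely in the client. A minimal sketch (the versions and traffic weights are illustrative):

import random

# Hypothetical A/B split: 90% of traffic to the incumbent version,
# 10% to the candidate, both served by the same TensorFlow Serving process.
VERSION_WEIGHTS = [(4, 0.9), (5, 0.1)]

def pick_model_version():
    r = random.random()
    cumulative = 0.0
    for version, weight in VERSION_WEIGHTS:
        cumulative += weight
        if r < cumulative:
            return version
    return VERSION_WEIGHTS[-1][0]

# The chosen version is then pinned on the request, as in the earlier snippet:
# request.model_spec.version.value = pick_model_version()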

2. Compression is Your Friend

When you are deploying a new model version, it's recommended to compress the exported TensorFlow model files into a single compressed file before copying it into the model_base_path. The TensorFlow Serving tutorial contains the steps to export a trained TensorFlow model. The exported TensorFlow model directory generally has the following folder structure (illustrated by the export sketch after the list):

A parent directory named with a version number (e.g. 0000001) that contains the following files:

- saved_model.pb — the serialized model, which includes the graph definition(s) of the model as well as metadata such as signatures.

- variables — files that hold the serialized variables of the graphs.
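
For reference, a minimal TF 1.x export along the lines of the Serving tutorial (the graph and signature here are toy placeholders) produces exactly that layout:

import tensorflow as tf  # TF 1.x APIs, as used by the Serving tutorial of that era

export_dir = "/work/awesome_model_directory/0000001"  # version number as directory name

with tf.Session(graph=tf.Graph()) as sess:
    x = tf.placeholder(tf.float32, shape=[None, 1], name="x")
    w = tf.Variable(2.0, name="w")
    y = tf.multiply(x, w, name="y")
    sess.run(tf.global_variables_initializer())

    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={"x": x}, outputs={"y": y})
    builder.add_meta_graph_and_variables(
        sess,
        [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                signature,
        })
    builder.save()  # writes saved_model.pb and the variables/ directory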

To compress the exported model:

tar -cvzf modelv1.tar.gz 0000001

Why compress it?

It's faster to transfer or copy around. If you copy the exported model folder directly into the model_base_path, the copy process may take a while, and you could end up with the export files copied but the corresponding meta file not yet copied. If TensorFlow Serving starts loading your model and is unable to detect the meta file, the server will fail to load the model and stop trying to load that particular version again.
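
As a sketch of a safer deploy step (the paths and helper are hypothetical), extract the archive into a staging directory and publish the version with a single rename, so TensorFlow Serving never sees a half-copied directory:

import os
import shutil
import tarfile
import tempfile

MODEL_BASE_PATH = "/work/awesome_model_directory"  # illustrative

def deploy_model(archive_path, version):
    # The staging directory lives on the same filesystem so the final rename
    # is atomic; its non-numeric name means TensorFlow Serving ignores it,
    # since only numerically named subdirectories are treated as versions.
    staging = tempfile.mkdtemp(dir=MODEL_BASE_PATH)
    try:
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(staging)
        os.rename(os.path.join(staging, version),
                  os.path.join(MODEL_BASE_PATH, version))
    finally:
        shutil.rmtree(staging, ignore_errors=True)

deploy_model("modelv1.tar.gz", "0000001")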

3. Model size matters

The TensorFlow models that we have are fairly large: between 300 MB and 1.2 GB. We noticed that when the model size exceeded 64 MB, we would get an error while trying to serve the model. This is due to a hardcoded 64 MB limit on the protobuf message size, as described in the following TensorFlow Serving GitHub issue.

As a result, we applied the patch described in the GitHub issue to change the hardcoded constant value. Yuck…. :( This is still a mystery to us. Let us know if you manage to find an alternative way of serving models larger than 64 MB without changing the hardcoded source.

4. Avoid the Source Moving Underneath You

We have been building the TensorFlow Serving source from the master branch because, at the time of implementation, the latest release branch (v0.4) lagged behind master in terms of functionality and bug fixes. Therefore, if you build from source by simply checking out master, the source may change beneath you whenever new changes are merged in. To ensure repeatable builds of the artefact, we found it important to check out a specific commit revision rather than just the tip of master.