Deploying a deep learning model to production is no easy task. Usually, models are prepared by people with a math and statistics background who are not savvy in building production-grade, 24/7, low-latency systems. And software engineers, who have experience building and supporting such systems, usually have no idea what a deep learning model is or how to deploy one. If there is a strict requirement for low latency and high throughput, deploying such a system can become a real headache for the whole team.

Luckily, the developers of deep learning frameworks understood that research is not the only destination and that the day of deployment will come. Let's talk about Tensorflow. It provides a means for model deployment: Tensorflow Serving. It's a nice API for model deployment and also gives some boost in inference speed. But what if you need more speed, more throughput or more efficient hardware utilization? For some time there was one painful way: use TensorRT 2.0. TensorRT is a low-level library; it's as close to Nvidia hardware as possible (TensorRT is developed by Nvidia). Take no offense, it's a great library, but it's a pure C++ library. Usually, people who have DL skills love Python and don't like C++, while people who love C++ give all their love to C++ and don't learn new hyped things like DL. TensorRT provides a number of model optimizations for inference, such as layer and tensor fusion, precision calibration, kernel auto-tuning and others. All this results in an impressive boost in inference speed.
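For reference, here's a minimal sketch of what exporting a model for Tensorflow Serving looks like with the TF 1.x SavedModel API. The model itself, the tensor names and the export path are made up for illustration; in practice you'd restore your trained graph instead:

```python
import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    # Stand-in model: a placeholder input and a single dense layer.
    x = tf.placeholder(tf.float32, [None, 224, 224, 3], name="input")
    y = tf.layers.dense(tf.reshape(x, [-1, 224 * 224 * 3]), 10, name="logits")
    sess.run(tf.global_variables_initializer())

    # Serving loads SavedModels; the "1" subdirectory is the model version.
    builder = tf.saved_model.builder.SavedModelBuilder("export/1")
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={"input": x}, outputs={"logits": y})
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING],
        signature_def_map={"predict": signature})
    builder.save()
```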

Now, suppose your team has both: a DL engineer who knows how to build models and a C++ engineer who knows how to write sleek, efficient C++ code. Everything now depends on one question: how do you export the model built by the DL engineer and use it in the C++ code built by the SW engineer? TensorRT 2.0 has support for the Caffe framework. But who uses Caffe these days? Some people do, and with the DIGITS environment it has a certain appeal. But most DL people use Tensorflow, PyTorch or other second-generation frameworks. So you get the not-very-pleasant task of converting a Tensorflow graph and checkpoint into the protobuf graph definition and model weights used by Caffe.

But with the new 3.0 release, it's all solved! TensorRT 3.0 has a Python interface and more. A new unified format for neural networks is introduced: UFF. Models can now be trained in Tensorflow, exported to UFF and used by TRT3. Facebook is trying to push ONNX, but believe me, UFF is far ahead and deserves your attention.
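To give a taste of how simple the export is, here's a rough sketch of the conversion with the `uff` package that ships with TensorRT 3. The file path and output node name are placeholders; the input graph must already be frozen (variables folded into constants):

```python
import tensorflow as tf
import uff

# Load a frozen GraphDef from disk.
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graphdef = tf.GraphDef()
    graphdef.ParseFromString(f.read())

# Convert the graph to UFF, naming the output nodes explicitly,
# and save the serialized buffer for later use.
uff_model = uff.from_tensorflow(graphdef, ["logits/BiasAdd"])
with open("model.uff", "wb") as f:
    f.write(uff_model)
```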

So, what is this post about? First I'll tell you how to get all the new stuff. Then I'll walk through a small example of how to export a Tensorflow model to UFF. Then I'll walk through an example of how to do inference in Python using TensorRT 3 (for people who like wrapping a model in Python code and serving it as a REST API, this will be very useful).
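As a teaser, the whole TensorRT 3 Python inference flow fits in a couple dozen lines. Here's a rough sketch based on the TRT3 Python API and pycuda; the input/output names, shapes and sizes are illustrative, not from a real model:

```python
import numpy as np
import pycuda.autoinit               # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt
from tensorrt.parsers import uffparser

G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.ERROR)

# Parse the UFF model produced at the export step (CHW input shape).
parser = uffparser.create_uff_parser()
parser.register_input("input", (3, 224, 224), 0)
parser.register_output("logits/BiasAdd")
uff_model = open("model.uff", "rb").read()   # serialized UFF buffer
engine = trt.utils.uff_to_trt_engine(G_LOGGER, uff_model, parser,
                                     1,        # max batch size
                                     1 << 25)  # max workspace, bytes

# Copy input to the GPU, run inference, copy the result back.
context = engine.create_execution_context()
img = np.random.rand(3, 224, 224).astype(np.float32)  # stand-in input
output = np.empty(10, dtype=np.float32)
d_input = cuda.mem_alloc(img.nbytes)
d_output = cuda.mem_alloc(output.nbytes)
stream = cuda.Stream()
cuda.memcpy_htod_async(d_input, img, stream)
context.enqueue(1, [int(d_input), int(d_output)], stream.handle, None)
cuda.memcpy_dtoh_async(output, d_output, stream)
stream.synchronize()
print(output)
```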