There are tutorials for everything with AI nowadays. How to do Object Detection, Image Classification, NLP, build a chatbot... the list goes on.

But when I looked for information on how to properly scale AI, I found very little content. Even more surprisingly, the few resources that did exist seemed to reiterate the same few points:

build your model with a scalable framework like TensorFlow

either package it into your client (TF.js, TF Lite, TF-slim, etc.) or deploy it as a microservice with containers

I was more interested in the second portion since I had already developed a model, but I was surprised that there was little detail on how to actually achieve this, and even less information on the shortcomings of each solution. After a few days of research and scaling AI at Crane.ai, I put together some more information on the deployment options, their drawbacks, and how to go about optimizing your TensorFlow model at a low level.

Packaging it into your client — it sucks!

One of the most commonly used techniques is to package the AI into your client of choice using tools like TensorFlow.js, TF Lite, or TensorFlow Slim. I won’t go into too much detail about how these frameworks operate, and will instead focus on their drawbacks.

Computational Power. The issue with deploying many of these models is that they require an immense amount of memory (I’m talking past mobile app or browser limits, i.e. more than 1–2 GB of RAM). Many mobile phones do not have this power, and in desktop browsers inference will lag the UI thread while also slowing down the user’s computer, heating it up, and spinning up the fans.

Inference Time. When you’re running models on a device with unknown computational power, the inference time is generally also unknown; these aren’t GPU-powered, high-RAM, high-CPU machines, they’re mobile phones, browsers, and desktop apps running on average computers. Inference with some larger models can easily take over a minute, which is a huge no from a user-experience perspective.

Stolen from a Reddit parody of XKCD 303

Large file. Unfortunately, most models are stored in files that are quite large (we’re talking tens or hundreds of MB). As a result they are slow and memory-intensive to load, and they increase your app bundle’s size by a large amount.

Insecure. Unless you’re using an open source model, you’ll want to keep your AI and pretrained checkpoints relatively under wraps. Unfortunately, when you package your model with your application, not only is your inference code vulnerable to decompilation, but your pretrained checkpoint sits inside the bundle and is easily stolen.

Harder to update. If you update your model, you have two choices on a client. Either the user is issued an update via a centralized manager (the Play Store, App Store, etc.), which leads to frequent large updates (pretty annoying for the users, and a process that can be interrupted or never started depending on their settings), or the application itself fetches the new model checkpoint and metadata. The latter sounds much better, but it means downloading a 100 MB+ file over the user’s perhaps shaky connection; the download will take a while, so your app will have to stay open at least in the background for the process to finish, and you will incur pretty large internet-out costs (depending on your cloud).

Lack of trainability. Training models on new user data provides a level of personalization while improving accuracy and building up a core, high-signal dataset. Unfortunately, most devices lack the computational power to train the model, and even if they didn’t, it wouldn’t be possible to propagate the effects of training back to your server or to other devices running the application.

These drawbacks make deploying and maintaining a large neural network close to impossible on clients, and so we’ll strike this one out as an option for scaling our model.

Deploy it as a Cloud endpoint

XKCD 908, and 1117 is also relevant

The cloud is a powerful tool for deploying your models at scale. You can spin up environments that are perfectly customized to your needs, containerize your application, and scale horizontally in an instant, all while providing an SLA and uptime that rival those of big corporations.

For most TensorFlow models, the deployment cycle is the same:

Freeze your graph into a Protobuf binary

Adjust your inference code to work with a frozen graph

Containerize your application

Add an API layer on top

The first piece is relatively simple. “Freezing” your graph involves creating a protobuf binary with all the named nodes, weights, architecture, and metadata associated with your checkpoint. This can be done with a variety of tools, the most popular being TF’s own freeze_graph tool, which can freeze any graph given an output node name. You can find out more about this technique and how to go about completing it here.
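For illustration, here is a minimal sketch of freezing a TF 1.x checkpoint in Python; the checkpoint paths and the output node name (“logits”) are placeholders you would swap for your own graph’s names.

```python
import tensorflow as tf

# Rebuild the graph from the checkpoint's metagraph, then restore the weights.
saver = tf.train.import_meta_graph("model.ckpt.meta")  # placeholder path
with tf.Session() as sess:
    saver.restore(sess, "model.ckpt")
    # Bake the variables into constants so the graph is fully self-contained.
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess,
        tf.get_default_graph().as_graph_def(),
        output_node_names=["logits"],  # replace with your output node(s)
    )

# Serialize the frozen graph into a single protobuf binary.
with tf.gfile.GFile("frozen_model.pb", "wb") as f:
    f.write(frozen_graph_def.SerializeToString())
```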

Adjusting your inference code isn’t difficult either; in most cases, your feed_dict will stay the same, and the main difference will be the addition of code to load the frozen model and perhaps the specification of an output node.
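As a sketch, the adjusted inference code might look like this against the frozen graph above; the tensor names (“input:0”, “logits:0”) and the input shape are assumptions about your particular graph.

```python
import numpy as np
import tensorflow as tf

def load_frozen_graph(path):
    """Parse a frozen protobuf binary into a fresh tf.Graph."""
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(path, "rb") as f:
        graph_def.ParseFromString(f.read())
    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name="")
    return graph

graph = load_frozen_graph("frozen_model.pb")
x = graph.get_tensor_by_name("input:0")        # assumed input tensor name
logits = graph.get_tensor_by_name("logits:0")  # assumed output tensor name

batch = np.zeros((1, 224, 224, 3), dtype=np.float32)  # stand-in input
with tf.Session(graph=graph) as sess:
    preds = sess.run(logits, feed_dict={x: batch})  # feed_dict unchanged
```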

Containerization is also pretty trivial: just set up your environment in a Dockerfile (you can use a TF Docker image as your base). It’s when we add an API layer that things get messy. There are generally two ways to go about doing this:

Deploy scaling containers that run an inference script. These containers run a script against the input: the script starts a Session, runs inference, and outputs a result that is piped back to you. This is extremely problematic; adding an API layer that manipulates containers and pipes data in and out is neither easy nor simple with most cloud providers (e.g. AWS has API Gateway, but it isn’t nearly as convenient as you would expect), and it is the least efficient method you could use. The issue here is that you lose valuable time in container startup, hardware allocation, session startup, and inference. If you leave stdin open and keep piping output instead, you’ll speed up your script but lose scalability (you are now hooked to this container’s stdin, and it won’t be able to take multiple requests).

Deploy scaling containers that run an API layer. Although similar in architecture, this is much more efficient for several reasons; placing the API layer inside the containers mitigates most of the issues raised earlier. It takes a little more in resources, but the overhead is minimal and does not imply vertical scaling; it allows each container to stay running, and since the API in this case is decentralized, there is no issue with hooking a specific stdin/stdout to your main request router. This means you get rid of startup time and can maintain speed and horizontal scaling easily while serving multiple requests. You can centralize your containers with a load balancer, and use Kubernetes to guarantee nearly 100% uptime and manage your fleet. It’s simple and effective!
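As a rough sketch of this second pattern, here is what the in-container API layer might look like with Flask; the framework choice, the route, and the tensor names are my own assumptions, not something TF prescribes.

```python
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

def load_frozen_graph(path):
    """Same loader as in the inference snippet above, repeated so this file runs standalone."""
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(path, "rb") as f:
        graph_def.ParseFromString(f.read())
    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name="")
    return graph

app = Flask(__name__)
graph = load_frozen_graph("frozen_model.pb")  # load once, at container startup
x = graph.get_tensor_by_name("input:0")
logits = graph.get_tensor_by_name("logits:0")

@app.route("/predict", methods=["POST"])
def predict():
    batch = np.asarray(request.get_json()["inputs"], dtype=np.float32)
    # Opening a session per request is still wasteful; the session-reuse
    # tip in the next section removes this per-request cost.
    with tf.Session(graph=graph) as sess:
        preds = sess.run(logits, feed_dict={x: batch})
    return jsonify({"predictions": preds.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # the load balancer routes requests here
```

Each container in the fleet runs this server, and the load balancer in front spreads requests across them.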

Deploy your fleet!

The main drawback with decentralizing your API through a container fleet is that the cost can add up to a large sum relatively quickly. This is unfortunately unavoidable with AI, although there are ways to mitigate this a little.

Reuse your sessions. Your fleet grows and shrinks in proportion to the load, so your goal here is to minimize the time it takes to run inference so that each container can free up to handle another request. One way to do this is to reuse the tf.Session and tf.Graph by storing them once initialized and passing them around as global variables; this removes the time it takes for TF to start up a session and build the graph, which will greatly speed up your inference tasks. This method is effective even on a single container and is widely used as a technique to minimize resource reallocation and maximize efficiency.
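A minimal sketch of that session-reuse pattern, assuming the load_frozen_graph helper and tensor names from the earlier snippets:

```python
import tensorflow as tf

_graph = None
_session = None

def get_session():
    """Build the graph and session once; every later call reuses them."""
    global _graph, _session
    if _session is None:
        _graph = load_frozen_graph("frozen_model.pb")  # loader from earlier sketch
        _session = tf.Session(graph=_graph)
    return _graph, _session

def infer(batch):
    """Run inference against the shared graph and session."""
    graph, sess = get_session()
    x = graph.get_tensor_by_name("input:0")
    logits = graph.get_tensor_by_name("logits:0")
    return sess.run(logits, feed_dict={x: batch})
```

The /predict handler from the Flask sketch above would then call infer(batch) instead of opening a session per request.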

Cache input and, if possible, output. The dynamic programming paradigm matters most in AI; by caching input you save the time needed to preprocess it or fetch it from the remote source, and by caching output you save the time needed to run inference. This can be done trivially in Python, although you should ask yourself whether it is right for your use case! Often, your model will be getting better with time, and that would greatly affect your output-caching mechanism. In my own systems I like to use what I call the “80–20” rule. When a model is below 80% accuracy, I don’t cache any output. Once it hits 80%, I start caching and set the cache to expire at a certain accuracy (instead of, say, at a certain point in time). This way, the output changes as the model becomes more accurate, but there is less tradeoff between performance and speed in this 80–20 mitigated cache.
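Here is a rough sketch of that 80–20 cache; the thresholds, the cache key, and the infer helper (from the session-reuse sketch above) are illustrative, and model_accuracy would come from your own evaluation pipeline.

```python
CACHE_START_ACCURACY = 0.80  # don't cache output until the model hits 80%
EXPIRY_ACCURACY_GAIN = 0.05  # evict entries once accuracy improves this much

_output_cache = {}  # input key -> (accuracy when cached, output)

def cached_infer(key, batch, model_accuracy):
    """Serve cached output while the model hasn't improved much since caching."""
    if model_accuracy >= CACHE_START_ACCURACY and key in _output_cache:
        cached_at, output = _output_cache[key]
        if model_accuracy - cached_at < EXPIRY_ACCURACY_GAIN:
            return output  # hit: the model hasn't improved enough to recompute
        del _output_cache[key]  # expire by accuracy, not by wall-clock time
    output = infer(batch)  # session-reuse helper from the previous sketch
    if model_accuracy >= CACHE_START_ACCURACY:
        _output_cache[key] = (model_accuracy, output)
    return output
```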