Following up on my earlier blogs about training and using TensorFlow models on the edge in Python, in this seventh blog in the series, I want to cover a topic that's generally not talked about enough: optimizing the performance and latency of your TensorFlow models.


While it might not sound as fancy or exciting as other topics I've covered in the past, it's one of the most important ones. Running predictions with a TensorFlow model is a time- and energy-consuming process.

For example, running an edge image classification model on a MacBook Pro takes roughly 0.5 seconds per image. This might not sound like much, but if you're dealing with a lot of images, it's something that should concern both you and your users.
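If you want to know where your own model stands before applying any of these tips, here's a minimal sketch for timing per-image inference. It assumes the model has been exported as a TensorFlow Lite file (the file name `model.tflite` is a placeholder); the input shape and dtype are read from the model itself, so a dummy tensor stands in for a real image.

```python
import time
import numpy as np
import tensorflow as tf

# Load the exported model (placeholder path -- use your own export).
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()

# A dummy input with the shape and dtype the model expects.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])

# Warm-up run: the first invocation is typically slower.
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

# Average latency over several runs.
runs = 20
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(input_details[0]["index"], dummy)
    interpreter.invoke()
elapsed = (time.perf_counter() - start) / runs
print(f"Average inference time: {elapsed * 1000:.1f} ms per image")
```

Measuring a baseline like this makes it easy to tell which of the tips below actually moves the needle for your model and hardware.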

This post is a collection of tips that have helped me speed up model inference by more than 80%. While I'll be talking about models trained with Google Cloud's AutoML, these tips apply to all TensorFlow models, regardless of where a model is trained.

So without any further delay, let’s get started!

Tip 1: Balancing the size-vs-accuracy tradeoff of your model

This might seem like a no-brainer, but it's surprising how long I had been ignoring it.

When training your model on Google Cloud, you have the option to select the high-accuracy variant (largest in size; highest latency), the low-accuracy variant (smallest in size; lowest latency), or the variant with the best tradeoff between the two.

While your first instinct might be to go with the highest-accuracy model, it comes with trade-offs: for instance, in the image above, you can see that the low-accuracy model is more than 6 times faster than the high-accuracy one.
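To quantify this tradeoff on your own hardware, you can time the exported variants side by side using the same approach as the earlier timing sketch. This is a sketch under the assumption that both variants have been exported as TensorFlow Lite files; the file names below are placeholders.

```python
import time
import numpy as np
import tensorflow as tf

def average_latency(model_path, runs=20):
    """Return the mean per-image inference time (seconds) for a TFLite model."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])

    # Warm-up invocation before timing.
    interpreter.set_tensor(input_details[0]["index"], dummy)
    interpreter.invoke()

    start = time.perf_counter()
    for _ in range(runs):
        interpreter.set_tensor(input_details[0]["index"], dummy)
        interpreter.invoke()
    return (time.perf_counter() - start) / runs

# Placeholder file names -- substitute your own exported variants.
for name in ["high_accuracy.tflite", "low_latency.tflite"]:
    print(f"{name}: {average_latency(name) * 1000:.1f} ms per image")
```

Running this on the target device (rather than your development machine) gives you the numbers that actually matter for deciding which variant to ship.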

It's also possible that there's no significant difference in accuracy among these variants in the first place! For example, here's a screenshot showing the accuracy I got by training the same model for all three variants: