Google announced new tooling for its TensorFlow Lite deep-learning framework that reduces model size and inference latency. The tool converts a trained model's weights from floating-point representation to 8-bit signed integers. This shrinks the model's memory footprint and allows it to run on hardware that lacks floating-point accelerators, without sacrificing model quality.

In a recent blog post, the TensorFlow team described the approach, which uses integer quantization to convert the model weights to 8-bit integers. The toolkit also includes tensor operations that use only integer arithmetic, which provide the best runtime performance on hardware without floating-point acceleration. This produces models that are "up to 2–4x faster...and 4x smaller" than the baseline floating-point models, with less than 1% reduction in prediction accuracy.

Floating-point numbers are the most accurate way to represent weights and activations in a neural network, but mathematical operations on these numbers require specialized hardware to achieve high throughput. Many mobile and embedded devices lack this hardware, and so take longer to compute a result from a neural network using floating-point arithmetic. Integer quantization is a technique for converting 4-byte or 8-byte floating-point numbers into single-byte integers. These integers not only require less memory; integer arithmetic also runs quickly on any general-purpose CPU. The loss of precision does affect the neural network's accuracy, but research indicates that the loss is generally small enough to be acceptable, especially weighed against the large reductions in memory footprint and execution time.
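The core idea can be sketched in a few lines of pure Python. The following is an illustrative affine (scale and zero-point) quantization of a list of floating-point weights onto signed 8-bit integers, followed by dequantization back to approximate floats; TensorFlow Lite's actual implementation differs in detail (per-axis scales, rounding modes, and so on), and the function names here are invented for the example.

```python
def quantize(values, num_bits=8):
    """Map a list of floats onto signed 8-bit integers (illustrative sketch)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -128..127
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)      # float units per integer step
    zero_point = round(qmin - lo / scale)  # the integer that encodes 0.0
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-0.7, 0.0, 0.31, 1.2]
q, scale, zp = quantize(weights)
approx = dequantize(q, scale, zp)
```

Each quantized weight differs from its original by less than one quantization step (`scale`), which is the precision loss referred to above.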

Because a single signed byte can only represent numbers from -128 to 127, converting activations from floating-point to integers requires a calibration step to determine scaling parameters. This is done by running several examples of input data through the floating-point model, which gives an estimate of the maximum and minimum values of the activations; these extremes are mapped to 127 and -128 respectively to calculate the scaling parameters. Before this release, TensorFlow Lite used a "hybrid-quantization" scheme that converted only the model weights to 8 bits. Because floating-point computations were still required to compute the network activations, the calibration step was not needed; however, inference latency was higher due to the floating-point computations.
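The calibration step described above can be sketched as follows: feed representative inputs through the floating-point model, record the observed activation extremes, and map that range onto [-128, 127]. This is a minimal illustration, not TensorFlow Lite code; `toy_model` is a hypothetical stand-in for a real network's activation function.

```python
def toy_model(x):
    # Hypothetical placeholder for a real network's activation.
    return 2.0 * x - 0.5

def calibrate(model, sample_inputs):
    """Estimate scale and zero point from observed activation extremes."""
    observed = [model(x) for x in sample_inputs]
    lo, hi = min(observed), max(observed)
    scale = (hi - lo) / 255.0              # 255 integer steps between -128 and 127
    zero_point = round(-128 - lo / scale)  # integer that encodes an activation of 0.0
    return scale, zero_point

scale, zp = calibrate(toy_model, [0.0, 0.25, 0.5, 0.75, 1.0])
```

Here the observed activations span [-0.5, 1.5], so `lo` maps to -128 and `hi` to 127; activations outside the calibrated range at inference time are simply clipped to those extremes.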

TensorFlow also includes quantization-aware training as part of its "contrib" library, which is "not officially supported, and may change or be removed at any time without notice." This method outputs a model that is already optimized for integer arithmetic; however, post-training quantization "is much simpler to use, and offers comparable accuracy on most models." Separately, the developer site for Arm devices notes that "it is not currently possible to deploy 8-bit quantized TensorFlow models via CoreML on iOS."

Facebook's PyTorch, another major deep-learning framework, has a similar tool called QNNPACK, released last year. According to the blog post announcing that release, "QNNPACK-based Caffe2 operators are approximately 2x faster than TensorFlow Lite on a variety of phones."

The quantization tools are included in the latest release of TensorFlow, which is available on GitHub.