Quantization in TF-Lite

Floating-point vs Fixed-point

First, a quick primer on floating/fixed-point representation. Floating point uses a mantissa and an exponent to represent real values — and both can vary. The exponent allows for representing a wide range of numbers, and the mantissa gives the precision. The decimal point can “float”, i.e. appear anywhere relative to the digits.
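To make this concrete, Python's `math.frexp` splits a float into exactly these two parts:

```python
import math

# Every float decomposes as r = mantissa * 2**exponent,
# with the mantissa normalized to [0.5, 1).
m, e = math.frexp(6.25)
print(m, e)        # 0.78125 3
print(m * 2 ** e)  # 6.25 -- reconstructed exactly
```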

If we replace the exponent by a fixed scaling factor, we can use integers to represent the value of a number relative to (i.e. as an integer multiple of) this constant. The decimal point’s position is now “fixed” by the scaling factor. Going back to the number line example, the value of the scaling factor determines the smallest distance between 2 ticks on the line, and the number of such ticks is decided by how many bits we use to represent the integer (for 8-bit fixed point, 256 or 2^8). We can use these to trade off between range and precision. Any value that is not an exact multiple of the constant will get rounded to the nearest point.
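As a toy illustration (not TF-Lite code; the scaling factor here is arbitrary), 8-bit fixed-point representation and its rounding behaviour might look like this:

```python
scale = 0.05  # the fixed scaling factor: distance between two "ticks"

def to_fixed(r):
    """Round a real value to the nearest of the 256 representable ticks."""
    return max(0, min(255, round(r / scale)))

def to_real(q):
    """Recover the (approximate) real value from the integer tick index."""
    return q * scale

print(to_real(to_fixed(1.234)))  # 1.25 -- rounded to the nearest multiple of 0.05
```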

Quantization Scheme

Unlike floating-point, there is no universal standard for fixed-point numbers; the format is instead domain-specific. Our quantization scheme (the mapping between real & quantized numbers) requires the following:

1. It should be linear (or affine).

If it isn’t, then the results of fixed-point calculations won’t directly map back to real numbers.

2. It should always represent 0.f accurately.

If we quantize and dequantize any real value, only 256 (or generally, 2^B) of them will return the exact same number, while all others will suffer some precision loss. If we ensure that 0.f is one of these 256 values, it turns out that DNNs can be quantized more accurately. The authors claim that this improves accuracy because 0 has a special significance in DNNs (such as padding). Besides, having 0 map to a value that’s higher or lower than zero would introduce a bias in the quantization scheme.

So our quantization scheme will simply be a shifting and scaling of the real number line to a quantized number line. For a given set of real values, we want the minimum/maximum real values in this range [rmin,rmax] to map to the minimum/maximum integer values [0,2^B-1] respectively, with everything in between linearly distributed.

This gives us a pretty simple linear equation:

r = S * (q - z)

Here,

- r is the real value (usually float32)
- q is its quantized representation as a B-bit integer (uint8, uint32, etc.)
- S (float32) and z (uint) are the factors by which we scale and shift the number line. z is the quantized ‘zero-point’, which will always map back exactly to 0.f.

From this point on, we’ll assume quantized variables are represented as uint8, except where mentioned. Alternatively, we could also use int8, which would just shift the zero-point, z.
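Here’s a minimal NumPy sketch of this scheme, assuming uint8 and deriving S and z from a given [rmin, rmax] as described above (the helper names are my own, not TF-Lite’s API):

```python
import numpy as np

def compute_quant_params(rmin, rmax, num_bits=8):
    """Derive scale S and zero-point z mapping [rmin, rmax] to [0, 2^B - 1]."""
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)  # the range must include 0.f
    S = (rmax - rmin) / (2 ** num_bits - 1)
    z = int(round(-rmin / S))  # integer zero-point, so 0.f quantizes exactly
    return S, z

def quantize(r, S, z):
    """r -> q via q = r/S + z, clamped to uint8."""
    return np.clip(np.round(r / S + z), 0, 255).astype(np.uint8)

def dequantize(q, S, z):
    """q -> r via r = S * (q - z)."""
    return S * (q.astype(np.int32) - z)

S, z = compute_quant_params(-1.0, 1.0)
x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
print(dequantize(quantize(x, S, z), S, z))  # approximately x; 0.0 is exact
```

Because z is itself an integer on the quantized line, 0.f always round-trips exactly, satisfying requirement 2 above.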

The set of numbers quantized with the same parameters consists of values we expect to lie in the same range, such as the weights of a given layer or the activation outputs at a given node. We’ll see later how TensorFlow’s fake quantization nodes find the actual ranges for these quantities. First, let’s put this together to see how these quantized layers fit in a network.

A typical quantized layer

Let’s look at the components of a conventional layer implemented in floating-point:

- Zero or more weight tensors, which are constant, and stored in float.
- One or more input tensors; again, stored in float.
- The forward pass function which operates on the weights and inputs, using floating-point arithmetic, storing the output in float.
- Output tensors, again in float.

Because the weights of a pre-trained network are constant, we can convert and store them in quantized form beforehand, with their exact ranges known to us.
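Since all the weight values are known, the scale and zero-point can be computed directly from their exact min/max. A sketch of this offline step (the variable names are hypothetical):

```python
import numpy as np

w = np.random.randn(3, 3).astype(np.float32)  # stand-in for a pre-trained kernel
S_w = (w.max() - w.min()) / 255.0             # scale from the exact weight range
z_w = int(round(-w.min() / S_w))              # zero-point
w_q = np.clip(np.round(w / S_w + z_w), 0, 255).astype(np.uint8)
# w_q plus (S_w, z_w) is all we need to keep -- the float weights can be dropped.
```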

The input to a layer, and equivalently the output of a preceding layer, are also quantized with their own separate parameters. But wait: to quantize a set of numbers, don’t we need to know their range (and thus their actual values) in float first? Then what’s the point of quantized computation? The answer lies in the fact that a layer’s output generally lies in a bounded range for most inputs, with only a few outliers.

While we would ideally want to know the exact range of values to quantize them accurately, outputs for unseen inputs can still be expected to lie within similar bounds. Luckily, we’re already computing the output in float during another stage: training. Thus, we can find the average output range over a large number of inputs during training and use this as a proxy for the output quantization parameters. When running on an actual unseen input, an outlier will get clipped if our range is too small, or lose precision to coarser rounding if the range is too wide. But hopefully there will only be a few of these.
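One simple way to collect such a proxy range is to smooth the observed min/max over many training batches, e.g. with an exponential moving average (a sketch of the idea, not TensorFlow’s actual implementation):

```python
import numpy as np

class RangeTracker:
    """Keep a smoothed estimate of a tensor's [min, max] across batches."""
    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.rmin = self.rmax = None

    def update(self, batch):
        bmin, bmax = float(batch.min()), float(batch.max())
        if self.rmin is None:   # first batch initializes the range
            self.rmin, self.rmax = bmin, bmax
        else:                   # later batches nudge it via a moving average
            m = self.momentum
            self.rmin = m * self.rmin + (1 - m) * bmin
            self.rmax = m * self.rmax + (1 - m) * bmax

tracker = RangeTracker()
for _ in range(100):
    tracker.update(np.random.randn(32, 64))  # pretend activation batches
print(tracker.rmin, tracker.rmax)            # proxy [rmin, rmax] for this output
```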

What’s left is the main function that computes the output of the layer. Changing this to a quantized version requires more than simply changing float to int everywhere, as the results of our integer computations can overflow.

So we’ll have to store results in larger integers (say, int32) and then requantize them to the 8-bit output. This is not a concern in conventional full-precision implementations, where all variables are in float and the hardware handles all the nitty-gritty of floating-point arithmetic. Additionally, we’ll also have to change some of the layers’ logic. For example, ReLU should now compare values against Quantized(0) instead of 0.f.
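Sketching both changes below; note that real implementations perform the rescaling itself in fixed-point arithmetic, and S_acc here stands in for the accumulator’s effective scale (e.g. S_weights * S_inputs for a matmul), so the numbers are purely illustrative:

```python
import numpy as np

def quantized_relu(q, z):
    """ReLU on quantized values: clamp against the zero-point z, not against 0."""
    return np.maximum(q, z)

def requantize(acc, S_acc, S_out, z_out):
    """Scale int32 accumulators back down to the layer's uint8 output range."""
    q = np.round(acc.astype(np.float64) * (S_acc / S_out)) + z_out
    return np.clip(q, 0, 255).astype(np.uint8)

acc = np.array([-500, 0, 120], dtype=np.int32)  # made-up int32 accumulator values
out = requantize(acc, S_acc=0.5 * 0.02, S_out=0.1, z_out=128)
print(quantized_relu(out, z=128))  # [128 128 140]: negatives clamp to the zero-point
```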

The figure below puts it all together.