Model Selection

The best way to end up with a smaller, more efficient model is to start with one. The graph above plots the rough size (in megabytes) of various model architectures. I’ve overlaid lines denoting the typical size of mobile applications (code and assets included), as well as the amount of SRAM that might be available in an embedded device.

The logarithmic scale on the Y-axis softens the visual blow, but the unfortunate truth is that the majority of model architectures are orders of magnitude too large for deployment anywhere but the larger corners of a datacenter.

Incredibly, the smaller architectures to the right don’t perform much worse than the large ones to the left. An architecture like VGG-16 (300–500MB) performs about as well as a MobileNet (20MB) model, despite being nearly 25X smaller.

What makes smaller architectures like MobileNet and SqueezeNet so efficient? Based on experiments by Iandola et al (SqueezeNet), Howard et al (MobileNetV3), and Chen et al (DeepLab V3), some answers lie in the macro- and micro-architectures of models.

Macro-architecture refers the types of layers used by a model and how they are arranged into modules and blocks. To produce efficient macro-architectures:

Keep activation maps large by downsampling later or using atrous (dilated) convolutions

Use more channels, but fewer layers

Use skip connections and residual connections to improve accuracy and re-use parameters during calculation

Replace standard convolutions with depthwise separable ones

A model’s micro-architecture is defined by choices related to individual layers. Best practices include:

Making input and output blocks as efficient as possible, as they are typically 15–25% of a model’s computation cost

Decreasing the size of convolution kernels

Adding a width multiplier to control the number of channels per convolution with a hyperparameter, alpha

Arranging layers so that parameters can be fused (e.g. bias and batch normalization)

Model Training

After a model architecture has been selected, there’s still a lot that can be done to shrink it and make it more efficient during training. In case it wasn’t already obvious, most neural networks are over-parameterized. Many trained weights have little impact on overall accuracy and can be removed. Frankle et al find that in many networks, 80–90% of network weights can be removed — along with most of the precision in those weights — with little loss in accuracy.

There are three main strategies for finding and removing these parameters: knowledge distillation, pruning, and quantization. They can be applied together or separately.

Knowledge Distillation

Knowledge distillation uses a larger “teacher” model to train a smaller “student” model. First conceived by Hinton et al in 2015, the keys to this technique are two loss terms: one for the hard predictions of the student model and a second based on the ability of the student to produce the same distribution of scores across all output classes.

Polino et al were able to achieve a 46X reduction in size for ResNet models trained on CIFAR10 with only 10% loss in accuracy, and a 2X reduction in size on ImageNet with only a 2% loss in accuracy. More recently, Jiao et al distilled BERT to create TinyBERT: 7.5X smaller, 9.4X faster, and only 3% less accurate. There are a few great open source libraries with implementations of distillation frameworks including Distiller and Distil* for transformers.

Pruning

The second technique to shrink models is pruning. Pruning involves assessing the importance of weights in a model and removing those that contribute the least to overall model accuracy. Pruning can be done at multiple scales in a network. The smallest models are achieved by pruning at the individual weight level. Weights with small magnitudes are set to zero. When models are compressed or stored in a sparse format, these zeros are very efficient to store.

Han et al use this approach to shrink common computer vision architectures by 9–13X with negligible changes in accuracy. Unfortunately, a lack of support for fast sparse matrix operations means that weight-level pruning doesn’t also increase runtime speeds.

To create models that are both smaller and faster, pruning needs to be done at filter or layer levels—for example, removing the filters of a convolution layer that contribute least to overall prediction accuracy. Models pruned at the filter level aren’t quite as small but are typically faster. Li et al were able to reduce the size and runtime of a VGG model by 34% with no loss in accuracy using this technique.

Finally, it’s worth noting that Liu et al have shown mixed results as to whether it’s better to start from a larger model and prune or train a smaller model from scratch.

Quantization

After a model has been trained, it needs to be prepared for deployment. Here, too, there are techniques to squeeze even more optimizations out of a model. Typically, the weights of a models are stored as 32-bit floating point numbers, but for most applications, this is far more precision than necessary. We can save space and (sometimes) time by quantizing these weights, again with minimal impact on accuracy.

Quantization maps each floating point weight to a fixed precision integer containing fewer bits than the original. While there are a number of quantization techniques, the two most important factors are the bit depth of the final model and whether weights are quantized during or after training (quantization-aware training and post-training quantization, respectively).

Finally, it’s important to quantize both weights and activations to speed up model runtime. Activation functions are mathematical operations that will naturally produce floating point numbers. If these functions aren’t modified to produce quantized outputs, models can even run slower due to the necessary conversion.

In a fantastic review, Krishnamoorthi tests a number of quantization schemes and configurations to provide a set of best practices:

Results: