3. Optimizing the downsampling

For a fixed number of layers and a fixed number of pooling operations, a neural network can behave quite differently. This is because both the representation of the data and the computational load depend on where the pooling operations are placed:

When the pooling operation is done early, the dimensionality of the data is reduced right away. Fewer dimensions mean faster processing through the rest of the network, but also less information, and thus poorer accuracy.

When the pooling operation is done late in the network, most of the information is preserved, giving high accuracy. However, this also means that the computations are performed on high-dimensional objects, and are more computationally expensive.

Evenly spacing the downsampling throughout the network is an empirically effective compromise, offering a good balance between accuracy and speed. Again, we are looking for a kind of saturation point.

Early pooling is fast, late pooling is accurate, evenly spaced pooling is a bit of both.
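To make this concrete, here is a minimal PyTorch sketch of the three options. The layer and channel sizes are made up purely for illustration; the only thing that differs between the three networks is where the two pooling operations sit.

```python
import torch.nn as nn

def conv(c_in, c_out):
    """A 3x3 convolution followed by a ReLU, preserving spatial size."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())

# Pool early: the later convolutions see small feature maps (fast, lossy).
early = nn.Sequential(conv(3, 16), nn.MaxPool2d(2), nn.MaxPool2d(2),
                      conv(16, 32), conv(32, 64), conv(64, 64))

# Pool late: every convolution runs at full resolution (accurate, slow).
late = nn.Sequential(conv(3, 16), conv(16, 32), conv(32, 64), conv(64, 64),
                     nn.MaxPool2d(2), nn.MaxPool2d(2))

# Evenly spaced: a balance between the two extremes.
even = nn.Sequential(conv(3, 16), conv(16, 32), nn.MaxPool2d(2),
                     conv(32, 64), nn.MaxPool2d(2), conv(64, 64))
```

On a 224×224 input, the convolutions after both poolings in `early` process roughly 16× fewer pixels than their counterparts in `late`, which is exactly the speed/accuracy trade-off described above.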

4. Pruning the weights

In a trained neural network, some weights strongly contribute to the activation of a neuron, while others barely influence the result. Nonetheless, we still do some computation for these weak weights.

Pruning is the process of completely removing the connections with the smallest magnitude so that we can skip the calculations. This lowers the accuracy but makes the network both lighter and faster. We need to find the saturation point so that we remove as many connections as possible, without harming the accuracy too much.

The weakest connections are removed to save computing time and space.
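As a rough sketch in plain NumPy (with a made-up weight matrix), magnitude pruning boils down to thresholding the absolute values:

```python
import numpy as np

def prune(weights, fraction=0.5):
    """Zero out the `fraction` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), fraction)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
w_pruned, mask = prune(w, fraction=0.9)   # keep only the strongest 10%
print(f"{mask.mean():.0%} of connections remain")
```

In a real deployment, the mask would be paired with a sparse storage format and sparse kernels, so the zeroed multiplications are actually skipped rather than merely computed as zeros.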

5. Quantizing the weights

To save the network on disk, we need to record the value of every single weight in the network. This means saving one floating point number per parameter, which adds up to a lot of disk space. For reference, a float in C occupies 4 bytes, i.e. 32 bits. A network with parameters in the hundreds of millions (such as GoogLeNet, or VGG-16) can easily take hundreds of megabytes, which is unacceptable on a mobile device.
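The arithmetic is easy to check. VGG-16, for instance, has roughly 138 million parameters:

```python
params = 138_000_000         # VGG-16 has roughly 138M parameters
size_mb = params * 4 / 1e6   # 4 bytes (32 bits) per float
print(f"~{size_mb:.0f} MB")  # ~552 MB: far too big for a phone
```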

To keep the network's footprint as small as possible, one approach is to lower the resolution of the weights by quantizing them. In this process, we change the representation of each number so that it can no longer take arbitrary values, but is instead constrained to a small subset of shared values. This lets us store the quantized values only once, and then reference them for the weights of the network.

Quantizing the weights stores keys instead of floats.

Again, we’ll determine how many values to use by finding the saturation point. More values mean better accuracy, but also a larger representation. For example, with 256 quantized values, each weight can be referenced using only 1 byte, i.e. 8 bits. Compared to the original 32 bits, we have divided the size by 4!
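Here is a minimal sketch of the idea using uniform quantization in NumPy. (Real pipelines often learn the shared values with k-means instead, but the storage trick is the same: a small codebook of floats plus a 1-byte index per weight.)

```python
import numpy as np

def quantize(weights, n_levels=256):
    """Map each float32 weight to one of `n_levels` shared values."""
    lo, hi = weights.min(), weights.max()
    codebook = np.linspace(lo, hi, n_levels, dtype=np.float32)
    # Each weight is stored as a 1-byte index into the codebook.
    indices = np.round((weights - lo) / (hi - lo) * (n_levels - 1))
    return codebook, indices.astype(np.uint8)

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
codebook, idx = quantize(w)
reconstructed = codebook[idx]   # dequantize for inference
print(w.nbytes // idx.nbytes)   # 4x smaller, as expected
```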

6. Encoding the representation of the model

We’ve already fiddled with the weights a lot, but we can improve the network even more! This trick relies on the fact that the weights are not evenly distributed. Once quantized, we don’t have the same number of weights bearing each quantized value. This means that some references will come up more often than others in our model representation, and we can take advantage of that!

Huffman coding is the perfect tool for this problem. It assigns the keys with the smallest footprint to the most frequent values, and the keys with the largest footprint to the least frequent ones. This reduces the size of the model on the device, and the best part is that it is lossless: there’s no loss in accuracy.

The most frequent symbol only uses 1 bit of space while the least frequent uses 3 bits. This is balanced by the fact that the latter rarely appears in the representation.

This simple trick allows us to shrink the space taken by the neural network even further, usually by around 30%.
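For illustration, here is a small sketch (requiring Python 3.9+ for the dict-merge operator) that computes Huffman code lengths for a toy distribution of quantized values. The frequencies are invented, but note that the average comes out to 1.4 bits per weight versus 2 bits for a fixed-length code over 4 symbols, i.e. the ~30% saving mentioned above.

```python
import heapq
from collections import Counter

def huffman_lengths(symbols):
    """Return the code length (in bits) per symbol; frequent symbols get shorter codes."""
    freq = Counter(symbols)
    # Heap entries: (frequency, unique tiebreak, {symbol: bit_length}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: length + 1 for s, length in (a | b).items()}
        heapq.heappush(heap, (fa + fb, i, merged))
        i += 1
    return heap[0][2]

# Quantized weights as symbols; value 0 dominates, e.g. after pruning.
symbols = [0] * 70 + [1] * 20 + [2] * 7 + [3] * 3
print(huffman_lengths(symbols))
# {3: 3, 2: 3, 1: 2, 0: 1} -- '0' gets a 1-bit code, rare values get 3 bits
```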

Note: The quantization and encoding can be different for each layer in the network, giving more flexibility.

Correcting the accuracy loss

With the tricks we’ve used, we’ve been pretty rough on our neural network. We’ve removed weak connections (pruning) and even changed some weights (quantization). While this makes the network super light and blazing fast, the accuracy isn’t what it used to be.

To fix that, we retrain the network iteratively: after each pruning or quantization step, we train the network again so that the remaining weights can adapt to the change, and we repeat the process until the weights stop changing too much. A sketch of the prune-and-retrain loop is shown below.
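Here is a minimal sketch of this loop in PyTorch, on a toy model and toy data; `magnitude_masks` and `finetune` are illustrative helpers, not library functions.

```python
import torch
import torch.nn as nn

def magnitude_masks(model, fraction):
    """One keep-mask per parameter: False for the weakest weights."""
    return [p.abs() >= p.abs().flatten().quantile(fraction)
            for p in model.parameters()]

def finetune(model, masks, x, y, steps=200):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
        with torch.no_grad():   # keep pruned weights pinned at zero
            for p, m in zip(model.parameters(), masks):
                p *= m

torch.manual_seed(0)
x, y = torch.randn(256, 16), torch.randn(256, 1)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
for fraction in (0.0, 0.3, 0.6, 0.9):   # prune a little more each round...
    masks = magnitude_masks(model, fraction)
    finetune(model, masks, x, y)         # ...then retrain so the net adapts
```

Pruning gradually, with a retraining pass in between, gives the surviving weights a chance to compensate for the removed connections at every step.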

Conclusion

Although smartphones don’t have the disk space, computing power, or battery life of good old-fashioned desktop computers, they are still very good targets for deep learning applications. With a handful of tricks, and at the cost of a few percentage points of accuracy, it’s now possible to run powerful neural networks on these versatile handheld devices. This opens the door to thousands of exciting applications.

If you’re still curious, take a look at some of the best mobile-oriented neural networks, like SqueezeNet or MobileNets.
