Deploying memory-hungry deep learning algorithms is a challenge for anyone who wants to create a scalable service. Cloud services are expensive in the long run. Deploying models offline on edge devices is cheaper, and has other benefits as well. The only disadvantage is that edge devices offer limited memory and compute power.

This blog explores a few techniques that can be used to fit neural networks in memory-constrained settings. Different techniques are used for the “training” and “inference” stages, and hence they are discussed separately.

Training

Certain applications require online learning. That is, the model improves based on feedback or additional data. Deploying such applications on the edge places a tangible resource constraint on your model. Here are 4 ways you can reduce the memory consumption of such models.

1. Gradient Checkpointing

Frameworks such as TensorFlow consume a lot of memory for training. During a forward pass, the value at every node in the graph is evaluated and saved in memory. This is required for calculating the gradient during backprop.

Value of every node is saved after a forward pass to calculate the gradient in a single backward pass. (Source)

Normally this is okay, but when models get deeper and more complex, memory consumption increases drastically. A neat way to sidestep this is to recompute node values when needed, instead of saving them in memory.

Recomputing the node values to calculate the gradient. Note that we need to do several partial forward passes to complete a single backward pass. (Source)

However, as shown above, the computational cost increases significantly. A good trade-off is to save only some nodes in memory while recomputing the others when needed. These saved nodes are called checkpoints. This drastically reduces deep neural network memory consumption. This is illustrated below:

The second node from the left is a checkpoint node. It reduces memory consumption while providing a reasonable time penalty. (Source)
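
Here is a minimal sketch of gradient checkpointing using PyTorch's built-in torch.utils.checkpoint utilities (the framework, model, and segment count are illustrative assumptions, not part of the original sources):

```python
# Minimal sketch: gradient checkpointing with torch.utils.checkpoint.
# Only the activations at segment boundaries ("checkpoints") are kept;
# everything in between is recomputed during the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

x = torch.randn(32, 1024, requires_grad=True)

# Split the network into 2 segments; activation memory drops roughly with the
# number of segments, at the cost of extra partial forward passes.
out = checkpoint_sequential(model, 2, x)
loss = out.sum()
loss.backward()   # triggers recomputation of the non-checkpointed activations
```

Roughly speaking, checkpointing every sqrt(n)-th layer of an n-layer network cuts activation memory from O(n) to O(sqrt(n)) while adding about one extra forward pass of compute.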

2. Trade speed for memory (Recomputation)

Extending the above idea, we can recompute certain operations to save memory. A good example of this is the Memory-Efficient DenseNet implementation.

A dense block in a DenseNet. (Source)

DenseNets are very parameter efficient, but are also memory inefficient. The paradox arises because of the nature of the concatenation and batchnorm operations.

To make convolution efficient on the GPU, the values must be placed contiguously. Hence, after concatenation, cuDNN arranges the values contiguously on the GPU. This involves a lot of redundant memory allocation. Similarly, batchnorm involves excess memory allocation, as explained in this paper. Both operations contribute to a quadratic growth in memory. DenseNets have a large number of concatenations and batchnorms, and hence they are memory inefficient.

Comparing naive concat and batchnorm operations to their memory efficient counterparts. (Source)

A neat solution to the above involves two key observations.

Firstly, the concatenation and batchnorm operations are not time intensive, so we can simply recompute their values when needed instead of allocating redundant memory to store them. Secondly, instead of allocating “new” memory space for the output, we can write the output to a “shared memory space”.

We can overwrite this shared space to store the output of other concatenation operations, and recompute the concatenation when it is needed for gradient calculation. The same approach extends to the batchnorm operation. This simple trick saves a lot of GPU memory in exchange for a slight increase in compute time.
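
A rough sketch of this recompute-instead-of-store idea, written in PyTorch and loosely following the memory-efficient DenseNet implementation (the layer sizes and module structure are my own illustrative assumptions):

```python
# Sketch: a DenseNet-style layer that does NOT keep the concatenated /
# batchnorm'd intermediates for backprop. They are cheap to recompute,
# so we wrap them in torch.utils.checkpoint and redo them on the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class DenseLayer(nn.Module):
    def __init__(self, in_channels, growth_rate, efficient=True):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1)
        self.efficient = efficient

    def bottleneck(self, *features):
        # concat + batchnorm: large intermediate tensors, cheap to recompute
        x = torch.cat(features, dim=1)
        return self.conv(self.relu(self.norm(x)))

    def forward(self, features):
        if self.efficient and any(f.requires_grad for f in features):
            # intermediates are discarded after the forward pass and
            # recomputed only when gradients are needed
            return checkpoint(self.bottleneck, *features, use_reentrant=False)
        return self.bottleneck(*features)

# features from earlier layers stay as a list; concatenation happens lazily
layer = DenseLayer(in_channels=64, growth_rate=32)
feats = [torch.randn(8, 32, 28, 28, requires_grad=True),
         torch.randn(8, 32, 28, 28, requires_grad=True)]
out = layer(feats)       # shape (8, 32, 28, 28)
out.sum().backward()     # concat + batchnorm are recomputed here
```

This captures the same time-for-memory trade-off as the shared-storage approach described above, with far less manual bookkeeping.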

3. Reduce Precision

In an excellent blog, Pete Warden explains how neural networks can be trained with 8-bit values. A number of issues arise because of the reduction in precision, some of which are listed below:

As stated in this paper, “activations, gradients, and parameters” have very different ranges, so a single fixed-point representation would not be ideal. The paper claims that a “dynamic fixed point” representation would be an excellent fit for low precision neural networks.

As stated in Pete Warden’s other blog, lower precision implies a greater deviation from the exact value. Normally, if the errors are totally random, they have a good chance of canceling each other out. However, zeros are used extensively for padding, dropout, and ReLU. If zero cannot be represented exactly in the lower-precision format, the error it introduces is systematic rather than random, and may bias the overall performance.
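
The sources above use 8-bit and dynamic fixed point formats. As a more readily available illustration of the same trade-off, here is a sketch of 16-bit mixed-precision training with PyTorch's AMP utilities (this is a stand-in technique I chose, not the scheme from the blog or paper, and it assumes a CUDA device):

```python
# Sketch: reduced-precision training via 16-bit mixed precision (PyTorch AMP).
# Note: this is fp16 with loss scaling, not 8-bit dynamic fixed point.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss so tiny fp16 gradients don't underflow

x = torch.randn(64, 512, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # forward pass runs in fp16 where it is safe
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()            # backward on the scaled loss
    scaler.step(optimizer)                   # unscales gradients, then takes the optimizer step
    scaler.update()
```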

4. Architecture Engineering for Neural Networks

Architecture engineering involves designing the neural network structure that best balances accuracy, memory, and speed. There are several ways in which convolutions can be optimized for both memory and speed.

Factorize NxN convolutions into a combination of Nx1 and 1xN convolutions. This conserves a lot of space while also boosting computational speed. This and several other optimization tricks were used in newer versions of the Inception network. For a more detailed discussion, check out this blog post.

Use Depthwise Separable convolutions as in MobileNet and Xception Net. For an elaborate discussion on the types of convolutions, check out this blog post.

Use 1x1 convolutions as a bottleneck to reduce the number of incoming channels. This technique is used in several popular neural networks.
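
A small sketch of the three tricks above in PyTorch (the channel sizes are arbitrary choices for illustration), comparing parameter counts:

```python
# Sketch: three convolution tricks that reduce parameters (and memory/compute).
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# 1) Factorize a 3x3 conv into a 3x1 followed by a 1x3 conv
standard_3x3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
factorized = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=(3, 1), padding=(1, 0)),
    nn.Conv2d(256, 256, kernel_size=(1, 3), padding=(0, 1)),
)

# 2) Depthwise separable conv: per-channel spatial conv + 1x1 pointwise conv
depthwise_separable = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256),  # depthwise
    nn.Conv2d(256, 256, kernel_size=1),                         # pointwise
)

# 3) 1x1 bottleneck: shrink channels before the expensive 3x3 conv
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),               # reduce channels
    nn.Conv2d(64, 256, kernel_size=3, padding=1),    # cheaper 3x3 on fewer channels
)

print(n_params(standard_3x3), n_params(factorized),
      n_params(depthwise_separable), n_params(bottleneck))
# the standard 3x3 conv has ~590K parameters; each alternative uses fewer
```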

Illustration of Google’s AutoML. (Source)

An interesting solution is to let the machine decide the best architecture for a particular problem. Neural Architecture Search uses machine learning to find the best neural network architecture for a given classification problem. When applied to ImageNet, the resulting network (NASNet) was among the best performing models created so far. Google’s AutoML works on the same principle.