Computational cost of neural networks

We want a network that can run at, say, 15 fps so that live video looks passably smooth for the user. The problem is, we don’t know ahead of time how big that network can be, and I don’t want to go through a lengthy trial-and-error process. It can take a long time to build networks, convert them, compile them into an app, and benchmark them. Instead, it’d be great if we had a way of predicting runtime from the structure of the network itself.

There are two things that determine neural network runtime: memory access and processor cycles. Loading input data and model parameters into RAM or VRAM isn’t free, but it only needs to be done once. For larger models, processor time dominates the runtime, so we’ll focus on that first.

Under the hood, neural networks are just a series of matrix (tensor) multiplications. These multiplications are built out of floating point operations (FLOPs), simple operations like adding and multiplying numbers, which computers have been performing…forever. Neural networks can require hundreds of millions, sometimes billions, of FLOPs to complete each forward pass. To estimate the execution time of a prediction, then, we just need to figure out how many FLOPs it’s going to take, then measure how runtime changes with computation cost.

Counting FLOPs isn’t especially hard, but it’s a bit tedious and requires some basic knowledge of linear algebra. TensorFlow provides some functions to help calculate the number of FLOPs required by each layer, but they aren’t easy to work with. Instead, I ended up re-implementing the calculations myself so that I didn’t have to work with the TensorFlow graph itself. Let’s look at a simple fully connected Dense layer with no activation function as an example:

dense_output = dot(input, kernel) + bias

Assuming that our inputs are two-dimensional matrices, the dot product amounts to the matrix multiplication you probably saw at some point in math class. If we multiply an N x K matrix with a K x M matrix, the result is N x M. Each element of the result requires K sums and K products to compute:

result[i][j] = sum over k of input[i][k] * kernel[k][j]

In other words, there are N x M elements that cost 2K floating point operations each, for a total of N x M x 2K. Adding the bias term to each element of the result is another N x M operations.
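To make that concrete, here’s a tiny Python sketch of that bookkeeping (the dense_flops helper is just illustrative, using the N, K, and M from above):

```python
def dense_flops(n, k, m):
    """FLOPs for dot(input, kernel) + bias with an N x K input and a K x M kernel."""
    matmul_flops = n * m * 2 * k  # N x M output elements, 2K operations each
    bias_flops = n * m            # one addition per output element
    return matmul_flops + bias_flops

# e.g. a single input row with 1024 features feeding 512 units:
print(dense_flops(1, 1024, 512))  # 1,049,088 FLOPs
```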

We can follow the same procedure to calculate the number of FLOPs required by each layer type. The total FLOPs required for a forward pass of a network is just the sum over the individual layers. For our style transfer network (which takes 256px x 256px images as input), this comes out to 12.9 GFLOPs.
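Here’s a rough sketch of how that summation might look for a Keras model; the layer_flops helper and the toy model below are illustrative, not my actual script:

```python
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten

def layer_flops(layer):
    """Approximate FLOPs for the two layer types in this toy example."""
    if isinstance(layer, Dense):
        k = layer.input_shape[-1]
        return 2 * k * layer.units + layer.units       # matmul plus bias
    if isinstance(layer, Conv2D):
        _, out_h, out_w, out_c = layer.output_shape
        kh, kw = layer.kernel_size
        in_c = layer.input_shape[-1]
        return out_h * out_w * out_c * (2 * kh * kw * in_c + 1)
    return 0  # treat reshaping layers like Flatten as free

model = Sequential([
    Conv2D(32, (3, 3), padding='same', input_shape=(256, 256, 3)),
    Flatten(),
    Dense(10),
])

total_flops = sum(layer_flops(layer) for layer in model.layers)
print('%.3f GFLOPs' % (total_flops / 1e9))
```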

Estimating Runtime from FLOPs

Now that we can compute the FLOP cost of a network, let’s generate some networks of different costs and benchmark them. I used Keras and its automatic input shape inference to generate 200 random neural networks, all taking a 500px x 500px image as input. I converted these networks to Core ML using coremltools and benchmarked each one using the functions above.
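I won’t reproduce the whole generation script, but it looked roughly like this; the layer choices and the converter call below are illustrative and depend on your Keras and coremltools versions:

```python
import random
from keras.models import Sequential
from keras.layers import Conv2D
import coremltools

def random_network():
    """Build a random all-convolutional network on a 500x500 RGB input."""
    model = Sequential()
    model.add(Conv2D(random.choice([8, 16, 32]), (3, 3), padding='same',
                     activation='relu', input_shape=(500, 500, 3)))
    for _ in range(random.randint(1, 7)):
        model.add(Conv2D(random.choice([8, 16, 32, 64]), (3, 3),
                         padding='same', activation='relu'))
    return model

model = random_network()
mlmodel = coremltools.converters.keras.convert(model)
mlmodel.save('random_net.mlmodel')
```

Here is what the data looks like.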

Pretty neat!

I just fit a simple linear regression (the red line) to the data. Above 10 GFLOPs the estimates get a bit less accurate, which I think is related to memory access. In a future post, I’ll explore how specific layer types contribute to runtime beyond raw FLOP counts and see if things improve.
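Fitting that line is a one-liner with NumPy; the arrays below are placeholders standing in for the real (GFLOPs, runtime) measurements:

```python
import numpy as np

# Placeholder data: FLOP counts (in GFLOPs) and measured runtimes (in ms).
gflops = np.array([0.3, 1.5, 3.0, 6.0, 9.0, 12.9])
runtime_ms = np.array([24.0, 48.0, 80.0, 132.0, 205.0, 360.0])

slope, intercept = np.polyfit(gflops, runtime_ms, 1)
print('%.1f ms per GFLOP, %.1f ms overhead' % (slope, intercept))

def predict_runtime(gflops_cost):
    """Estimate runtime in ms from a network's FLOP count."""
    return slope * gflops_cost + intercept
```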

For some more context, you can see the number of FLOPs required for a whole bunch of standard networks here. VGG16 tips the scales at 16 GFLOPs (~342ms runtime) while InceptionV3 is lighter at 6 GFLOPs (~132ms) and SqueezeNet lighter still at 0.3 GFLOPs (~24ms).

My artistic style transfer model requires 12.9 GFLOPs and was benchmarked at 360ms, 28% higher than the estimated 280ms, but still reasonably close. It looks like we’ll need to get it down to about 2.5 GFLOPs to use it with live video. That’s for another post.
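Assuming runtime keeps scaling roughly linearly with FLOPs, that 2.5 GFLOPs target is just back-of-the-envelope arithmetic:

```python
frame_budget_ms = 1000.0 / 15          # ~67 ms per frame for 15 fps video
ms_per_gflop = 360.0 / 12.9            # observed cost of the current model
print(frame_budget_ms / ms_per_gflop)  # ~2.4, so roughly 2.5 GFLOPs
```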