As for the network structure, standard benchmarks like ImageNet and beautiful visualisations like the two above make our job a lot easier! For speed, MobileNet-v2 and its Depthwise-Separable convolution blocks should be the default option. MobileNets also have the added advantage that they’re well-designed to run on CPU and can actually reach real-time speeds in some cases! Other networks that appear to perform similarly to MobileNets on GPU usually fall way short on CPU. That’s a big advantage because CPU compute is significantly cheaper ($) than GPU.
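If you’re curious what makes those blocks so cheap, here’s a minimal sketch of a depthwise-separable convolution block in Keras. The filter count and stride are placeholders, and the real MobileNet-v2 block wraps this idea in an inverted residual with a linear bottleneck, so treat it as the basic building block rather than the full v2 design:

```python
from tensorflow.keras import layers

def depthwise_separable_block(x, filters, stride=1):
    # Depthwise 3x3 convolution: one filter per input channel (cheap spatial filtering)
    x = layers.DepthwiseConv2D(kernel_size=3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)  # ReLU6, as used in the MobileNet family
    # Pointwise 1x1 convolution: mixes information across channels
    x = layers.Conv2D(filters, kernel_size=1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(max_value=6.0)(x)
```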

The world title for most accurate network is all tied up: SENet and NASNet are fractions of a percent apart in accuracy. If you are just going to be doing transfer learning, go with NASNet, since several deep learning libraries, Keras included, ship built-in implementations. When building from scratch however, SENet and its Squeeze-Excitation blocks will be significantly easier to code up in practice. For either a quick prototype or something with a balance between speed and accuracy, a regular ResNet and its Residual Blocks will do just fine.
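To show how little code a Squeeze-Excitation block actually takes, here’s a rough Keras sketch; the reduction ratio of 16 is the paper’s default, and everything else is purely illustrative:

```python
from tensorflow.keras import layers

def squeeze_excitation_block(x, reduction=16):
    channels = x.shape[-1]
    # Squeeze: global average pool down to one number per channel
    se = layers.GlobalAveragePooling2D()(x)
    # Excitation: two small dense layers produce per-channel importance weights
    se = layers.Dense(channels // reduction, activation="relu")(se)
    se = layers.Dense(channels, activation="sigmoid")(se)
    se = layers.Reshape((1, 1, channels))(se)
    # Re-scale the original feature maps channel-wise
    return layers.Multiply()([x, se])
```

And if you do go the transfer-learning route, `tf.keras.applications.NASNetLarge(weights="imagenet", include_top=False)` gives you the pre-trained backbone in a single line.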

The last thing you might want to play around with is the image resolution. Smaller will be way faster. Bigger usually gets you higher accuracy, at the cost of quadratically increasing memory consumption and run-time.

Lessons Learned

MobileNet-v2 / Depthwise-Separable convolutions and low resolution for speed

SENet / Squeeze-Excitation or NASNet and high resolution for accuracy

A regular ResNet / Residual Blocks for a balance

Data Preprocessing and Augmentation

An often forgotten yet critical thing to consider is data preparation, preprocessing, and augmentation. You don’t always have to do this. Before doing any processing on your data, you should first assess whether your application would actually benefit from it.

For example, in image classification the standard protocol is to mean-normalise the images based on the mean of the training data. It’s been proven many times in the research literature that mean-normalisation is a good default thing to do.
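In practice that can be as simple as computing channel-wise statistics over the training set and applying them everywhere. A minimal NumPy sketch, with random data standing in for a real training set:

```python
import numpy as np

def mean_normalise(images, mean, std):
    """Apply pre-computed per-channel statistics to a batch of images."""
    return (images - mean) / std

# Random data standing in for a real (N, H, W, 3) training set
train_images = np.random.rand(100, 224, 224, 3).astype("float32")

mean = train_images.mean(axis=(0, 1, 2))  # per-channel mean, training data only
std = train_images.std(axis=(0, 1, 2))    # per-channel std, training data only

train_images = mean_normalise(train_images, mean, std)
# The same mean/std would then be applied to validation and test images
```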

On the other hand, if you are doing image enhancement, mean-normalisation can actually hurt the network and its results quite a bit. Really, any task that hinges on very fine-grained differences in things like texture, color, or appearance, rather than high-level shape and semantic differences, would probably benefit from skipping mean-normalisation.

Data augmentation, on the other hand, has been shown time and again to consistently increase network performance, both in terms of absolute accuracy and generalisation. It does so across the whole range of tasks, from high-level classification to low-level enhancement.

That being said, you should still consider which augmentations to apply. For example, if you are doing image segmentation for self-driving cars, you’re not really expecting the cars to be driving upside down! Thus, you might apply horizontal flipping but avoid vertical. Data augmentation is most appropriately used when you are actually training your final network, or just quickly seeing how much augmentation would help. Before that, you’re just experimenting and prototyping and so there’s no need to make your training time longer by having more data.
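Here’s what that choice looks like as a quick Keras sketch: horizontal flips on, vertical flips off, with the other parameter values purely illustrative, and `train_images`, `train_labels`, and `model` assumed to exist already:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentations chosen to match what the car will actually see:
# horizontal flips and small shifts are realistic, vertical flips are not
datagen = ImageDataGenerator(
    horizontal_flip=True,
    vertical_flip=False,
    width_shift_range=0.1,
    height_shift_range=0.1,
    rotation_range=5,
)

# Assumed: train_images, train_labels, and model are already defined
# model.fit(datagen.flow(train_images, train_labels, batch_size=32), epochs=10)
```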

Lessons Learned

Preprocess only when needed, based on your task and using proven research as a guide

Augmentation almost always increases accuracy, just make sure the augmentations you use reflect the data you realistically expect to see in your application

Regularisation

Regularisation can be used whenever you feel that you are overfitting to your training data and performing poorly on the test set. You can tell you are overfitting when the gap between your training and testing accuracies is quite large, with training accuracy much better than test.

There are several options to choose from: dropout, spatial dropout, cutout, L1, L2, adding Gaussian noise… and many more in the sea of research papers! Practically, dropout is the easiest to use, since you usually only have to put it in a couple of places and tune a single parameter. You can start by placing it just prior to the last couple of dense layers in your network. If you feel like you’re still overfitting, you can add more earlier in the network or play around with the dropout probability. This should close the gap between your training and testing accuracies.
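In Keras that placement looks something like this; the layer sizes and the 0.5 drop rate are just starting points to tune, not recommendations:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # ... your convolutional feature extractor would go here ...
    layers.Flatten(input_shape=(7, 7, 512)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),   # randomly drop 50% of activations during training only
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
```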

If regular dropout fails, you can play around with the others. With things like L1 and L2, you have more tuning options and so might be able to calibrate them to do a better regularisation job than dropout. In the vast majority of cases you won’t need to combine regularisation techniques, i.e. try to use only one throughout your network.
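That extra tuning freedom comes from the fact that, in Keras at least, L1 and L2 are applied per layer; the 1e-4 strength below is just a common starting value:

```python
from tensorflow.keras import layers, regularizers

# L2 weight decay on a convolutional layer; the strength can be tuned per layer
conv = layers.Conv2D(
    64, kernel_size=3, padding="same", activation="relu",
    kernel_regularizer=regularizers.l2(1e-4),
)

# L1 on a dense layer instead encourages sparse weights
dense = layers.Dense(
    128, activation="relu",
    kernel_regularizer=regularizers.l1(1e-4),
)
```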

Lessons Learned

Use dropout by default for practicality and ease of use

If dropout fails, explore some of the others which can be customised like L1 / L2

If all techniques fail, you may have a mismatch between your training and testing data

Training

When you finally want to train your network, there are several optimization algorithms to choose from. Many people say that SGD gets you the best results with regard to accuracy, which in my experience is true. However, tuning the learning rate schedule and parameters can be challenging and tedious. On the other hand, using an adaptive learning-rate optimizer such as Adam, Adagrad, or Adadelta is quick and easy, but you might not reach the optimal accuracy of SGD.

The best thing to do here is to follow the same “style” as the activation functions: go with the easy ones first to see if your design works well, then tune and optimize using something more complex. I would recommend starting off with Adam, as in my experience it’s super easy to use: just set a learning rate that’s not absurdly high (0.0001 is a common default) and you’ll usually get some very good results! Later on you can train with SGD from scratch, or even start with Adam and then fine-tune with SGD. In fact, this paper found that switching from Adam to SGD mid-training achieves the best accuracy in the easiest way! Check out the figure below from the paper:
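If you want to try that Adam-then-SGD recipe yourself, here’s a rough Keras sketch; the learning rates, epoch counts, and the switch-over point are all things you’d tune for your own problem, and `model` plus the data arrays are assumed to exist already:

```python
from tensorflow.keras.optimizers import Adam, SGD

# Assumed: model, train_images/train_labels, val_images/val_labels already defined

# Phase 1: make quick progress with Adam and a modest learning rate
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_images, train_labels,
          validation_data=(val_images, val_labels), epochs=20)

# Phase 2: re-compile with SGD + momentum (the weights are kept) and fine-tune
model.compile(optimizer=SGD(learning_rate=1e-3, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_images, train_labels,
          validation_data=(val_images, val_labels), epochs=20)
```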

As for your data, more is nearly always better. The only real question is where you start to hit diminishing returns. For example, if you’re already at 95% accuracy and you estimate that doubling your training data could get you to 96%, it might be time to consider whether you really need that extra 1%, given that the effort and resources needed to get there are quite high. And of course, be sure that the data you are collecting is reflective of what you will see in a real application for your task. Otherwise, no matter what algorithm you use, poorly selected data simply isn’t going to cut it.