It is easier than ever before to train a neural network. However, you can rarely take code from a tutorial and make it work directly for your application. Interestingly, many of the most important tweaks are barely discussed in the academic literature, yet they are critical to making your product work.

Applying deep learning to real-world problems can be messy (source: pinsdaddy.com).

Therefore I thought it would be helpful for other people who plan to use deep learning in their business to understand some of these tweaks and tricks.

In this blog post I want to share three key learnings that helped us at Merantix when applying deep learning to real-world problems:

Learning I: the value of pre-training

Learning II: caveats of real-world label distributions

Learning III: understanding black box models

A little disclaimer:

This is not a complete list and there are many other important tweaks.

Most of these learnings apply not only to deep learning but also to other machine learning algorithms.

All the learnings are industry-agnostic.

Most of the ideas in the post refer to supervised learning problems.

This post is based on a talk I gave on May 10 at the Berlin.AI meetup (the slides are here).

Learning I: the value of pre-training

In the academic world of machine learning, there is little focus on obtaining datasets. Quite the opposite: to compare deep learning techniques with other approaches and show that one method outperforms the others, the standard procedure is to measure performance on a standard dataset with the same evaluation procedure. In real-world scenarios, however, it is less about showing that your new algorithm squeezes out an extra 1% in performance compared to another method. Instead, it is about building a robust system that solves the required task with sufficient accuracy. As with all machine learning systems, this requires labeled training data from which the algorithm can learn.

For many real-world problems it is unfortunately rather expensive to get well-labeled training data. To elaborate on this issue, let’s consider two cases:

Medical vision: if we want to build a system that detects lymph nodes in the human body in Computed Tomography (CT) images, we need annotated images in which the lymph nodes are labeled. This is a rather time-consuming task, as the images are 3D and very small structures have to be recognized. Assuming that a radiologist earns $100 per hour and can carefully annotate 4 images per hour, we incur costs of $25 per image, or $250k for 10,000 labeled images. Considering that we need several physicians to label the same image to ensure close to 100% diagnostic correctness, acquiring a dataset for this medical task would easily exceed that $250k.

Credit scoring: if we want to build a system that makes credit decisions, we need to know who is likely to default so that we can train a machine learning system to recognize them beforehand. Unfortunately, you only know for sure that somebody defaults when it happens. A naive strategy would thus be to give loans of, say, $10k to everyone. However, this means that every person who defaults will cost us $10k. This puts a very expensive price tag on each labeled data point.
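The labeling-cost arithmetic in the medical vision example can be written down as a small helper. This is only a sketch: the function name and the choice of three annotators per image are assumptions for illustration, while the hourly rate, throughput and image count come from the example.

```python
def labeling_cost(hourly_rate, images_per_hour, num_images, annotators_per_image=1):
    """Total cost when every image is labeled by `annotators_per_image` people."""
    cost_per_image = hourly_rate / images_per_hour
    return cost_per_image * num_images * annotators_per_image

# One radiologist per image, numbers from the example.
single = labeling_cost(hourly_rate=100, images_per_hour=4, num_images=10_000)

# Assumption: three physicians label each image independently.
triple = labeling_cost(hourly_rate=100, images_per_hour=4, num_images=10_000,
                       annotators_per_image=3)

print(single)  # 250000.0
print(triple)  # 750000.0
```

The cost scales linearly with the number of annotators per image, which is why requiring several physicians per image pushes the total well beyond the single-annotator figure.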

Obviously there are tricks to lower these costs, but the overall message is that labeled data for real-world problems can be expensive to obtain.

How can we overcome this problem?

Pre-training

Pre-training helps (source: massivejoes.com).

The basic idea of pre-training is that we first train a neural network (or another machine learning algorithm) on a cheap and large dataset in a related domain, or on noisy data in the same domain. Even though this will not directly solve the original problem, it will give the neural network a rough idea of what your prediction problem looks like. In a second step, the parameters of the neural network are then further optimized on a much smaller, more expensive dataset for the problem you are actually trying to solve. This two-step procedure is depicted in the figure below.

When training data is hard to get: first pre-train the neural network on a large and cheap dataset in a related domain; secondly, fine-tune it on an expensive well-labeled dataset. This will result in a performance boost compared to just training on the small dataset.
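To make the two-step procedure concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption chosen for illustration: a one-dimensional logistic regression stands in for the neural network, synthetic data with 10% flipped labels stands in for the cheap dataset, and the learning rates and epoch counts are arbitrary. The point is only that the second call to `train` starts from the weights produced by the first.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(weights, data, lr, epochs):
    """Run plain SGD on the log loss, starting from the given (w, b)."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the logit
            w -= lr * err * x
            b -= lr * err
    return w, b

def accuracy(weights, data):
    w, b = weights
    return sum((w * x + b >= 0) == (y == 1) for x, y in data) / len(data)

def make_data(rng, n, label_noise):
    """Toy task: y = 1 if x > 0.2; a fraction `label_noise` of labels is flipped."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 1 if x > 0.2 else 0
        if rng.random() < label_noise:
            y = 1 - y
        data.append((x, y))
    return data

rng = random.Random(0)
cheap_large = make_data(rng, 5000, label_noise=0.1)    # stand-in for cheap, noisy data
expensive_small = make_data(rng, 20, label_noise=0.0)  # stand-in for well-labeled data
test_set = make_data(rng, 1000, label_noise=0.0)

# Step 1: pre-train from scratch on the large, cheap dataset.
pretrained = train((0.0, 0.0), cheap_large, lr=0.1, epochs=3)
# Step 2: fine-tune, starting from the pre-trained weights, on the small dataset.
finetuned = train(pretrained, expensive_small, lr=0.05, epochs=20)

print(accuracy(pretrained, test_set), accuracy(finetuned, test_set))
```

With a real neural network the structure is the same: the fine-tuning run is initialized with the pre-trained parameters instead of random ones.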

When fine-tuning, the number of classes might change: people often pre-train a neural network on a dataset like ImageNet with 1000 classes and then fine-tune it on their specific problem, which likely has a different number of classes. This means the last layer needs to be re-initialized. The learning rate is then often set a bit higher for the last layer, as it needs to be learned from scratch, whereas the earlier layers are trained with a lower learning rate. For some datasets like ImageNet, the learned features (the last fully connected layer) are so generic that they can be taken off the shelf and used directly for other computer vision problems.
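The re-initialized last layer and the per-layer learning rates can be sketched in plain Python. This is a toy illustration, not a framework recipe: the layer names `features` and `head`, the tiny layer sizes, the initialization scale of 0.01 and the 10x multiplier are all assumptions. In practice a deep learning framework would handle this, for example through per-parameter-group learning rates in PyTorch optimizers.

```python
import random

def reinit_head(pretrained, num_new_classes, rng):
    """Copy all pre-trained layers, but replace the last layer with a fresh random one."""
    layers = dict(pretrained)
    feature_dim = len(pretrained["head"][0])
    layers["head"] = [[rng.gauss(0.0, 0.01) for _ in range(feature_dim)]
                      for _ in range(num_new_classes)]
    return layers

def layer_learning_rates(layers, base_lr, head_multiplier):
    """Low rate for pre-trained layers, higher rate for the new last layer."""
    return {name: base_lr * head_multiplier if name == "head" else base_lr
            for name in layers}

def sgd_step(layers, grads, lrs):
    """One SGD update in which every layer uses its own learning rate."""
    return {name: [[w - lrs[name] * g for w, g in zip(w_row, g_row)]
                   for w_row, g_row in zip(layers[name], grads[name])]
            for name in layers}

rng = random.Random(0)
# Hypothetical pre-trained model: 4 features from 3 inputs, 1000 output classes.
pretrained = {
    "features": [[0.5] * 3 for _ in range(4)],
    "head": [[0.1] * 4 for _ in range(1000)],
}
model = reinit_head(pretrained, num_new_classes=5, rng=rng)
lrs = layer_learning_rates(model, base_lr=0.001, head_multiplier=10)

# Dummy gradients of 1.0 everywhere, only to make the step sizes visible.
grads = {name: [[1.0] * len(row) for row in rows] for name, rows in model.items()}
updated = sgd_step(model, grads, lrs)
```

After one step with identical gradients, the weights in `head` have moved ten times as far as those in `features`, which is the intended effect of the higher learning rate on the freshly initialized layer.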

How do we obtain data for pre-training?

Sources of data for pre-training

Pre-trained models: there are lots of trained models on the web. The first place to look is the so-called model zoos. These are websites that contain collections of trained models from academics, companies and deep learning enthusiasts. See here, here, or here.

Public datasets: there are many datasets available on the web. So don’t waste time collecting a dataset yourself; rather, check whether there is already something out there that might help solve the particular problem you’re working on. See here, here, or here.

Data crawling: if there is neither a public pre-trained model nor a public dataset, there might be a cheeky way to generate a dataset without labeling it by hand. You can build a so-called crawler, which automatically collects data from specific websites. This way you create a new dataset.

Sources of data for pre-training.

Weakly labeled data

As we fine-tune on precisely labeled data, it is possible to pre-train on so-called weakly labeled data. By this we mean data whose labels are not correct in all cases (e.g. 90% of the labels might be correct and 10% wrong). The advantage is that this kind of data can often be obtained automatically, without any human involved in the labeling. This makes it relatively cheap compared to data where a human needs to label every single image. To give an example: during my PhD, I crawled a dataset of 500k face images from Wikipedia and IMDb. We combined the date of birth given in a person's profile with any hint in the photo caption about when the photo was taken. This way we could assign an approximate age to each image. Note that in some cases the year in the caption below the image might have been wrong, or the photo might have shown several people and the face detector selected the wrong face. Thus we cannot guarantee that the age label is correct in all cases. Nonetheless, we showed that pre-training on this weakly labeled dataset improved performance compared to training only on a smaller, precisely labeled dataset.
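The age-labeling idea can be sketched as follows. This is a simplified illustration, not the original pipeline: the record fields, the file names, the use of years instead of full dates, and the plausibility range of 0 to 100 years are all assumptions.

```python
def approximate_age(birth_year, photo_year, min_age=0, max_age=100):
    """Return photo_year - birth_year, or None if a value is missing or implausible."""
    if birth_year is None or photo_year is None:
        return None
    age = photo_year - birth_year
    return age if min_age <= age <= max_age else None

# Hypothetical crawled records.
records = [
    {"image": "a.jpg", "birth_year": 1960, "photo_year": 2005},
    {"image": "b.jpg", "birth_year": 1985, "photo_year": None},  # no year in the caption
    {"image": "c.jpg", "birth_year": 1990, "photo_year": 1975},  # caption year likely wrong
]

weak_labels = {r["image"]: approximate_age(r["birth_year"], r["photo_year"])
               for r in records}
kept = {image: age for image, age in weak_labels.items() if age is not None}
print(kept)  # {'a.jpg': 45}
```

A filter like this only removes obviously broken labels. Errors such as a wrong but plausible caption year, or the wrong face in a group photo, remain in the data, which is exactly why the labels are called weak.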

A similar logic can be applied to the medical vision problem where it is required to have several doctors independently label the same image in order to be close to 100% sure that the labeling is correct. This is the dataset for fine-tuning. Additionally, one can collect a larger dataset with weak labels which was annotated by just one person. Thereby, we can reduce the total cost for labeling and still make sure that the neural network has been trained on a diverse set of images.
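When several doctors label the same image, their annotations have to be combined into one label. The sketch below shows one possible way to do that; the agreement threshold, the label names and the scan identifiers are assumptions for illustration and not a description of an actual annotation protocol.

```python
from collections import Counter

def consensus_label(labels, min_agreement=1.0):
    """Return the most common label if enough annotators agree, otherwise None."""
    top_label, votes = Counter(labels).most_common(1)[0]
    return top_label if votes / len(labels) >= min_agreement else None

# Hypothetical annotations from three doctors per scan.
annotations = {
    "scan_001": ["node", "node", "node"],
    "scan_002": ["node", "no_node", "node"],
}

# Strict: keep a label only if all annotators agree (candidate fine-tuning set).
strict = {scan: consensus_label(labels) for scan, labels in annotations.items()}
# Relaxed: accept a simple majority.
majority = {scan: consensus_label(labels, min_agreement=0.5)
            for scan, labels in annotations.items()}

print(strict)    # {'scan_001': 'node', 'scan_002': None}
print(majority)  # {'scan_001': 'node', 'scan_002': 'node'}
```

Images without full agreement do not have to be thrown away: they could still serve as weakly labeled data for pre-training.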

In summary, increasing performance doesn’t necessarily require human annotations, which are often expensive; you might be able to get a labeled dataset for free or at substantially lower cost.

Learning II: caveats of real-world label distributions

Real-world distributions (source: r4risk.com.au).

Now that we have obtained data for both pre-training and fine-tuning, we can move on and start training our neural networks. Here comes another big difference between academia and the real world.

In academia, datasets are mostly balanced. That means that for supervised classification problems there are usually equally many samples per class. Below you find two examples: MNIST is a well-known dataset of handwritten digits containing approximately equally many samples of each digit. Food 101 is another example of an academic dataset; it contains exactly 1000 images for each of its 101 food categories.
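A quick way to see how balanced a dataset is: count the samples per class and compare the largest class with the smallest. The sketch below uses made-up label lists; the class names and the 990-to-10 split are hypothetical and only serve to contrast a balanced dataset with a skewed one.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class counts and the ratio of the largest to the smallest class."""
    counts = Counter(labels)
    return counts, max(counts.values()) / min(counts.values())

# Balanced, like the 1000 images per category in Food 101.
balanced = ["pizza"] * 1000 + ["sushi"] * 1000
# Hypothetical skewed dataset for comparison.
skewed = ["healthy"] * 990 + ["disease"] * 10

print(class_balance(balanced)[1])  # 1.0
print(class_balance(skewed)[1])    # 99.0
```

A ratio close to 1 indicates a balanced dataset in the academic sense described above.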