In supervised learning, input data or training examples come with labels, and the goal of learning is to predict the labels of new, unseen examples. Labeling data is expensive and error-prone, and data quality issues can lead to "garbage in, garbage out" in machine learning.

For example, retinal images are used to develop automated diagnostic systems for conditions such as diabetic retinopathy, age-related macular degeneration, and retinopathy of prematurity. Building such systems requires images annotated with these conditions at the structural level; the same holds for CT images. Annotation is time consuming because it requires identifying very small structures, and an expert typically needs hours to carefully annotate a single image, which makes a decent-sized labeled dataset very expensive. Moreover, several experts must label the same image to ensure the diagnosis is correct, so acquiring a dataset for a given medical task costs several times what a single annotation does. The problem is even harder in traditional enterprise settings because of data sparsity, data quality issues, and the scarcity of domain experts.

There are many techniques for addressing this cost and scalability problem, and in this series we will talk about:

1. Pre-training models / transfer learning

2. Weak supervision

3. Active learning

Pre-training/transfer learning:

The idea behind pre-training is to train a neural network on a cheap, large dataset from a related domain, or on noisy data from the same domain. This addresses the cold-start problem by bootstrapping the network with a rough sense of the data; after this first pass the accuracy may not yet be high. The network's parameters are then further optimized on a much smaller, more expensive dataset specific to the target problem. Using a pre-trained network generally makes sense when the tasks or datasets have something in common.
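As a rough illustration, here is a minimal PyTorch sketch of this pattern (assuming torchvision 0.13 or later for the `weights=` API): load weights pre-trained on a large generic dataset and continue optimizing them on the small domain dataset. The ResNet backbone and the 5-class head are our illustrative choices, not something prescribed by the approach.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Network pre-trained on a large, cheap dataset (ImageNet).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the classification head to match a hypothetical 5-class
# domain task (e.g. severity grades in retinal images).
model.fc = nn.Linear(model.fc.in_features, 5)

# Continue optimizing all parameters on the small domain dataset.
# A low learning rate nudges the pre-trained weights rather than erasing them.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def fine_tune(loader, epochs=5):
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```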

A common recipe is to use a CNN as a feature extractor: the last fully connected layer is removed, and the rest of the CNN serves as a fixed feature extractor for the new dataset. A new final layer is then trained on the new dataset, and the weights can be further fine-tuned by continuing the back-propagation.
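The fixed-feature-extractor variant can be sketched the same way (again with an assumed torchvision ResNet): freeze the pre-trained layers and train only the new final layer.

```python
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers so the CNN acts as a fixed feature extractor.
for param in backbone.parameters():
    param.requires_grad = False

# A fresh final layer (trainable by default) sized for the new dataset;
# the class count is a placeholder.
backbone.fc = nn.Linear(backbone.fc.in_features, 5)

# Back-propagation now updates only the new layer's weights.
optimizer = torch.optim.SGD(backbone.fc.parameters(), lr=1e-2, momentum=0.9)
```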

This approach to transfer learning works very well and has produced well-documented results in computer vision. It can also be adapted to other domains and other kinds of data, such as sensor data, business process data, and language data. We are currently working on general language-modeling tasks combined with noisy, domain-driven labeled data for Q&A for support engineers in the data center domain.

In the next blog we will present an approach that takes advantage of existing annotations when the data are similar but the label sets are different. The approach is based on label embeddings, which reduce the setting to a standard domain adaptation problem.

Weak supervision:

Weak supervision means programmatically generating training data using heuristics, rules of thumb, existing databases, ontologies, etc. It is also known as distant supervision or self-supervision.
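Here is a toy sketch of what "programmatically generating training data" can look like: a couple of hand-written heuristics and a mock ontology assign noisy labels to unlabeled text, abstaining when no rule fires. The keywords and label names are illustrative, not from any real system.

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

# Mock ontology of known findings (a stand-in for an existing database).
KNOWN_FINDINGS = {"hemorrhage", "exudate", "microaneurysm"}

def lf_known_finding(report: str) -> int:
    """Rule of thumb: a known finding term suggests a positive label."""
    text = report.lower()
    return POSITIVE if any(term in text for term in KNOWN_FINDINGS) else ABSTAIN

def lf_explicitly_normal(report: str) -> int:
    """Heuristic: an explicit 'no abnormality' suggests a negative label."""
    return NEGATIVE if "no abnormality" in report.lower() else ABSTAIN

def weak_label(report: str) -> int:
    """Combine heuristics by majority vote over the rules that fired."""
    votes = [lf(report) for lf in (lf_known_finding, lf_explicitly_normal)]
    votes = [v for v in votes if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN
```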

The idea of weak supervision for information extraction is not new. Craven and Kumlien (1999) introduced it by matching the Yeast Protein Database (YPD) to the abstracts of papers in PubMed and training a naive-Bayes extractor. Hoffmann et al. (2010) describe a system that dynamically generates lexicons to handle sparse data, learning over 5,000 Infobox relations with an average F1 score of 61%. Yao et al. (2010) perform weak supervision while using selectional preference constraints to jointly reason about entity types. Another notable mention is the NELL system (Never-Ending Language Learner), which, instead of learning a probabilistic model, bootstraps a set of extraction patterns using semi-supervised methods for multi-task learning.

The Snorkel system is noteworthy here and has been gaining a lot of traction. As part of the DAWN project, Snorkel enables users to train models without hand-labeling any training data. Users write labeling functions that encode arbitrary heuristics, and Snorkel denoises their outputs without access to ground truth; it is the first end-to-end implementation of the recently proposed machine learning paradigm, data programming.
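To make this concrete, here is a minimal sketch in the spirit of Snorkel's labeling-function API (assuming snorkel 0.9 or later; the example data and rules are placeholders): labeling functions vote on each example, and the label model combines and denoises their outputs into probabilistic labels without any ground truth.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEG, POS = -1, 0, 1

@labeling_function()
def lf_mentions_hemorrhage(x):
    # Heuristic: a hemorrhage mention suggests a positive case.
    return POS if "hemorrhage" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_says_normal(x):
    # Heuristic: 'normal' suggests a negative case.
    return NEG if "normal" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Retinal hemorrhage observed in the left eye.",
    "Fundus appears normal.",
    "Macular edema suspected.",  # neither rule fires: both LFs abstain
]})

# Apply the labeling functions to get a label matrix (one column per LF).
applier = PandasLFApplier(lfs=[lf_mentions_hemorrhage, lf_says_normal])
L_train = applier.apply(df=df_train)

# Fit the generative label model with no ground-truth labels; its
# probabilistic labels can then train a downstream discriminative model.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)
probs = label_model.predict_proba(L_train)
```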