One mantra we chant frequently is “trust the data”. In the contexts where we use this expression it is usually wise advice: for example, when asked to provide the facility to hand-adjust the rules of a robustly tested machine-learnt model so that it better jibes with intuition; or when tempted to cherry-pick the fields and features which one assumes (be it through years of domain experience or otherwise) encode the relevant information.

This doesn't mean that the data is always right of course.

Weeding out certain kinds of systematic error from the data is certainly essential: a common example is a wholesale difference between records created before and after a database migration or the introduction of a new process.

For the purpose of this article though I’d like to focus on the (I think) much more intriguing case of “random” error.

A Story From The Front Line

Recently a client was using the ForecastThis platform to build a content classification model. The data consisted of millions of records of web content, each accompanied by one of a large number of fine-grained categorizations assigned by one or more expert human annotators. The client wanted to use this “ground truth” data to train a model which could annotate future such content automatically. It was imperative that the model could do this with an equivalent degree of accuracy.

Due to i) the quality and richness of the text content, ii) the strength and variety of the Natural Language Processing algorithms on hand, and iii) the relatively clear and intuitive nature of the target categories, we were a little surprised at just how poor the performance of the model recommended by our automated data scientist was in this particular case.

Our suspicions were aroused by the fact that the system had opted in the end for a surprisingly simple machine learning algorithm: a Nearest Centroid approach, which effectively just considers the average and standard deviation of the observations associated with each category in the data.

No decision trees, no neural networks, no boosting! Is this what the state of the art looks like?

Sanity Check

Wondering whether something somewhere was broken, we decided to perform a sanity check.

We manually annotated a small handful of the test cases using our best judgement, in order to compare against the client’s “ground truth” test data. The purpose was to establish some kind of upper bound on the expected performance: after all, this was a task which was supposedly exemplified by human judgement; if our own (admittedly non-expert) human judgements disagreed substantially with the ground truth data, and/or with each other, then perhaps it was too much to expect anything approaching perfect performance from a set of algorithms, however sophisticated.
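Inter-annotator agreement of this kind can be quantified with Cohen’s kappa, which corrects raw agreement for chance. A minimal sketch in Python (the annotations and category names below are invented for illustration, not the client’s data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled at random,
    # each with their own observed category frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of the same ten items by two annotators:
ours   = ["news", "sport", "news", "tech", "sport",
          "news", "tech", "news", "sport", "tech"]
theirs = ["news", "tech", "news", "tech", "news",
          "sport", "tech", "news", "tech", "sport"]
print(round(cohens_kappa(ours, theirs), 2))
```

A kappa near 1 indicates near-perfect agreement, near 0 indicates agreement no better than chance; values in between (as here) are the warning sign described below.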

Indeed, what we found was that the agreement was very low: in fact roughly in line with the estimated performance of our best model, which it was now clear was making a pretty heroic effort to make sense of an ill-defined problem.

We could have stopped there and concluded that this was probably the best performance the client could expect from a machine learning (or any) approach given the nature of their data.

However — almost out of curiosity — we compared our own human annotated judgements with the output of the model.

There was almost perfect agreement!

Trust The Algorithms

What had happened here?

It was not that the underlying problem was hard. Rather it was that the labels in the client’s data (the human-supplied “ground truths” which our system was trying to model) were bad… very bad indeed. It was essentially as if somebody had gone through and replaced half of the labels with categories drawn from a hat (there are various anecdotal examples to draw upon here, about sex toys being classified as transport and so on, but I need to keep this on track).

This suddenly explained precisely why our platform had opted for such a simple algorithm: many other algorithms would have attempted to some extent (and failed to the same extent) to fit rules to the “noise” (i.e. to map sex toys to transport according to the wisdom of one example, only to be thwarted by the fact that they are labelled as garden tools elsewhere). By scarcely even trying to model these inconsistencies the Nearest Centroid algorithm was able to see straight through them (by simply averaging all the observations for each category, random factors more-or-less cancel themselves out).
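The noise-cancelling behaviour is easy to demonstrate on synthetic data (this is an illustrative toy, not the client’s data or our platform’s implementation): a bare-bones Nearest Centroid classifier, trained on labels of which 40% have been drawn from a hat, still recovers the underlying categories, because the randomly mislabelled points pull each centroid towards the global mean by roughly the same amount.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian clusters: the "true" categories.
n = 500
X = np.vstack([rng.normal(-2, 1, (n, 2)), rng.normal(+2, 1, (n, 2))])
y_true = np.array([0] * n + [1] * n)

# Corrupt 40% of the training labels with labels drawn from a hat.
y_noisy = y_true.copy()
flip = rng.random(2 * n) < 0.4
y_noisy[flip] = rng.integers(0, 2, flip.sum())

# Nearest Centroid: one mean vector per (noisy) class.
centroids = np.array([X[y_noisy == c].mean(axis=0) for c in (0, 1)])

# Classify fresh, clean test points by distance to the nearest centroid.
X_test = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(+2, 1, (200, 2))])
y_test = np.array([0] * 200 + [1] * 200)
pred = np.argmin(((X_test[:, None, :] - centroids) ** 2).sum(axis=2), axis=1)
accuracy = (pred == y_test).mean()
print(f"accuracy despite 40% label noise: {accuracy:.2f}")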

Most critically, the thoroughness of the search and cross-validation methods employed — even though working from inherently bad data — ensured that this algorithm would quickly rise to the top of the heap!

The upshot is that the model that our system found in the first instance was actually markedly better than the original data (to the extent that it could reasonably be used to clean — i.e. to remove noise from — that data).

This triumph wasn't immediately apparent to us because of our flawed assumption that the data was good: all of our post-hoc evaluations of the “goodness” of the model were based on the corrupt “ground truth” data, and on the impossible (and undesirable) requirement of reproducing this data verbatim!

Lessons Learned

Never assume that your data is correct. By all means know that it is, but don’t assume it. Don’t trust the client on this count; to do so might be to do them a disservice.

Regardless of how noisy and inconsistent the bulk of your data may be, make sure that you have a small sample of “sanity-check” data that is exceptionally good, or at least whose limitations are very well understood. If the test data is solid, any problems with the training data, even if rife, may prove inconsequential. Without solid test data, you will never know.

Do not force (even by mild assumption) the use of sophisticated algorithms and complex models if the data does not support them. Sometimes much simpler is much better. The problem of overfitting (building unnecessarily complex models which serve only to reproduce idiosyncrasies of the training data) is well documented, but the extent of this problem is still capable of causing surprise!

Let the algorithms, in collaboration with the data, speak for themselves. It follows that the algorithm and parameters that provide the best solution (when very rigorously cross-validated) can actually indicate the quality of your data, or the true complexity of the process that it embodies. If a thorough comparison of all the available algorithms suggests Nearest Centroid, Naive Bayes, or a Decision Stump, it is a good indication that the dominant signal in your data is a very simple one.

In situations like this, machine learning algorithms can actually be used to clean the source data, if there’s a business case for it. Again, a small super-high-quality test set is essential in order to validate the efficacy of this cleaning process.
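The final lesson, using a model to clean the source data, can be sketched simply: fit a deliberately simple model, then flag every record whose stored label disagrees with the model’s prediction and route those records to human review. A hedged illustration (the function, margin parameter, and data here are hypothetical, not what the ForecastThis platform does):

```python
import numpy as np

def flag_suspect_labels(X, y, margin=0.0):
    """Flag records whose label disagrees with a simple centroid model.

    Returns a boolean mask; margin > 0 demands the predicted class be
    that much closer (in squared distance) before flagging. Illustrative
    only: a real pipeline would hold each record out of its own centroid
    and validate the flags against a trusted test set.
    """
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    d2 = ((X[:, None, :] - centroids) ** 2).sum(axis=2)  # (n, n_classes)
    pred = classes[np.argmin(d2, axis=1)]
    d_label = d2[np.arange(len(y)), np.searchsorted(classes, y)]
    d_pred = d2.min(axis=1)
    return (pred != y) & (d_label - d_pred > margin)

# Toy data: two tight clusters, with two deliberately mislabelled points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
y[3], y[60] = 1, 0  # plant two label errors
suspects = np.flatnonzero(flag_suspect_labels(X, y))
print(suspects)
```

Raising `margin` trades recall for precision: only records where the model is emphatic about the disagreement get sent for review.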

This article was first published on Analytic Bridge, November 16 2014