The problem essentially boils down to classification. We have two classes that we want to distinguish between: Gibberish and Non-Gibberish. Any algorithm that can do this can be called a classifier.

One of the standard ways of assessing classifier performance in machine learning is the Receiver Operating Characteristic, or ROC, curve. Most classifiers for a binary true/false case predict a numerical score, where higher means more likely to be true. A threshold value is then needed to turn that spectrum of scores into a true or false decision.
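As a rough sketch of what that thresholding step looks like (the scores and threshold here are invented for illustration, not taken from our system):

```python
import numpy as np

# Hypothetical classifier scores: higher means "more likely gibberish".
scores = np.array([0.05, 0.40, 0.62, 0.91, 0.17])

# A threshold turns the continuous scores into binary predictions.
threshold = 0.5
is_gibberish = scores >= threshold  # array([False, False,  True,  True, False])
```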

If the classifier separates the two classes well, such that there is no overlap between the scores given to the known true and known false examples, then a threshold can be placed in the middle, giving a perfect result.

Most of the time, however, the classifier doesn't separate the two entirely, and the overlap in scores causes some misclassifications regardless of where the threshold is placed. The ROC curve plots the true positive rate against the false positive rate for every potential threshold value, allowing us to see at a glance how well the two groups have been separated. The separation can be quantified by calculating the area under the curve, or AUC. This provides a handy single number for comparing classifiers, independent of any particular threshold value.
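If you're working in Python, one way to compute the curve and its AUC might look like the following; the labels and scores here are made up, and we're not claiming this is exactly how our pipeline does it:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Known labels (1 = gibberish, 0 = not) and the classifier's scores for them.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55, 0.7])

# False positive rate and true positive rate at each candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve: 1.0 is perfect separation, 0.5 is random guessing.
auc = roc_auc_score(y_true, y_score)
print(auc)
```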

In production, however, a threshold value does need to be chosen. The standard way of measuring the success of a classifier is accuracy. This is defined as the percentage of correctly classified results, i.e. (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative). You can see the definitions of these for our specific case below.
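In code, accuracy is just that ratio of correct predictions. A minimal sketch, with invented counts:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# e.g. 90 true positives, 850 true negatives, 40 false positives, 20 false negatives
print(accuracy(tp=90, tn=850, fp=40, fn=20))  # 0.94
```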

Accuracy has some problems in our case though. It's far more important for us to avoid misclassifying valuable data as gibberish and removing it than it is to let some gibberish slip through. With accuracy alone we have no way of knowing what type of errors our classifier makes. Because of this, we chose to use a different metric.

Positive Predictive Value (PPV) measures how often the classifier is correct when it says something is gibberish, i.e. (True Positive) / (True Positive + False Positive). There is a problem with judging performance on this value alone, though: if the classifier predicted gibberish only once out of hundreds of pieces of gibberish and was correct, it would have a PPV of 100%; this is a form of overfitting. To prevent overfitting we decided to pick threshold values and train using maximum accuracy, as is standard, but, crucially, to use PPV to compare different algorithms and to measure how successful we have been. After some discussion, we decided that a PPV of 90% was a good criterion for success.
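A sketch of that process, picking the threshold that maximises accuracy on a labelled set and then reporting PPV at that threshold; the data and helper names are synthetic, purely to show the shape of the approach:

```python
import numpy as np

def ppv(y_true, y_pred):
    """Of everything flagged as gibberish, what fraction really was gibberish."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def pick_threshold(y_true, y_score):
    """Choose the threshold with the highest accuracy on the labelled set."""
    best_threshold, best_accuracy = 0.0, 0.0
    for threshold in np.unique(y_score):
        y_pred = (y_score >= threshold).astype(int)
        acc = np.mean(y_pred == y_true)
        if acc > best_accuracy:
            best_threshold, best_accuracy = threshold, acc
    return best_threshold

# Synthetic labelled data for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.4, 0.7, 0.55, 0.45])

t = pick_threshold(y_true, y_score)
y_pred = (y_score >= t).astype(int)
print(t, ppv(y_true, y_pred))
```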

On to the actual algorithms…