When you apply your trained model to the validation set or to the test set, you need statistical scores to measure your performance.

In fact, in a typical supervised binary classification problem, each element of the validation set (or test set) has a label stating whether the element is positive or negative (usually 1 or 0). Your machine learning algorithm makes a prediction for each element of the validation set, stating whether it is positive or negative, and, based upon these predictions and the gold-standard labels, it assigns each element to one of the following categories: true negatives (TN), true positives (TP), false positives (FP), false negatives (FN) (Table 1).

Table 1 The confusion matrix: each pair (actual value; predicted value) falls into one of the four listed categories

If many elements of the set fall into the first two classes (TP or TN), this means that your algorithm was able to correctly predict as positive the elements that were positive in the validation set (TP), or to correctly classify as negative the instances that were negative in the validation set (TN). On the contrary, many FP instances mean that your method wrongly classified as positive many elements that are negative in the validation set. Likewise, many FN elements mean that the classifier wrongly predicted as negative many elements that are positive in the validation set.
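The four categories can be tallied directly from the gold-standard labels and the predictions; a minimal sketch in Python (the toy labels and predictions below are illustrative):

```python
# Sketch: tally the four confusion-matrix categories from gold-standard
# labels and binary predictions (1 = positive, 0 = negative).
def confusion_matrix(labels, predictions):
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            tp += 1   # true positive
        elif y == 0 and p == 1:
            fp += 1   # false positive
        elif y == 0 and p == 0:
            tn += 1   # true negative
        else:
            fn += 1   # false negative
    return tp, fp, tn, fn

print(confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```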

In order to have an overall understanding of your prediction, you decide to take advantage of common statistical scores, such as accuracy (Eq. 1), and F1 score (Eq. 2).

$$ accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$ (1)

(accuracy: worst value =0; best value =1)

$$ F1 \; score = \frac{2 \cdot TP}{2 \cdot TP+FP+FN} $$ (2)

(F1 score: worst value =0; best value =1)
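Both equations translate directly into code; a minimal sketch (the example counts are made up):

```python
def accuracy(tp, tn, fp, fn):
    """Eq. 1 -- fraction of correctly classified elements (worst 0, best 1)."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Eq. 2 -- harmonic mean of precision and recall (worst 0, best 1)."""
    return 2 * tp / (2 * tp + fp + fn)

print(accuracy(tp=50, tn=40, fp=5, fn=5))  # 0.9
print(f1_score(tp=50, fp=5, fn=5))         # ≈ 0.909
```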

However, even if accuracy and F1 score are widely employed in statistics, both can be misleading, since they do not fully consider the size of the four classes of the confusion matrix in their final score computation.

Suppose, for example, you have a very imbalanced validation set made of 100 elements, 95 of which are positive elements, and only 5 are negative elements (as explained in Tip 5). And suppose also you made some mistakes in designing and training your machine learning classifier, and now you have an algorithm which always predicts positive. Imagine that you are not aware of this issue.

By applying your only-positive predictor to your imbalanced validation set, you obtain the following values for the confusion matrix categories:

TP = 95, FP = 5; TN = 0, FN = 0.

These values lead to the following performance scores: accuracy = 95%, and F1 score = 97.44%. Reading these over-optimistic scores, you would be very happy and would think that your machine learning algorithm is doing an excellent job. Obviously, you would be on the wrong track.
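Plugging the counts of the only-positive predictor into Eqs. 1 and 2 reproduces these over-optimistic scores:

```python
# Only-positive predictor on the imbalanced set (95 positives, 5 negatives).
tp, fp, tn, fn = 95, 5, 0, 0

acc = (tp + tn) / (tp + tn + fp + fn)  # Eq. 1
f1 = 2 * tp / (2 * tp + fp + fn)       # Eq. 2

print(f"accuracy = {acc:.2%}")  # 95.00%
print(f"F1 score = {f1:.2%}")   # 97.44%
```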

To avoid such dangerous misleading illusions, there is another performance score that you can exploit: the Matthews correlation coefficient [40] (MCC, Eq. 3).

$$ MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\cdot(TP+FN)\cdot(TN+FP)\cdot(TN+FN)}} $$ (3)

(MCC: worst value =−1; best value =+1).

Because its formula takes into account the size of each class of the confusion matrix, the MCC score is high only if your classifier is doing well on both the negative and the positive elements.

In the example above, the MCC score would be undefined (since TN and FN would be 0, therefore the denominator of Eq. 3 would be 0). By checking this value, instead of accuracy and F1 score, you would then be able to notice that your classifier is going in the wrong direction, and you would become aware that there are issues you ought to solve before proceeding.
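Eq. 3 can be implemented with an explicit guard for the undefined case; in this sketch the undefined value is reported as None (some implementations instead return 0 by convention):

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient (Eq. 3); None when undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return None  # e.g. the only-positive predictor: TN = FN = 0
    return (tp * tn - fp * fn) / denominator

print(mcc(95, 5, 0, 0))  # None -- alarm: the classifier ignores one class
```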

Let us consider another example. You ran a classification on the same dataset, which led to the following values for the confusion matrix categories:

TP = 90, FP = 5; TN = 1, FN = 4.

In this example, the classifier has performed well in classifying positive instances, but was not able to correctly recognize negative data elements. Again, the resulting accuracy and F1 score would be extremely high: accuracy = 91%, and F1 score = 95.24%. Similarly to the previous case, a researcher who analyzed only these two score indicators, without considering the MCC, would wrongly think the algorithm is performing quite well in its task, and would have the illusion of being successful.

On the other hand, checking the Matthews correlation coefficient would be pivotal once again. In this example, the value of the MCC would be 0.14 (Eq. 3), indicating that the algorithm is performing similarly to random guessing. Acting as an alarm, the MCC would be able to inform the data mining practitioner that the statistical model is performing poorly.
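The arithmetic behind this value, following Eq. 3:

```python
import math

# Confusion-matrix counts of the second example.
tp, fp, tn, fn = 90, 5, 1, 4

numerator = tp * tn - fp * fn  # 90*1 - 5*4 = 70
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"MCC = {numerator / denominator:.2f}")  # MCC = 0.14
```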

For these reasons, we strongly encourage you to evaluate each test performance through the Matthews correlation coefficient (MCC), instead of the accuracy and the F1 score, for any binary classification problem.

In addition to the Matthews correlation coefficient, another evaluation tool that you will find helpful is the Precision-Recall curve. Often you will not have binary labels (for example, true and false) for the negative and positive elements in your predictions, but rather a real value for each prediction, in the [0,1] interval. In this common case, you can use each possible prediction value as a threshold for the confusion matrix.

Therefore, you will end up having a real-valued array for each of the FN, TN, FP, and TP categories. To measure the quality of your performance, you will be able to choose between two common curves, for each of which you can compute the area under the curve (AUC): the receiver operating characteristic (ROC) curve (Fig. 3b), and the Precision-Recall (PR) curve (Fig. 3a) [41].
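A minimal sketch of this thresholding procedure in pure Python (the labels and prediction scores below are illustrative):

```python
def curve_points(labels, scores):
    """Sweep each predicted score as a threshold and collect
    (fallout, recall) points for the ROC curve and
    (recall, precision) points for the PR curve."""
    roc, pr = [], []
    for threshold in sorted(set(scores), reverse=True):
        predictions = [1 if s >= threshold else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
        tn = sum(p == 0 and y == 0 for p, y in zip(predictions, labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
        recall = tp / (tp + fn)     # Eqs. 4 and 5
        fallout = fp / (fp + tn)    # Eq. 4
        precision = tp / (tp + fp)  # Eq. 5
        roc.append((fallout, recall))
        pr.append((recall, precision))
    return roc, pr

roc, pr = curve_points([1, 0, 1, 1, 0], [0.9, 0.6, 0.8, 0.4, 0.2])
```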

Fig. 3 a Example of Precision-Recall curve, with the precision score on the y axis and the recall score on the x axis (Tip 8). The grey area is the Precision-Recall area under the curve (AUPRC). b Example of receiver operating characteristic (ROC) curve, with the recall (true positive rate) score on the y axis and the fallout (false positive rate) score on the x axis (Tip 8). The grey area is the ROC area under the curve (AUROC)

The ROC curve is computed through recall (true positive rate, sensitivity) on the y axis and fallout (false positive rate, or 1 − specificity) on the x axis:

ROC curve axes:

$$ recall = \frac{TP}{TP+FN} \qquad \qquad \qquad fallout = \frac{FP}{FP+TN} $$ (4)

In contrast, the Precision-Recall curve has precision (positive predictive value) on the y axis and recall (true positive rate, sensitivity) on the x axis:

Precision-Recall curve axes:

$$ precision = \frac{TP}{TP+FP} \qquad \qquad \qquad recall = \frac{TP}{TP+FN} $$ (5)

Usually, the evaluation of the performance is made by computing the area under the curve (AUC) of these two curves: the greater the AUC, the better the model is performing.
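The AUC of either curve can be approximated with the trapezoidal rule over its points; a minimal sketch (the example points are illustrative):

```python
def auc(points):
    """Trapezoidal area under a curve given as (x, y) points."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Illustrative ROC points (fallout, recall): a perfect classifier gives AUC = 1.
print(auc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))  # 1.0
```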

As one can notice, the optimization of the ROC curve tends to maximize the correctly classified positive values (TP, which are present in the numerator of the recall formula), and the correctly classified negative values (TN, which are present in the denominator of the fallout formula).

In contrast, the optimization of the PR curve tends to maximize the correctly classified positive values (TP, which are present in both the precision and the recall formulas), and does not directly consider the correctly classified negative values (TN, which are absent from both formulas).

In computational biology, we often have very imbalanced datasets with many negative instances and few positive instances. Therefore, we prefer to avoid involving the true negatives in our prediction score. In addition, ROC and AUROC present further disadvantages related to their interpretation in specific clinical domains [42].

For these reasons, the Precision-Recall curve is a more reliable and informative indicator for your statistical performance than the receiver operating characteristic curve, especially for imbalanced datasets [43].

Other useful techniques to assess the statistical significance of machine learning predictions are permutation testing [44] and bootstrapping [45].