Metrics computed from the confusion matrix

First we'll extract the True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) from the obtained confusion matrix.

# True Positives
TP = confusion[1, 1]

# True Negatives
TN = confusion[0, 0]

# False Positives
FP = confusion[0, 1]

# False Negatives
FN = confusion[1, 0]

We can calculate the following metrics from the confusion matrix.

Classification accuracy

Classification accuracy is the ratio of correct predictions to the total number of predictions. Or more simply, how often is the classifier correct?

Accuracy = correct predictions / total predictions

We can also calculate the accuracy directly from the confusion matrix using the following equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy can also be calculated using the accuracy_score method. We can observe that the accuracy is 0.795.

print((TP + TN) / float(TP + TN + FP + FN))
print(accuracy_score(y_test, y_pred))

OUTPUT:

0.795580110497
0.795580110497

Sensitivity/Recall

Sensitivity, or recall, is the ratio of correctly predicted positive instances to the total number of actual positive instances. Or more simply, how sensitive the classifier is at detecting positive instances. This is also called the True Positive Rate.

Recall = correctly predicted positives / total actual positives

Using the confusion matrix, recall can be calculated as follows:

Recall = TP / (TP + FN)

Also, Scikit-learn provides a method called recall_score to find the recall score. We can observe that the classifier has a recall score of 0.58.

print(TP / float(TP + FN))
print(recall_score(y_test, y_pred))

OUTPUT:

0.58064516129
0.58064516129

Specificity

Specificity is the ratio of correctly predicted negative instances to the total number of actual negative instances. In other words, when the actual value is negative, how often is the classifier correct?

Specificity = correctly predicted negatives / total actual negatives

We can calculate specificity from the confusion matrix as follows:

Specificity = TN / (TN + FP)

print(TN / float(TN + FP))

OUTPUT:

0.90756302521
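Scikit-learn does not ship a dedicated specificity function, but specificity is simply the recall of the negative class, so recall_score can compute it by switching the positive label. A minimal sketch, assuming the same y_test and y_pred as above:

from sklearn.metrics import recall_score

# specificity = recall of the negative class (label 0)
print(recall_score(y_test, y_pred, pos_label=0))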

False Positive Rate

The false positive rate is the ratio of negative instances that were incorrectly predicted as positive to the total number of actual negative instances. Or, when the actual value is negative, how often is the prediction incorrect?

False Positive Rate = negatives incorrectly predicted as positive / total actual negatives

This can be calculated from the confusion matrix as follows (note that it equals 1 - Specificity):

False Positive Rate = FP / (TN + FP)

print(FP / float(TN + FP))

OUTPUT:

0.0924369747899

Precision

Precision is the ratio of correctly predicted positive instances to the total number of instances predicted as positive. This measures how precise the classifier is when predicting positive instances.

Precision = correctly predicted positives / total predicted positives

This can be calculated from the confusion matrix as follows:

Precision = TP / (TP + FP)

Scikit-learn provides the precision_score method to calculate precision. We can observe that the precision is 0.76.

print(TP / float(TP + FP))
print(precision_score(y_test, y_pred))

OUTPUT:

0.765957446809
0.765957446809
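If you want precision, recall, and the F1 score for every class in one place, Scikit-learn's classification_report prints them all in a single call. A minimal sketch, assuming the same y_test and y_pred as above:

from sklearn.metrics import classification_report

# per-class precision, recall, F1 score and support in one report
print(classification_report(y_test, y_pred))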

Confusion matrix advantages:

A variety of metrics can be derived from it.

It is useful for multi-class problems as well (see the sketch below).

NOTE : Choosing which metric to use depends on the business objective or the nature of the problem.
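As an illustration of the multi-class point above, confusion_matrix works unchanged when the labels take more than two values; row i counts the actual instances of class i and column j counts how many of them were predicted as class j. A minimal sketch with hypothetical three-class labels (not from the diabetes dataset):

from sklearn.metrics import confusion_matrix

# hypothetical true and predicted labels for a three-class problem
y_true_mc = [0, 0, 1, 1, 2, 2, 2]
y_pred_mc = [0, 1, 1, 1, 2, 0, 2]

# rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true_mc, y_pred_mc))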

Adjusting Classification Threshold

It’s possible to adjust the logistic regression model’s classification threshold to increase the model’s sensitivity.

After training, the model exposes a method called predict_proba, which returns the predicted probability of each response class for the test data. From this, we'll take the probabilities of predicting a diabetic result.

# store the predicted probabilities for class 1 (diabetic)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

Next we’ll plot the probability of becoming diabetic in a histogram.

plt.hist(y_pred_prob, bins=8, linewidth=1.2)
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability of diabetes')
plt.ylabel('Frequency')

Since it's a binary classification problem, the default classification probability threshold is 0.5: if the probability is less than 0.5, the instance is classified as "0 (non-diabetic)", and if it's more than 0.5 it's classified as "1 (diabetic)".

We can use Scikit-learn's binarize function to lower the threshold to 0.3: if the probability is 0.3 or less the instance will be classified as "0 (non-diabetic)", and if it's greater than 0.3 it will be classified as "1 (diabetic)".

# predict diabetes if the predicted probability is greater than 0.3
from sklearn.preprocessing import binarize

y_pred_class = binarize([y_pred_prob], threshold=0.3)[0]
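Equivalently, the same thresholding can be done with a plain NumPy comparison, which avoids the extra reshaping; a minimal alternative sketch, assuming y_pred_prob is the probability array computed above:

import numpy as np

# 1 where the predicted probability exceeds 0.3, otherwise 0 (same result as binarize)
y_pred_class = (y_pred_prob > 0.3).astype(int)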

Next we’ll print the confusion matrix for the new threshold predictions, and compare with the original.

# new confusion matrix (threshold of 0.3)
confusion_new = confusion_matrix(y_test, y_pred_class)
print(confusion_new)

Fig — New Confusion Matrix

TP = confusion_new[1, 1]

TN = confusion_new[0, 0]

FP = confusion_new[0, 1]

FN = confusion_new[1, 0]

Next we’ll calculate sensitivity and specificity to observe the changes from the previous confusion matrix calculations.

Previously the calculated sensitivity was 0.58. We can observe that the sensitivity has increased, which means the model is now more sensitive in detecting "positive (diabetic)" instances.

# sensitivity has increased
print(TP / float(TP + FN))
print(recall_score(y_test, y_pred_class))

OUTPUT:

0.870967741935
0.870967741935

Using the same process, we can calculate the specificity for the new confusion matrix. Previously it was 0.90. We observe that it has decreased.

# specificity has decreased
print(TN / float(TN + FP))

OUTPUT:

0.689075630252

We adjust the threshold of a classifier in order to suit the problem we’re trying to solve.

In the case of a spam filter (where the positive class is "spam"), we optimize for precision: it's more acceptable to have false negatives (spam that reaches the inbox) than false positives (legitimate mail caught by the spam filter). In the case of a fraudulent transaction detector (where the positive class is "fraud"), we optimize for sensitivity: it's more acceptable to have false positives (normal transactions flagged as possible fraud) than false negatives (fraudulent transactions that go undetected).
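One practical way to act on such a requirement is to sweep candidate thresholds and pick the lowest one that meets the target. A minimal sketch, assuming the y_test and y_pred_prob computed earlier; the 0.95 precision target is just an illustrative number:

from sklearn.metrics import precision_recall_curve

# precision and recall at every candidate threshold
precision, recall, thresholds_pr = precision_recall_curve(y_test, y_pred_prob)

# lowest threshold whose precision meets the target
target_precision = 0.95
for p, t in zip(precision[:-1], thresholds_pr):
    if p >= target_precision:
        print('Use threshold:', t)
        break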

ROC curve

An ROC curve is a commonly used way to visualize the performance of a binary classifier, meaning a classifier with two possible output classes. The curve plots the True Positive Rate (Recall) against the False Positive Rate (which equals 1 - Specificity).

Scikit-learn provides a method called roc_curve to find the false positive and true positive rates across various thresholds, which we can use to draw the ROC curve. We can plot the curve as follows.

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for diabetes classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

Fig — ROC curve

The thresholds used to generate the ROC curve are not shown on the curve itself, but we can use the following helper function to find the sensitivity and specificity at a given threshold.

def evaluate_threshold(threshold):
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])

The following is an example to show how the sensitivity and specificity behave with several thresholds.

evaluate_threshold(0.3)

OUTPUT:

Sensitivity: 0.870967741935
Specificity: 0.705882352941

evaluate_threshold(0.5)

OUTPUT:

Sensitivity: 0.58064516129
Specificity: 0.90756302521

The ROC curve is a reliable indicator of a classifier's performance. It can also be extended to classification problems with three or more classes using the "one versus all" approach, as sketched below.
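In the one-versus-all approach, each class is treated in turn as the positive class and all remaining classes as negative, giving one ROC curve per class. A minimal sketch with hypothetical three-class labels and per-class probability scores; these arrays are assumptions for illustration, not part of the diabetes example:

import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve

# hypothetical labels and per-class predicted probabilities (one column per class)
y_true_mc = np.array([0, 1, 2, 2, 1, 0])
y_score_mc = np.array([[0.7, 0.2, 0.1],
                       [0.2, 0.5, 0.3],
                       [0.1, 0.3, 0.6],
                       [0.2, 0.2, 0.6],
                       [0.3, 0.4, 0.3],
                       [0.6, 0.3, 0.1]])

# one indicator column per class: 1 where the true label is that class, else 0
y_true_bin = label_binarize(y_true_mc, classes=[0, 1, 2])

# one ROC curve per class, treating that class as positive and the rest as negative
for i in range(3):
    fpr_i, tpr_i, _ = roc_curve(y_true_bin[:, i], y_score_mc[:, i])
    print('Class', i, 'FPR:', fpr_i, 'TPR:', tpr_i)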

AUC (Area Under the Curve)

AUC or Area Under the Curve is the percentage of the ROC plot that is underneath the curve. AUC is useful as a single number summary of classifier performance.

In Scikit-learn, we can find the AUC score using the method roc_auc_score .

print(roc_auc_score(y_test, y_pred_prob))

OUTPUT:

0.858769314177

Also, the cross_val_score method, which performs K-fold cross-validation, accepts roc_auc as its scoring parameter. Therefore, we can measure the AUC score using the cross-validation procedure as well.

cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()

OUTPUT:

0.83743085106382975

ROC/AUC advantages:

Setting a classification threshold is not required.

Useful even when there is a high class imbalance.
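To see why the second point matters, consider a hypothetical, highly imbalanced set of 20 labels with only 2 positives: a model that gives every instance the same score reaches 90% accuracy by effectively predicting the majority class, yet its AUC is only 0.5. The toy arrays below are assumptions for illustration, not from the diabetes dataset:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# hypothetical imbalanced labels: 18 negatives, 2 positives
y_true_imb = np.array([0] * 18 + [1] * 2)

# a useless model: the same score for every instance, so every prediction is the majority class
constant_scores = np.full(20, 0.1)
constant_preds = np.zeros(20, dtype=int)

print(accuracy_score(y_true_imb, constant_preds))  # 0.9, which looks deceptively good
print(roc_auc_score(y_true_imb, constant_scores))  # 0.5, showing no discriminative power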

Summary

In this article, we explored the evaluation of classification models. We discussed why model evaluation is needed and the main evaluation procedures that are used, such as the "train/test split" and "k-fold cross validation".

Next we covered model evaluation metrics in detail, along with code samples using Scikit-learn: "classification accuracy", the "confusion matrix", the "ROC curve" and the "area under the curve".

Now you should be able to confidently evaluate a classification model and choose the best performing model for a given dataset using the knowledge gained from this article.

Source code that created this post can be found below.

If you have any problems or questions regarding this article, please don’t hesitate to leave a comment below or drop me an email: lahiru.tjay@gmail.com

Hope you enjoyed the article. Cheers!

Discuss this post on Hacker News.