0. Introduction and motivation

Binary classification is arguably one of the simplest and most common problems in Machine Learning. The goal is to learn a model that predicts whether an instance belongs to a given class or not. It has many practical applications, ranging from email spam detection to medical testing (determining whether a patient has a certain disease).

Slightly more formally, the goal of binary classification is to learn a function f(x) that maps x (a vector of features describing an instance/example) to a predicted binary outcome ŷ (0 or 1). Most classification algorithms, such as logistic regression, Naive Bayes and decision trees, output a probability of an instance belonging to the positive class: Pr(y=1|x).
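As a minimal sketch of this setup (assuming scikit-learn and NumPy are available, and using a made-up one-dimensional toy dataset), a logistic regression model produces Pr(y=1|x) via `predict_proba`, which we can threshold at 0.5 to obtain the binary prediction ŷ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy data: a single feature; positive instances tend to have larger x.
X = rng.normal(loc=[0.0] * 50 + [2.0] * 50, scale=1.0).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # Pr(y=1|x) for each instance
y_hat = (proba >= 0.5).astype(int)   # default 0.5 decision threshold
```

The 0.5 threshold is only the default choice; later sections on class imbalance make it clear that moving this threshold is one of the knobs we can turn.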

Class imbalance occurs when the classes are not represented equally in a classification problem, which is quite common in practice: for instance, in fraud detection, prediction of rare adverse drug reactions, and prediction of gene families (e.g. Kinase, GPCR). Failure to account for class imbalance often degrades the predictive performance of many classification algorithms. In this post, I will introduce a couple of practical tips for combating class imbalance in binary classification, most of which can be easily adapted to multi-class scenarios.
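To see why ignoring imbalance is dangerous, consider a hypothetical fraud-detection dataset (made up here for illustration) where only 1% of instances are positive. A degenerate classifier that always predicts the majority class looks excellent by accuracy while detecting nothing:

```python
import numpy as np

# Hypothetical labels: 990 legitimate transactions, 10 fraudulent ones.
y_true = np.array([0] * 990 + [1] * 10)
# A useless "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()   # 0.99 — looks great on paper
recall = y_pred[y_true == 1].mean()    # 0.0 — not a single fraud caught
```

This is why metrics such as precision, recall, and AUC matter more than raw accuracy under class imbalance, a theme the tips below build on.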