Binary classification with strong class imbalance can be found in many real-world classification problems. From trying to predict events such as network intrusion and bank fraud to a patient’s medical diagnosis, the goal in these cases is to be able to identify instances of the minority class — that is, the class that is underrepresented in the dataset. This, of course, presents a big challenge as most predictive models tend to ignore the more critical minority class while deceptively giving high accuracy results by favoring the majority class.

Several techniques have been used to get around the problem of class imbalance, including different sampling methods and modeling algorithms. Examples of sampling methods include adding data samples to the minority class by either duplicating the data or generating synthetic minority samples (oversampling), or randomly removing majority class data to produce a more balanced data distribution (undersampling).

Different algorithms are also more suitable for class imbalance problems, including those based on boosting, such as Adaptive Boosting (AdaBoost). AdaBoost works to improve the performance of what are known as weak learners (poor predictive models, but better than random guessing). AdaBoost iteratively builds an ensemble of weak learners by adjusting the weights of misclassified data during each iteration. For the training of the first weak learner, AdaBoost assigns equal weight to each training set sample. For each subsequent weak learner, the weights are recalculated such that a higher weight is assigned to samples that the current weak learner misclassified. This weight determines the probability that the sample will appear in the training of the next weak learner. For this reason, boosting algorithms like AdaBoost are particularly useful for class imbalance problems because higher weight is given to the minority class at each successive iteration as data from this class is often misclassified.
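To make the reweighting step concrete, here is a minimal sketch of one round of the classic discrete AdaBoost update on toy labels. It is an illustration only: the function name adaboost_reweight and the toy data are assumptions, and scikit-learn's AdaBoostClassifier uses its own variant of the update rather than this exact code.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of the classic discrete AdaBoost weight update.

    Returns the learner's vote (alpha) and the renormalized sample weights.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    # Clamp so a perfect or useless learner does not cause log(0) or division by zero.
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - err) / err)
    # Raise the weight of misclassified samples, lower the weight of correct ones.
    new_w = [w * math.exp(alpha if t != p else -alpha)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

# Toy data (hypothetical): 8 majority samples (0) and 2 minority samples (1).
y_true = [0] * 8 + [1] * 2
# A weak learner that always predicts the majority class.
y_pred = [0] * 10
weights = [1 / len(y_true)] * len(y_true)  # first round: equal weights

alpha, weights = adaboost_reweight(weights, y_true, y_pred)
minority_share = sum(w for w, t in zip(weights, y_true) if t == 1)
print(round(alpha, 3), round(minority_share, 3))
```

In this toy case the two misclassified minority samples start with 20% of the total weight and hold half of it after one update, which is the mechanism that pushes the next weak learner toward the minority class.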

Data sampling methods combined with boosting can be an effective way to deal with class imbalance problems. To demonstrate, let’s use Lending Club loan data (available here) to try to predict whether someone will default on a loan using two interesting sampling approaches that have been suggested and combined with AdaBoost: SMOTEBoost and RUSBoost.

SMOTEBoost is an oversampling method based on the SMOTE algorithm (Synthetic Minority Oversampling Technique). SMOTE uses k-nearest neighbors to create synthetic examples of the minority class. SMOTEBoost then injects the SMOTE method at each boosting iteration. The advantage of this approach is that while standard boosting gives equal weights to all misclassified data, SMOTE gives more examples of the minority class at each boosting step. Similarly, RUSBoost achieves the same goal by performing random undersampling (RUS) at each boosting iteration instead of SMOTE.
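The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched in a few lines. This is a simplified illustration under assumed names (smote_sketch, the toy points), not the imbalanced-learn implementation used later in this post.

```python
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of base (excluding base itself),
        # ranked by squared Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Toy minority class with two features (hypothetical values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=3, k=2)
print(new_points)
```

Because each synthetic point lies on a line segment between two real minority samples, the new data stays inside the region the minority class already occupies instead of simply duplicating existing rows.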

Trying it out: data preparation

First, I import the data and drop every entry whose loan_status is not “Fully Paid” or “Charged Off”. I set the target to 0 for “Fully Paid” loans and to 1 for “Charged Off” loans.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
# modules providing SMOTEBoost and RUSBoost (used further below)
import smote
import rus

# read in data and keep only loans with a final outcome
df = pd.read_csv("lending-club-loan-data/loan.csv",
                 low_memory=False)
df = df[df.loan_status.isin(['Fully Paid', 'Charged Off'])].copy()

# set target variable: 0 = Fully Paid, 1 = Charged Off
df['target'] = (df.loan_status == 'Charged Off').astype(int)

I use the following loan features to build the binary classifier:

delinq_2yrs : number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years

dti : ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income

home_ownership : home ownership status provided by the borrower during registration. Values are RENT, OWN, MORTGAGE, OTHER

grade : LC assigned loan grade

int_rate : interest rate on the loan

purpose : a category provided by the borrower for the loan request

revol_util : revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit

total_rec_late_fee : late fees received to date

features = ['grade', 'home_ownership', 'dti', 'purpose', 'int_rate',
            'delinq_2yrs', 'revol_util', 'total_rec_late_fee',
            'target']

# keep the selected columns and drop rows with missing values
df_model = df[features].dropna()

After removing any rows with missing data, the default loans represent about 18% of the chosen dataset.

Finally, before resampling and training the classifier, I transform the categorical data into a numeric representation with one-hot encoding and split the data into training and testing sets. Sampling methods are applied only to the training set, while the testing set keeps the class imbalance seen in the original dataset.

X = pd.get_dummies(df_model.drop('target', axis=1))
y = df_model['target'].tolist()

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=42)

Initial Results

I first train three models:

AdaBoost alone

SMOTE sampling followed by AdaBoost

RUS sampling followed by AdaBoost

def adaboost(X_train, X_test, y_train):
    model = AdaBoostClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred

# AdaBoost
y_baseline = adaboost(X_train, X_test, y_train)

# SMOTE
sm = SMOTE(random_state=42)
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)
y_smote = adaboost(X_train_sm, X_test, y_train_sm)

# RUS: undersample the majority class down to the size of the minority class
X_full = X_train.copy()
X_full['target'] = y_train
X_maj = X_full[X_full.target == 0]
X_min = X_full[X_full.target == 1]
X_maj_rus = resample(X_maj, replace=False, n_samples=len(X_min), random_state=44)
X_rus = pd.concat([X_maj_rus, X_min])
X_train_rus = X_rus.drop(['target'], axis=1)
y_train_rus = X_rus.target
y_rus = adaboost(X_train_rus, X_test, y_train_rus)

Results from these implementations serve as a baseline for judging any improvement gained from using SMOTEBoost and RUSBoost.

Sampling has significantly improved the recall of the minority class labeled “Default”, with the largest improvement seen from using RUS. Note that the number of samples generated or removed in this implementation is chosen so that the minority and majority classes have an equal number of samples before running AdaBoost.

Setting up SMOTEBoost and RUSBoost

For the SMOTEBoost implementation, I use the scikit-learn AdaBoost implementation with a modified fit function found here. Before each boosting step, SMOTE resampling generates new synthetic examples for the minority class. I use the default k=5 as the number of nearest neighbors used to compute the new data and set n_samples=300, the number of synthetic examples generated in each boosting step. Take care when setting n_samples that the minority class does not end up larger than the majority class, which would create the opposite class imbalance. The weights given to the new data are normalized based on the current boosting step’s training set.

Similarly, the RUSBoost implementation uses a modified fit function (found here) in which the majority class is undersampled before each boosting step. I set n_samples=300 as the number of samples to remove in each boosting step.
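As a rough illustration of what undersampling "before each boosting step" means, the sketch below draws a fresh random subset of the majority class on every round. The helper rus_round, the label convention (0 = majority, 1 = minority) and the cap on removals are assumptions made for this example; it is not the modified fit function referenced above.

```python
import random

def rus_round(y, n_remove, rng):
    """Return the indices kept for one boosting round after randomly
    dropping n_remove majority-class (label 0) samples."""
    majority = [i for i, label in enumerate(y) if label == 0]
    minority = [i for i, label in enumerate(y) if label == 1]
    # Assumed safeguard: never shrink the majority below the minority size.
    n_remove = min(n_remove, max(len(majority) - len(minority), 0))
    dropped = set(rng.sample(majority, n_remove))
    return [i for i in range(len(y)) if i not in dropped]

rng = random.Random(42)
y = [0] * 20 + [1] * 5  # toy labels (hypothetical)
for round_no in range(3):
    kept = rus_round(y, n_remove=10, rng=rng)
    n_major = sum(1 for i in kept if y[i] == 0)
    n_minor = sum(1 for i in kept if y[i] == 1)
    # Each round keeps all minority samples and a different majority subset.
    print(round_no, n_major, n_minor)
```

Since every round sees a different slice of the majority class, the ensemble as a whole still uses most of the majority data even though each weak learner trains on a more balanced set.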

target_names = ['Not Default', 'Default']

for algorithm in [smote.SMOTEBoost(n_estimators=100, n_samples=300),
                  rus.RUSBoost(n_estimators=100, n_samples=300)]:
    algorithm.fit(X_train, y_train)
    y_pred = algorithm.predict(X_test)
    print()
    print(str(algorithm))
    print()
    print(classification_report(y_test, y_pred,
                                target_names=target_names))

The full implementation of this loan default example can be found here.

Assessing results

For problems with class imbalance, metrics such as precision, recall, and f1-score give good insight into how a classifier performs with respect to the minority class. Depending on the problem, the goal is to optimize the precision and/or the recall of the classifier. In this case, I want a model that catches as many instances of the minority class as possible, even if that increases the number of false positives. A classifier with a high recall score will flag the greatest number of potential loan defaults, or at least raise a flag on the most vulnerable cases. In this example, RUSBoost slightly outperforms SMOTEBoost on recall of the minority class for the given setup.
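A small worked example shows why accuracy alone is misleading here and what precision and recall measure for the minority class. The labels and the two sets of predictions below are made up for illustration, and minority_metrics is a hypothetical helper, not part of scikit-learn.

```python
def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def minority_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy test set: 8 non-defaults (0) and 2 defaults (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_majority = [0] * 10                      # never flags a default
flags_more = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]     # flags both defaults plus 3 false alarms

for name, y_pred in [("always majority", always_majority), ("flags more", flags_more)]:
    print(name, accuracy(y_true, y_pred), minority_metrics(y_true, y_pred))
```

The first classifier has the higher accuracy (0.8 versus 0.7) yet catches no defaults at all, while the second reaches a recall of 1.0 at the cost of lower precision. That is the trade-off accepted in this post.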

SMOTEBoost:

RUSBoost:

So, how does this compare to just using AdaBoost without sampling, or using each sampling method just once before running AdaBoost? The results of each implementation for the minority class are summarized below.

The best sampling method for this modeling setup is RUS. It, by far, gives the highest recall for the minority class.

Other useful methods to evaluate a classifier with class imbalance are ROC and precision-recall (PR) curves. Since I’m not as interested in how the model performs on the negative class (‘Not Default’ class), I plot just the precision-recall curves, which do not consider true negatives. These PR curves can be used to compare RUS to RUSBoost results.
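To show what a PR curve is made of, the sketch below sweeps a decision threshold over predicted scores and records precision and recall at each step. The scores and labels are invented for the example and pr_points is a hypothetical helper; in practice scikit-learn's precision_recall_curve, imported earlier, performs this kind of sweep on real model scores.

```python
def pr_points(y_true, scores):
    """(threshold, precision, recall) triples obtained by thresholding
    at each distinct score, from the strictest threshold to the loosest."""
    n_pos = sum(y_true)
    points = []
    for thr in sorted(set(scores), reverse=True):
        pred = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
        # Only true positives, false positives and the number of actual
        # positives are needed; true negatives never enter the calculation.
        points.append((thr, tp / (tp + fp), tp / n_pos))
    return points

# Toy labels (1 = default) and predicted default scores (hypothetical).
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

points = pr_points(y_true, scores)
for thr, precision, recall in points:
    print(f"threshold={thr:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold moves along the curve toward higher recall and, usually, lower precision, so two models can be compared over the whole range instead of at a single cut-off.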

Conclusion

While SMOTEBoost holds the promise of improving AdaBoost on unbalanced data by broadening the scope of the minority class, in this case it proved to be the weakest of the sampling approaches for the given model setup. Similarly, RUSBoost did not perform as well as simple RUS sampling followed by AdaBoost.

In general, both sampling methods have their drawbacks, such as increased model training time for any method that generates additional data, and loss of valuable data for any undersampling technique. Also, when approaching any classification problem with class imbalance, or any machine learning problem for that matter, proper data preparation is critical. That includes making sure the quality of the data is sound, since ensemble methods attempt to correct any misclassification. There is a lot of room for improvement in massaging the loan data used in this example so that it gets the most out of the different sampling techniques. The same can be said about the implementation of the different sampling and boosting methods. Some approaches are better suited to specific types of problems and to specific degrees of imbalance in the data.

At Urbint, we help urban operators solve complex infrastructure problems by producing actionable insights and predicting often highly critical events. In our work, class imbalance is a frequent occurrence. Most events we are trying to predict are rare compared to the negative class. Sampling and boosting are just some of the techniques we employ to optimize the performance of our models.