PU Learning in Action

To showcase this, I’ll work through a small example using the Banknote dataset. It has two classes, inauthentic and authentic, denoted by 0 and 1 respectively.

The background of the dataset isn’t all that important because we aren’t going to try any feature engineering or classification on a test set. Instead, we’re going to simulate a situation in which there are some reliable positive cases and many unreliable negatives (which could be a mix of positives and negatives). Okay, let’s get to it! 💪

To get started, I’ll import the data and inspect it to see the original label value counts and see if there are any null values:

Data import and check

Head of the imported banknote data
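The embedded snippet doesn’t render here, so here’s a minimal sketch of the import-and-check step. The file name banknotes.csv and the target column name are assumptions; adjust them to your copy of the data:

```python
import pandas as pd

# Load the banknote data (file and column names are placeholders).
df = pd.read_csv("banknotes.csv")

# Original label value counts and a null check.
print(df["target"].value_counts())
print(df.isnull().sum())

# Peek at the first few rows.
print(df.head())
```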

Let’s simulate a scenario of unreliable data. First, we’ll balance the classes evenly, with 610 samples of class 0 and 610 of class 1. Then we’ll mislabel some of the positive samples as negative (essentially hiding them) to see if the models can recover them.

Balancing the data and mislabeling
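Again, in case the embed doesn’t render, here’s a rough sketch of what this step could look like, continuing from the frame above (the random seed and the choice of which 300 positives to hide are my own assumptions):

```python
import numpy as np
import pandas as pd

# Downsample class 0 so both classes have 610 rows.
positives = df[df["target"] == 1]
negatives = df[df["target"] == 0].sample(n=len(positives), random_state=42)
balanced = pd.concat([positives, negatives]).reset_index(drop=True)

# Keep the original labels so we can score recovery later.
y_true = balanced["target"].to_numpy().copy()

# Hide 300 of the 610 positives by flipping their label to 0.
rng = np.random.default_rng(42)
hidden = rng.choice(np.where(y_true == 1)[0], size=300, replace=False)
y_observed = y_true.copy()
y_observed[hidden] = 0

X = balanced.drop(columns="target").to_numpy()
print((y_observed == 1).sum())  # 310 labelled positives remain
```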

As seen above, 300 of the 610 true positives were relabelled as negative, precisely so we can test whether our PU learners recover them. To summarize what I did:

- 1220 samples and 4 features
- 610 positive out of 1220 before hiding labels
- 310 positive out of 1220 after hiding labels

Pseudo class imbalance

To start, let’s set a benchmark. I’m going to train a standard random forest classifier and then compare the result to the original labels to see how many of the hidden positives it recovers.
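A minimal version of this benchmark could look like the following, continuing from the arrays above (the hyperparameters are my own guesses, not necessarily the ones used here). Note that we fit on the observed labels but score against the true ones:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Fit on the partly mislabelled data, then predict the same rows.
rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf.fit(X, y_observed)
y_pred = rf.predict(X)

# Compare predictions against the ORIGINAL labels.
print(pd.crosstab(y_true, y_pred, rownames=["true"], colnames=["pred"]))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
```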

```
---- Standard Random Forest ----
               pred_negative  pred_positive
true_negative          610.0            0.0
true_positive          300.0          310.0

Precision: 1.0
Recall: 0.5081967213114754
Accuracy: 0.7540983606557377
```

As you can see, the standard random forest didn’t do very well at predicting the hidden positives. Only 51% recall: it simply predicted positive for the 310 samples that were still labelled positive and recovered none of the 300 hidden ones. Let’s extend this further by jumping into PU bagging.

PU Bagging

If you’ll recall from the explanation above, PU bagging is an approach that trains many classifiers in parallel, each on a rebalanced view of the data: all of the positives plus a sample of the unlabelled points downsampled to the size of the positive class. Since each of those classifiers can itself be an ensemble, it’s basically an ensemble of ensembles. This script written by Roy Wright is a great implementation of PU bagging, so we’ll be using it as a wrapper:
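Since the embed doesn’t render here, below is my own minimal sketch of the PU bagging idea rather than Roy Wright’s actual code: each round trains a classifier on all the positives versus a same-sized bootstrap of the unlabelled pool, and each unlabelled point accumulates an average score from the rounds where it was out of bag. The base estimator and number of rounds are my choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

pos_idx = np.where(y_observed == 1)[0]   # reliable positives
unl_idx = np.where(y_observed == 0)[0]   # unlabelled pool

n_rounds = 1000
scores = np.zeros(len(unl_idx))
counts = np.zeros(len(unl_idx))
rng = np.random.default_rng(42)

for _ in range(n_rounds):
    # Bootstrap as many unlabelled points as there are positives.
    boot = rng.choice(len(unl_idx), size=len(pos_idx), replace=True)
    X_train = np.vstack([X[pos_idx], X[unl_idx[boot]]])
    y_train = np.concatenate([np.ones(len(pos_idx)), np.zeros(len(pos_idx))])

    clf = DecisionTreeClassifier().fit(X_train, y_train)

    # Score only the out-of-bag unlabelled points this round.
    oob = np.setdiff1d(np.arange(len(unl_idx)), boot)
    scores[oob] += clf.predict_proba(X[unl_idx[oob]])[:, 1]
    counts[oob] += 1

# Average positive probability per unlabelled point; threshold to predict.
pu_scores = scores / np.maximum(counts, 1)
recovered = (pu_scores >= 0.5).sum()
```

Thresholding at 0.5 is just the simplest choice; in practice you might pick a cutoff by inspecting the score distribution.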

```
---- PU Bagging ----
               pred_negative  pred_positive
true_negative          610.0            0.0
true_positive           32.0          578.0

Precision: 1.0
Recall: 0.9475409836065574
Accuracy: 0.9737704918032787
```

This approach recovered 578 positive samples out of the 610 (95% recall). Pretty good if you ask me 😊 Let’s take a look at this visually:

Number of positive cases predicted by PU Bagging
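If the chart doesn’t render for you, a quick bar chart along these lines reproduces the comparison using the numbers from the two confusion matrices above (the styling is my own):

```python
import matplotlib.pyplot as plt

# Positives predicted by each approach, out of 610 true positives.
results = {"Standard RF": 310, "PU bagging": 578, "True total": 610}
plt.bar(list(results.keys()), list(results.values()))
plt.ylabel("Positive cases predicted")
plt.title("Number of positive cases predicted")
plt.show()
```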

Imagine implementing this on a huge dataset with millions of unreliably labelled rows. Being able to recover the hidden positives efficiently is a very useful technique to have. Isn’t that cool? 👌🏼