
TL;DR

HandySpark is a Python package designed to improve PySpark user experience, especially when it comes to exploratory data analysis, including visualization capabilities and, now, extended evaluation metrics for binary classifiers.

Try it yourself using Google Colab:

Check the repository:

Introduction

In my previous post, I introduced HandySpark, a package for PySpark I developed to help close the gap between pandas and Spark dataframes.

Today, I am pleased to announce the release of a new version which not only solves some performance issues with stratified operations (they should be several times faster now!), but also makes evaluating binary classifiers much easier.

A Binary Classification Task

We’ll use a simple binary classification task to illustrate the extended evaluation capabilities provided by HandySpark: predicting passenger survival using the Titanic dataset, one more time.

Let’s first set everything up and load our data:

Setting everything up and loading the Titanic dataset :-)

To evaluate a model, we need to train it first. To train a model, we need to clean the dataset up first. So, let’s start with that!

Cleaning Data with HandySpark

We know there are missing values for Age, Cabin, and Embarked. You can check this out easily using the isnull method.
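If you want to see the idea behind a per-column missing-value count in isolation, it boils down to counting empty entries per column. Here is a minimal plain-Python sketch (the rows below are made up for illustration; HandySpark does this over a Spark DataFrame for you):

```python
# Toy rows standing in for Titanic records (values are made up)
rows = [
    {'Age': 22.0, 'Cabin': None, 'Embarked': 'S'},
    {'Age': None, 'Cabin': 'C85', 'Embarked': None},
    {'Age': 35.0, 'Cabin': None, 'Embarked': 'S'},
]

# Count missing (None) values per column, the kind of summary isnull reports
missing = {col: sum(1 for r in rows if r[col] is None) for col in rows[0]}
print(missing)  # {'Age': 1, 'Cabin': 2, 'Embarked': 1}
```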

To keep our model and pipeline as simple as possible, let’s stick with numeric variables only, like Age, Fare, SibSp, and Parch.

For imputing missing values for Age, we could just use a simple mean for every missing value, right? But do you really think the ages of men and women from first, second, and third class were similar?

Let’s check it out using the stratify operation, just like this:

hdf.stratify(['Pclass', 'Sex']).cols['Age'].mean()

Pclass  Sex
1       female    34.611765
        male      41.281386
2       female    28.722973
        male      30.740707
3       female    21.750000
        male      26.507589
Name: Age, dtype: float64

There’s clearly a difference… women were younger than men and the lower the class, the younger the passengers. Not surprising, I would say…

What about outliers? We can use Tukey’s fences method to identify and then fence values considered extreme. How many outliers do we have for Fare?

hdf.cols['Fare'].outliers(k=3)

Fare    53
Name: outliers, dtype: int64

Please keep in mind that Tukey’s fences can be quite sensitive — any value more than k times the inter-quartile range (IQR) below the first quartile or above the third quartile is considered extreme. In our case, it resulted in 53 outliers! You can try different values for k to calibrate it, though…
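For intuition, here is a minimal plain-Python sketch of Tukey’s fences (the fare values are made up for illustration; HandySpark computes this for you over a Spark DataFrame):

```python
def tukey_fences(values, k=1.5):
    # Tukey's fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are outliers
    s = sorted(values)

    def quantile(q):
        # Linear interpolation between closest ranks
        pos = (len(s) - 1) * q
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up fares, including one extreme value
fares = [7.25, 7.9, 8.05, 13.0, 26.0, 52.0, 512.33]
lower, upper = tukey_fences(fares, k=3)
outliers = [f for f in fares if f < lower or f > upper]
print(outliers)  # [512.33]
```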

So, let’s use what we found about the data to clean it up. First, we fill missing Age values with the average for the corresponding Pclass and Sex. Next, we fence Fare values using Tukey’s method:

Cleaning up!
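Conceptually, the stratified fill in the gist above boils down to computing per-group means and imputing from the matching group. A minimal plain-Python sketch, using toy data rather than the actual dataset:

```python
from collections import defaultdict

# Toy passenger records (made up); None marks a missing Age
passengers = [
    {'Pclass': 1, 'Sex': 'female', 'Age': 30.0},
    {'Pclass': 1, 'Sex': 'female', 'Age': 38.0},
    {'Pclass': 1, 'Sex': 'female', 'Age': None},
    {'Pclass': 3, 'Sex': 'male',   'Age': 26.0},
    {'Pclass': 3, 'Sex': 'male',   'Age': None},
]

# Collect observed ages per (Pclass, Sex) stratum
ages = defaultdict(list)
for p in passengers:
    if p['Age'] is not None:
        ages[(p['Pclass'], p['Sex'])].append(p['Age'])

# Per-stratum means, then impute each missing value from its own stratum
means = {key: sum(vals) / len(vals) for key, vals in ages.items()}
for p in passengers:
    if p['Age'] is None:
        p['Age'] = means[(p['Pclass'], p['Sex'])]

print(passengers[2]['Age'], passengers[4]['Age'])  # 34.0 26.0
```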

Building a Model

Once we have a clean dataset, we can build a simple classifier to predict whether a passenger survived the Titanic disaster. Let’s use Spark’s RandomForestClassifier for this task.

But remember: Spark ML algorithms need all features neatly assembled into a feature vector. Moreover, this feature vector does not accept missing values.

How to handle this? We can simply use HandySpark’s imputer and fencer methods to create the corresponding transformers for filling missing values and fencing outliers. Then we add these transformers to our pipeline and we are good to go!

Training a classifier and making predictions

What do our predictions look like? Let’s make the DataFrame a HandyFrame and look at our labels and predictions generated by the classifier:

predictions.toHandy().cols[['probability', 'prediction', 'Survived']][:5]

First 5 rows from our predictions

The probability column contains a vector with probabilities associated with classes 0 and 1, respectively. For evaluation purposes, we need the probability of being a positive case, thus we should look at the second element of the probability vector. The prediction column shows us the corresponding class, assuming a threshold of 0.5.
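In other words, turning the probability vector into a predicted class is just a threshold comparison on its second element. Here is the idea in plain Python (the probabilities are illustrative):

```python
# probability = [P(class 0), P(class 1)], mirroring Spark's output format
probability = [0.35, 0.65]
threshold = 0.5

p_positive = probability[1]  # probability of being a positive case
prediction = 1.0 if p_positive > threshold else 0.0
print(prediction)  # 1.0
```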

Evaluating a Model

How good is our model? We could just compute the proportion of rows with matching prediction and Survived columns; that would be our accuracy. It turns out, though, that accuracy is not such a great metric for evaluating a binary classifier, especially when the classes are imbalanced.

To really tell how good our model is, we need other metrics, such as true positive rate (TPR, also known as recall), false positive rate (FPR), and precision, all of which vary with the threshold we choose to turn predicted probabilities into predicted classes (0 or 1).

If we go over every possible threshold and compute these metrics, we can build both Receiver Operating Characteristic (ROC) curve and Precision-Recall (PR) curve for a given model.

This brings up another question: how do we compare two models using these curves? They may cross each other at different points, right? One way of tackling this issue is to look at the area under the curve: the bigger the area, the better the model, roughly speaking. Thus we can compute the Area Under the ROC curve (AUROC, ROC AUC, or sometimes just AUC) and the Area Under the PR curve (PR AUC).
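To make the “sweep every threshold” idea concrete, here is a plain-Python sketch that builds ROC points and integrates the area with the trapezoid rule (the scores are tiny made-up examples, just for illustration):

```python
def roc_auc(scores, labels):
    # Sweep every distinct score as a threshold, collect (FPR, TPR) points,
    # then integrate the area under the curve with the trapezoid rule
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    points.append((1.0, 1.0))
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Perfectly separated toy scores give the maximum possible area of 1.0
print(roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```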

Evaluating a Model with PySpark

We have both good and bad news… The good news: PySpark gives us both ROC AUC and PR AUC. The bad news: PySpark gives us ONLY that :-(

Using PySpark Evaluator to get ROC AUC and PR AUC

What if we want to experiment with different thresholds? HandySpark to the rescue :-)

Evaluating a Model with HandySpark

HandySpark extends PySpark’s BinaryClassificationMetrics, since its Java counterpart already has several methods for retrieving metrics and thresholds, which are now exposed by HandySpark. There are newly implemented methods, too.

HandySpark also makes it possible to use a DataFrame containing predicted probabilities and labels as argument, as shown in the gist below:

Plotting ROC and PR curves, getting metrics by threshold and confusion matrices with HandySpark!

Let’s dig into all these new possibilities for evaluating a model.

Plotting Curves

An image is worth a thousand words! So, let’s start with the plots of both curves using plot_roc_curve and plot_pr_curve methods. It is as simple as that:

fig, axs = plt.subplots(1, 2, figsize=(12, 4))
bcm.plot_roc_curve(ax=axs[0])
bcm.plot_pr_curve(ax=axs[1])

ROC and PR curves

Voilà! Now we can tell that, if we are willing to accept a False Positive Rate of 20%, we’ll get a True Positive Rate above 60%. Good enough? Cool!

Which threshold should we use to achieve this, you ask?

Thresholds and Metrics

We can get all thresholds and corresponding metrics using the getMetricsByThreshold method. It returns a Spark DataFrame, which we can then filter for the metric we’re interested in (FPR between 19% and 21%, in our case):

bcm.getMetricsByThreshold().filter('fpr between 0.19 and 0.21').toPandas()

Thresholds for FPR between 19% and 21%

We need the False Positive Rate to be at most 20%, so the corresponding threshold is 0.415856. This will give us a True Positive Rate (Recall) of 68.1% and a Precision of 68.3%.

We could also look at the confusion matrix built using that particular threshold, which yields the metrics we just got.

Confusion Matrix

Confusion matrices can be, er… confusing! ;-) To minimize misinterpretation, I’ve labeled both columns and rows, so you don’t need to guess which class comes first, or which values are predicted and which are actual.

Just call print_confusion_matrix with your chosen threshold; it’s as simple as that:

bcm.print_confusion_matrix(.415856)

Confusion Matrix — not so confusing anymore :-)
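The arithmetic behind the table is simple enough to sketch in plain Python: count each actual/predicted combination at a chosen threshold (the scores below are made up for illustration):

```python
def confusion_matrix(scores, labels, threshold):
    # Returns (tn, fp, fn, tp): counts of each actual/predicted combination
    tn = fp = fn = tp = 0
    for score, actual in zip(scores, labels):
        predicted = 1 if score > threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up predicted probabilities and true labels
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
print(confusion_matrix(scores, labels, threshold=0.5))  # (2, 1, 1, 2)
```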

Final Thoughts

My goal is to improve PySpark user experience, making it easier to perform data cleaning and model evaluation. Needless to say, this is a work in progress, and I have many more improvements already planned.

If you are a Data Scientist using PySpark, I hope you give HandySpark a try and let me know your thoughts on it :-)

If you have any thoughts, comments or questions, please leave a comment below or contact me on Twitter.