Master Auto ML in Python — An Overview of the MLBox package 📦

Learn how to use MLBox to quickly and efficiently train an automated machine learning pipeline for a classification problem in Python.


Today’s post is very special. It’s written in collaboration with Axel de Romblay, the author of the MLBox AutoML package, which has gained a lot of popularity in recent years.

If you haven’t heard about this library, go check it out on GitHub: it packs interesting features, is gaining in maturity, and is under active development.

MLBox repo


In this post, we’ll show you how you can easily use it to train an automated machine learning pipeline for a classification problem. The pipeline loads and cleans the data, removes drifting variables, runs a robust hyperparameter optimization, and generates predictions.

Let’s get started! 🚀

1 — Introduction to MLBox

MLBox has been presented at many machine learning meetups. You can check one of the slide decks here. It’s a good starting point for an overview of the library and, more generally, of the AutoML concept.

Slideshare presentation of MLBox

2 — Downloading the train and test datasets

Throughout this notebook, we’ll be solving the famous Titanic Kaggle challenge which consists of predicting the survival of passengers based on their attributes (Sex, Age, Name, etc).

If you’re not familiar with this competition you can check this article.

Let’s now download the data:

From Kaggle if you have an account

If you have a Kaggle account you can generate an API token right here on your profile page:

Once the API token is generated, a kaggle.json file containing your username and key is downloaded to your system.

If you’re on a Unix-based OS, place this file in ~/.kaggle/ and then run:

chmod 600 ~/.kaggle/kaggle.json

If you’re on a Windows machine, set the credentials as environment variables instead (in a command prompt):

set KAGGLE_USERNAME=<your-username>

set KAGGLE_KEY=<your-key>
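Alternatively, once the token is in place, the official Kaggle CLI can fetch the files for you (the titanic competition slug is as shown on the competition page):

```shell
pip install kaggle
kaggle competitions download -c titanic   # requires a valid API token
mkdir -p data
unzip -o titanic.zip -d data              # extracts train.csv and test.csv into data/
```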

From the internet:

Make sure you have wget installed: pip install wget

Then place train.csv and test.csv in the data folder at the root of the project.

3 — Environment setup and installing MLBox from PyPI

Creating a conda virtual environment is recommended, because MLBox pulls in several dependencies that might conflict with your existing libraries. A clean virtual environment is the right solution: if anything goes wrong, you can remove it without impacting your system.

conda create -n automl python=3.7

This creates an environment named automl with Python 3.7 preinstalled.

If you’re on macOS like me, you’ll have to install OpenMP (Open Multi-Processing) support via brew, since LightGBM relies on it for multithreading:

(base) brew install libomp

Now activate automl and install MLBox directly from PyPI:

(base) source activate automl

(automl) pip install mlbox


As you can see, MLBox has quite a lot of dependencies (scikit-learn, pandas, etc.). That’s why we created an empty virtual environment.

[Optional]: accessing the automl kernel from Jupyter.

If you’d like to use Jupyter notebooks with this environment without activating it, by selecting its kernel directly from the base Jupyter dropdown list, you’ll have to install ipykernel:

(automl) conda install ipykernel
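If the automl kernel still doesn’t show up in the dropdown after that, registering it explicitly usually does the trick (the display name below is an arbitrary choice):

```shell
(automl) python -m ipykernel install --user --name automl --display-name "Python (automl)"
```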


Now you’re good to go!

4 — Testing MLBox: from data ingestion to model building

Now we’re going to test and run MLBox to quickly build a model to solve the Kaggle Titanic Challenge.

For more information about the documentation of the package and the API you can visit the following links:

The official repository: https://github.com/AxeldeRomblay/MLBox

The official documentation: https://mlbox.readthedocs.io/en/latest/

Importing MLBox
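Per the official documentation, the standard top-level imports pull in the three modules we’ll use throughout:

```python
# Standard MLBox imports, as shown in the official documentation.
from mlbox.preprocessing import *   # Reader, Drift_thresholder
from mlbox.optimisation import *    # Optimiser
from mlbox.prediction import *      # Predictor
```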

Using TensorFlow backend.

CPU times: user 2.42 s, sys: 740 ms, total: 3.16 s

Wall time: 4.04 s

Inputs to MLBox

If you have a train and a test set, as in any Kaggle competition, you can feed both file paths directly to MLBox, along with the name of the target column.

Otherwise, if fed a train set only, MLBox creates a test set.

Reading and preprocessing

The Reader class of MLBox is in charge of preparing the data.

It provides methods and utilities to:

Read and load the data with the correct separator (CSV, XLS, JSON and HDF5)

Clean the data by:

deleting unnamed columns

inferring column types (float, int, list)

processing dates and extracting relevant information from them: year, month, day, day of week, hour, etc.

removing duplicates

Prepare the train and test splits

More information here: https://mlbox.readthedocs.io/en/latest/features.html#mlbox.preprocessing.Reader
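Assuming train.csv and test.csv sit in a local data/ folder, and knowing that the Titanic target column is Survived, the reading step looks like this (API per the official docs):

```python
from mlbox.preprocessing import Reader

paths = ["data/train.csv", "data/test.csv"]
target_name = "Survived"  # target column of the Titanic dataset

rd = Reader(sep=",")
df = rd.train_test_split(paths, target_name)  # dict with "train", "test" and "target" entries
```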


When this function is done running, it creates a folder named save where it dumps the target encoder for later use.

df["train"].head()


Removing drift

This is an innovative feature I haven’t encountered in other packages. The main idea is to automatically detect and remove variables that have a distribution that is substantially different between the train and the test set.

This happens quite a lot, and we generally talk about biased data. You could, for example, have a situation where the train set contains a population of young people whereas the test set contains elderly people only. This indicates that the age feature is not robust and may lead to poor model performance at test time, so it has to be discarded.

More information:

about the algorithm: https://github.com/AxeldeRomblay/MLBox/blob/master/docs/webinars/features.pdf

about MLBox implementation: https://mlbox.readthedocs.io/en/latest/features.html#mlbox.preprocessing.Drift_thresholder


How does MLBox compute drift for individual variables?

MLBox builds a classifier that separates train from test data. It then uses the ROC score related to this classifier as a measure of the drift.

This makes sense:

If the drift score is high (i.e. the ROC score is high), telling train data from test data is easy, which means that the two distributions are very different.

Conversely, if the drift score is low (i.e. the ROC score is low), the classifier is not able to separate the two distributions correctly, which means they are similar.
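This adversarial-validation idea is easy to reproduce outside MLBox. Here is a minimal sketch with scikit-learn (not MLBox’s actual code): the cross-validated ROC AUC of a classifier trained to tell the two samples apart serves as the drift score for a single feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def drift_score(train_col, test_col, seed=0):
    """ROC AUC of a classifier separating train rows from test rows on one feature."""
    X = np.concatenate([train_col, test_col]).reshape(-1, 1)
    y = np.concatenate([np.zeros(len(train_col)), np.ones(len(test_col))])
    clf = RandomForestClassifier(n_estimators=30, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

rng = np.random.RandomState(0)
same = drift_score(rng.normal(0, 1, 500), rng.normal(0, 1, 500))     # ~0.5: no drift
shifted = drift_score(rng.normal(0, 1, 500), rng.normal(3, 1, 500))  # ~1.0: strong drift
```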

MLBox provides a class called Drift_thresholder that takes as input the train and test sets as well as the target and computes a drift score of each one of the variables.

Drift_thresholder then deletes the variables whose drift score is higher than a threshold (0.6 by default).
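Per the documentation, removing the drifting variables is a two-liner (df being the dict produced by the reading step):

```python
from mlbox.preprocessing import Drift_thresholder

dft = Drift_thresholder(threshold=0.6)  # 0.6 is the default threshold
df = dft.fit_transform(df)              # drops variables whose drift score exceeds it
```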


As you can see here, Name, PassengerId and Ticket get removed because of their respective drift scores. If you think about it, this is not surprising at all: given their nature, these variables can take almost any value, resulting in a plausible drift between their train and test distributions.

The heavy lifting: optimizing

This section performs the optimization of the pipeline and tries different configurations of the parameters:

NA encoder (missing values encoder)

CA encoder (categorical features encoder)

Feature selector (OPTIONAL)

Stacking estimator — feature engineer (OPTIONAL)

Estimator (classifier or regressor)

More details here: https://mlbox.readthedocs.io/en/latest/features.html#mlbox.optimisation.Optimiser

We first instantiate the Optimiser class:

opt = Optimiser()

Then we can run it with the default configuration (a LightGBM model), without any AutoML or complex grid search.

This gives us a first baseline:
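Per the documentation, passing None as the parameter set evaluates the default pipeline; the scoring metric and fold count below are my own choices:

```python
from mlbox.optimisation import Optimiser

opt = Optimiser(scoring="neg_log_loss", n_folds=5)
score = opt.evaluate(None, df)  # default LightGBM pipeline, no tuning
```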


The baseline negative log loss is -0.6325.

Let’s now define a space of multiple configurations:

ne__numerical_strategy: how to handle missing data in numerical features

ce__strategy: how to handle categorical variables encoding

fs: feature selection

stck: meta-features stacker

est: final estimator
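In code, such a space is expressed as a dict of hyperopt-style entries; the particular values below are illustrative, following the format in the official docs:

```python
space = {
    "ne__numerical_strategy": {"search": "choice", "space": [0, "mean"]},
    "ce__strategy": {"search": "choice",
                     "space": ["label_encoding", "random_projection", "entity_embedding"]},
    "fs__strategy": {"search": "choice", "space": ["variance", "rf_feature_importance"]},
    "fs__threshold": {"search": "uniform", "space": [0.01, 0.3]},
    "est__strategy": {"search": "choice", "space": ["LightGBM"]},
    "est__max_depth": {"search": "choice", "space": [3, 5, 7]},
}

# `opt.optimise` searches the space and returns the best configuration found.
params = opt.optimise(space, df, max_evals=15)
```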

Let’s now evaluate this model:

opt.evaluate(params, df)


Running this pipeline resulted in a higher (i.e. closer to zero) negative log loss, which is better.

There’s very good potential for further improvement if we define a better search space, try other stacking operations, and maybe other feature selection techniques.

5 — Running predictions

Now we fit the optimal pipeline and predict on our test dataset.

More details here: https://mlbox.readthedocs.io/en/latest/features.html#mlbox.prediction.Predictor
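Per the documentation, this step is handled by the Predictor class, which refits the best pipeline on the whole train set and writes the predictions to the save folder:

```python
from mlbox.prediction import Predictor

# `params` is the best configuration returned by the optimisation step.
prd = Predictor()
prd.fit_predict(params, df)
```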

6 — Conclusion

Running an AutoML pipeline has never been easier. With MLBox, you can do it quickly and efficiently, so that you can focus on what matters when solving a business problem:

Understanding the problem

Acquiring and consolidating the right data

Formalizing the performance metrics to reach and compute

Let’s hope these first three steps don’t get automated anytime soon :)

We hope you liked this library. Don’t hesitate to give it a star on GitHub or report issues to its author.