Build pipelines with Pandas using “pdpipe”

We show how to build intuitive and useful pipelines with Pandas DataFrames using a wonderful little library called pdpipe.

Introduction

Pandas is an amazing library in the Python ecosystem for data analytics and machine learning. It forms the perfect bridge between the data world, where Excel/CSV files and SQL tables live, and the modeling world, where Scikit-learn or TensorFlow perform their magic.

A data science flow is most often a sequence of steps: datasets must be cleaned, scaled, and validated before they are ready to be used by that powerful machine learning algorithm.

These tasks can, of course, be done with the many single-step functions and methods offered by packages like Pandas, but a more elegant way is to use a pipeline. In almost all cases, a pipeline reduces the chance of error and saves time by automating repetitive tasks.

In the data science world, great examples of packages with pipeline features are dplyr in the R language and Scikit-learn in the Python ecosystem.

Following is a great article about their use in a machine-learning workflow.

Pandas also offers a .pipe method, which can be used for similar purposes with user-defined functions. However, in this article, we are going to discuss a wonderful little library called pdpipe, which specifically addresses this pipelining issue with Pandas DataFrames.
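
For comparison, here is a minimal sketch of what chaining user-defined functions with the built-in .pipe method can look like. The toy data, the column names, and the two helper functions (drop_missing and add_price_per_area) are hypothetical and only serve as an illustration; they are not taken from the dataset used later in this article.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "price": [250000.0, None, 410000.0, 180000.0],
    "area": [1200, 1500, 2100, 900],
})


def drop_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Remove rows that contain any missing value."""
    return frame.dropna()


def add_price_per_area(frame: pd.DataFrame, price_col: str, area_col: str) -> pd.DataFrame:
    """Return a copy with a derived price-per-unit-area column."""
    return frame.assign(price_per_area=frame[price_col] / frame[area_col])


# Each .pipe call passes the DataFrame as the first argument of the function;
# extra keyword arguments are forwarded to the function unchanged.
result = (
    df.pipe(drop_missing)
      .pipe(add_price_per_area, price_col="price", area_col="area")
)
print(result)
```

Each step is an ordinary function that takes a DataFrame and returns a new one, so the chain reads top to bottom in the order the steps are applied and the original DataFrame is left unchanged.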

Pipelining with Pandas

The example Jupyter notebook can be found here in my GitHub repo. Let’s see how we can build useful pipelines with this library.

The dataset

For demonstration purposes, we will use a dataset of US housing prices (downloaded from Kaggle). We can load the dataset in Pandas and show its summary statistics as follows: