Real-world data often contains columns of heterogeneous types. When processing the data before applying the final prediction model, we typically want to use different preprocessing steps and transformations for those different types of columns.

A simple example: we may want to scale the numerical features and one-hot encode the categorical features.

Up to now, scikit-learn did not provide a good solution for this out of the box. You could do the preprocessing beforehand using, for example, pandas, or you could select subsets of columns and apply different transformers to them manually. But that does not easily allow you to put those preprocessing steps in a scikit-learn Pipeline, which can be important to avoid data leakage or to do a grid search over preprocessing parameters.
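To make the goal concrete, here is a minimal sketch of scaling numeric columns and one-hot encoding a categorical column inside a single Pipeline, using the ColumnTransformer introduced in scikit-learn 0.20 (the dataset and column names are made up for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with mixed column types (hypothetical example).
X = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000.0, 52000.0, 71000.0, 88000.0],
    "city": ["Paris", "London", "Paris", "Berlin"],
})
y = [0, 0, 1, 1]

# Scale the numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(), ["city"]),
])

# Because the preprocessing lives inside the Pipeline, cross-validation
# and grid search fit it on training folds only, avoiding data leakage.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(X, y)
predictions = model.predict(X)
```

With this setup, a grid search over preprocessing parameters becomes possible with names like `preprocess__num__with_mean`.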