Computational costs of machine learning applications are multiplied by the following:

- Pipelines of several transformations
- Parameter searches over each component of those pipelines

Even though the execution of a particular component may only take a few seconds, multiplying parameter searches over many components can easily blow up the full solution time to several minutes or hours.
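A quick back-of-the-envelope calculation shows how fast this blows up. The step counts and timing below are hypothetical, chosen only to illustrate the multiplication:

```python
# Hypothetical numbers: a 3-step pipeline where each step
# exposes a few candidate parameter values.
n_candidates_per_step = [4, 3, 5]    # parameter choices per pipeline step
seconds_per_fit = 2                  # assume one full pipeline fit takes ~2 s

total_fits = 1
for n in n_candidates_per_step:
    total_fits *= n                  # a grid search tries every combination

print(total_fits)                     # 60 pipeline fits
print(total_fits * seconds_per_fit)   # 120 seconds for a "few-second" task
```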

Fortunately, the individual computations for a single parameter set can be done in an embarrassingly parallel manner. Scikit-learn already supports parallel execution in this manner with joblib.
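"Embarrassingly parallel" here means that each parameter set can be evaluated with no coordination between workers. A minimal sketch of that structure, using the standard library's thread pool rather than joblib itself (the function and grid below are stand-ins, not real scikit-learn calls):

```python
from concurrent.futures import ThreadPoolExecutor

def fit_and_score(params):
    # stand-in for fitting one pipeline with one parameter set
    a, b = params
    return a + b

grid = [(a, b) for a in (1, 2) for b in (10, 20)]

# each parameter set is independent, so all of them
# can be evaluated concurrently with no shared state
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(fit_and_score, grid))

print(scores)  # [11, 21, 12, 22]
```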

However, if we're clever we can identify and reuse identical computations shared across pipelines with different parameter sets. In many cases this leads to striking performance increases that are completely separate from the gains of parallel computing.
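To see the idea in isolation: suppose two parameter searches differ only in a later step, so their first step is identical. A naive search runs that step twice; keying a cache on the step and its inputs runs it once. This toy sketch (all names hypothetical, not dasklearn's code) counts the real executions:

```python
executions = []

def normalize(data):
    executions.append("normalize")      # count real executions
    return [x / max(data) for x in data]

cache = {}

def cached(func, *args):
    # reuse the result when the same step sees the same inputs
    key = (func.__name__, args)
    if key not in cache:
        cache[key] = func(*args)
    return cache[key]

data = (2.0, 4.0, 8.0)

# two pipeline variants that differ only in a later parameter
for later_param in (0.1, 0.2):
    normalized = cached(normalize, data)
    result = [x * later_param for x in normalized]

print(len(executions))  # 1 -- the shared step was computed once, not twice
```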

This post details a trivial copy of scikit-learn's Pipeline object built with dask, and shows how, by paying careful attention to how we name tasks, we can drastically speed up parameter-search computations, even without parallelism.
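Dask represents computations as a plain dictionary mapping task names to tuples of a function and its arguments, and a graph can hold each named task only once. The sketch below mimics that graph format with a tiny hand-rolled executor standing in for dask's schedulers, so it runs without dask installed; it shows how giving the shared steps of two parameter sets the same name means the shared work is computed exactly once:

```python
def get(dsk, key, cache=None):
    """Recursively evaluate `key` from a dask-style graph, caching results."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    func, *args = dsk[key]
    vals = [get(dsk, a, cache) if a in dsk else a for a in args]
    cache[key] = func(*vals)
    return cache[key]

calls = []

def clean(data):
    calls.append("clean")              # count real executions
    return [x * 10 for x in data]

def score(data, k):
    return sum(data) + k

# Two parameter sets ("score-1" and "score-2") share the tasks
# "data" and "clean"; the graph holds each shared task once.
dsk = {
    "data": (lambda: [1, 2, 3],),
    "clean": (clean, "data"),
    "score-1": (score, "clean", 1),    # parameter set 1
    "score-2": (score, "clean", 2),    # parameter set 2
}

cache = {}
r1 = get(dsk, "score-1", cache)
r2 = get(dsk, "score-2", cache)
print(r1, r2)       # 61 62
print(len(calls))   # 1 -- "clean" ran once despite two parameter sets
```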

Code is available here: https://github.com/mrocklin/dasklearn