In this article I will try to show you the advantages of using pipelines when you are tuning the hyper-parameters of your models.

We are going to use Kaggle’s What’s Cooking? competition as a classic supervised multiclass classification problem: every recipe belongs to exactly one cuisine.

Data

The training data is composed of 39774 recipes in the form of:

{
    'id': 10259,
    'cuisine': 'greek',
    'ingredients': [
        'romaine lettuce',
        'black olives',
        'grape tomatoes',
        'garlic',
        'pepper',
        'purple onion',
        'seasoning',
        'garbanzo beans',
        'feta cheese crumbles'
    ]
}

Evaluation

Our job is to train a model to recognize which cuisine a recipe belongs to based on its ingredients. I am not going to build a kernel from start to finish. Instead, I will try to demonstrate how adding pipelines to your workflow can help you fine-tune your models.

Plan

We are going to use TfidfVectorizer to convert a collection of ingredients to a matrix of TF-IDF features. Then we are going to use GridSearchCV to fine-tune our models.

Very brief introduction to main components

What is TF-IDF?

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Basically, TF-IDF is a smart way of rating the importance of a word inside a document: a word scores high when it appears often in that document but rarely in the rest of the collection. In our case it will let us identify the relative differences between recipes. We will use scikit-learn's built-in TfidfVectorizer.
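To make the idea concrete, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, which is the formula scikit-learn documents for its default settings. The function name tfidf and the three toy recipes are made up for illustration, and the sketch skips the tokenization rules and the row normalization that TfidfVectorizer applies by default.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (no normalization)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency inside this document
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy corpus: "garlic" is everywhere, "feta" is rare.
recipes = [
    "garlic pepper feta olives",
    "garlic pepper salt eggs",
    "garlic soy sauce ginger",
]
scores = tfidf(recipes)
for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the first recipe the rare ingredients (feta, olives) end up with the largest weights, while garlic, which appears in every recipe, gets the smallest one.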

What is a Pipeline?

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’.

Pipeline is a utility that provides a way to automate a machine learning workflow. It lets you sequentially apply a list of transformers followed by a final estimator. Transformers can be custom or built-in, as we will see below.
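The mechanics can be pictured with a toy version. This is only a sketch of the idea, not scikit-learn's implementation: TinyPipeline, LowercaseSplitter and TokenVoteClassifier are hypothetical names invented for this example. Every step except the last is fitted and used to transform the data, and the last step is fitted on the result.

```python
from collections import Counter, defaultdict


class LowercaseSplitter:
    """Transformer: lowercases each string and splits it into tokens."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return [text.lower().split() for text in X]


class TokenVoteClassifier:
    """Final estimator: each known token votes for the labels it was seen with."""

    def fit(self, X, y):
        self.votes_ = defaultdict(Counter)
        for tokens, label in zip(X, y):
            for token in tokens:
                self.votes_[token][label] += 1
        return self

    def predict(self, X):
        predictions = []
        for tokens in X:
            tally = Counter()
            for token in tokens:
                tally.update(self.votes_.get(token, {}))
            predictions.append(tally.most_common(1)[0][0] if tally else None)
        return predictions


class TinyPipeline:
    """Applies every step but the last as a transformer, then the final estimator."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y=None):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)  # reuse what was learned during fit
        return self.steps[-1][1].predict(X)


pipe = TinyPipeline([("split", LowercaseSplitter()), ("clf", TokenVoteClassifier())])
pipe.fit(["Feta Olives", "Soy Ginger"], ["greek", "chinese"])
prediction = pipe.predict(["FETA salad"])
print(prediction)
```

The caller only ever talks to the pipeline object; the intermediate representations never leak out, which is what makes it possible to cross-validate all the steps together.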

What is a GridSearchCV?

Exhaustive search over specified parameter values for an estimator.

GridSearchCV provides a way to test various values for hyper-parameters. You can cross-validate many different combinations of hyper-parameters to find the one set that yields the best score.
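"Exhaustive" simply means the Cartesian product of all the value lists. The sketch below enumerates the candidates with itertools.product; the helper name expand_grid, the toy grid and the fold count are assumptions made for illustration and are not part of scikit-learn.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of the values in a {name: [values]} grid."""
    names = sorted(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


# Hypothetical grid: 3 values for alpha times 2 values for fit_prior.
grid = {"alpha": [0.5, 1.0, 1.5], "fit_prior": [True, False]}
candidates = expand_grid(grid)

n_folds = 3  # assumed number of cross-validation folds
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

Every candidate is trained and scored once per fold, so the cost grows multiplicatively with each hyper-parameter you add to the grid.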

Let’s get to it!

The data in its current form isn’t suitable for TfidfVectorizer, which expects one string per document. We can create a custom transformer to convert it to the right form.

from sklearn.base import BaseEstimator, TransformerMixin


class DocumentsExtractor(TransformerMixin, BaseEstimator):
    """Turn each recipe dict into one space-separated string of ingredients."""

    def __init__(self, verbose=False):
        self.verbose = verbose

    def fit(self, X, y=None):
        # Nothing to learn here; verbose only demonstrates a tunable parameter.
        if self.verbose:
            print("Verbose mode on!")
        return self

    def transform(self, X, y=None):
        return [" ".join(item['ingredients']) for item in X]

I will get back to the verbose hyper-parameter in a minute. Please ignore it for now.

DocumentsExtractor is now able to transform the original data

import json

with open('./data/train.json') as f:
    train = json.load(f)

de = DocumentsExtractor()
de.fit_transform(train)

into a list of ingredient strings, one per recipe:

[
    'romaine lettuce black olives grape tomatoes garlic pepper purple onion seasoning garbanzo beans feta cheese crumbles',
    'plain flour ground pepper salt tomatoes ground black pepper thyme eggs green tomatoes yellow corn meal milk vegetable oil',
    'eggs pepper salt mayonaise cooking oil green chilies grilled chicken breasts garlic powder yellow onion soy sauce butter chicken livers',
    ...
]

So we now have 39774 strings ready to be passed into the TfidfVectorizer for further transformation.

But we can do better.

Our first Pipeline

Instead of repeating this process every time and then running TfidfVectorizer by hand, we are going to write a simple pipeline.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_pipeline = Pipeline([
    ('doc_extractor', DocumentsExtractor()),
    ('tfidf_vectorizer', TfidfVectorizer())
])

Passing the original data to the pipeline first turns each recipe into a string of ingredients (DocumentsExtractor) and then runs the TF-IDF algorithm (TfidfVectorizer) to produce a matrix of TF-IDF features.

To transform our original data to a matrix of TF-IDF features we can run

tfidf_pipeline.fit_transform(train)

To quickly transform our test data:

with open('./data/test.json') as f:
    test = json.load(f)

# transform only: reuse the vocabulary and IDF weights learned from train
tfidf_pipeline.transform(test)

The train and test data are now matrices of TF-IDF features, ready to be used by models.
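Why fit_transform on the training data but only transform on the test data? Fitting learns the vocabulary, and the test data must be encoded with that same vocabulary so the columns line up. The toy class below is a hypothetical stand-in that counts words instead of computing TF-IDF; it is a sketch of the principle, not of how TfidfVectorizer is implemented.

```python
class CountingVectorizer:
    """Toy vectorizer: learns a vocabulary in fit, counts known words in transform."""

    def fit(self, documents):
        vocab = sorted({token for doc in documents for token in doc.split()})
        self.vocabulary_ = {token: index for index, token in enumerate(vocab)}
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for token in doc.split():
                index = self.vocabulary_.get(token)
                if index is not None:  # words unseen during fit are ignored
                    row[index] += 1
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)


train_docs = ["garlic pepper", "garlic feta"]
test_docs = ["feta truffle"]  # "truffle" never appears in the training data

vec = CountingVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns: feta, garlic, pepper
test_matrix = vec.transform(test_docs)        # same three columns

print(train_matrix)
print(test_matrix)
```

Both matrices have the same three columns. Calling fit_transform on the test data instead would build a different vocabulary, and the model's learned weights would no longer match the columns.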

But we can do better.

Training

The first classifier we will train is a multinomial Naive Bayes classifier, MultinomialNB. We are going to create a second pipeline, which is going to use the first pipeline as one of its steps! Then we will cross-validate different combinations of hyper-parameters with GridSearchCV.

from sklearn.naive_bayes import MultinomialNB

mnb_pipeline = Pipeline([
    ('tfidf_pipeline', tfidf_pipeline),
    ('mnb', MultinomialNB())
])

This pipeline will run the first pipeline and then the MultinomialNB classifier: it converts the raw data to a matrix of TF-IDF features and passes that matrix to MultinomialNB for training.

We can set up GridSearchCV to run mnb_pipeline on all the combinations in grid_params.

from sklearn.model_selection import GridSearchCV

y_train = [item['cuisine'] for item in train]

grid_params = {
    ...
}

clf = GridSearchCV(mnb_pipeline, grid_params)
clf.fit(train, y_train)

print("Best Score: ", clf.best_score_)
print("Best Params: ", clf.best_params_)

If you have used MultinomialNB before, you know that you can tweak a few hyper-parameters: alpha or fit_prior, for example.

So you could set up grid_params in the above snippet as:

import numpy as np

grid_params = {
    'mnb__alpha': np.linspace(0.5, 1.5, 6),
    'mnb__fit_prior': [True, False],
}

The above snippet would run all the possible combinations of hyper-parameters alpha and fit_prior and print Best Score and Best Params.

But if you check the documentation for TfidfVectorizer, you will see many hyper-parameters that you can tweak there: max_df, binary and norm, just to name a few. DocumentsExtractor has one hyper-parameter as well: verbose. Maybe changing how you prepare the data would affect the model's accuracy? Can you do this? Can you tweak the hyper-parameters for any part of your pipeline and check all the possible combinations?

Yes You Can.

import numpy as np

y_train = [item['cuisine'] for item in train]

grid_params = {
    'mnb__alpha': np.linspace(0.5, 1.5, 6),
    'mnb__fit_prior': [True, False],
    # The prefix must match the step name used in mnb_pipeline: 'tfidf_pipeline'.
    'tfidf_pipeline__tfidf_vectorizer__max_df': np.linspace(0.1, 1, 10),
    'tfidf_pipeline__tfidf_vectorizer__binary': [True, False],
    'tfidf_pipeline__tfidf_vectorizer__norm': [None, 'l1', 'l2'],
}

# cv=3 and verbose=1 reproduce the progress line shown below; recent
# scikit-learn releases default to 5 folds and print nothing.
clf = GridSearchCV(mnb_pipeline, grid_params, cv=3, verbose=1)
clf.fit(train, y_train)

print("Best Score: ", clf.best_score_)
print("Best Params: ", clf.best_params_)

This snippet will print

Fitting 3 folds for each of 720 candidates, totalling 2160 fits

That is 720 combinations of parameters for the model and the data preparation step, each cross-validated on 3 folds. Your pipeline will be trained and evaluated 2160 times, all in one go.

Please note that you don’t only have access to the hyper-parameters of your final estimator; you can reach deep down into your pipeline.

The key thing to keep in mind here is the naming convention of the elements in your pipeline. Every level of the pipeline, and finally the parameter name, must be separated by ‘__’, and each part must match the step name exactly. Hence ‘tfidf_pipeline__tfidf_vectorizer__norm’ means: in the step tfidf_pipeline, in its step tfidf_vectorizer, the hyper-parameter norm.

The pipelines we have here are very simple, but they can get much more sophisticated. You would still be able to tap into specific hyper-parameters easily thanks to the power of pipelines.
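One way to picture the double-underscore convention is as a path through nested dictionaries. The sketch below is only an analogy: set_param and pipeline_layout are hypothetical names, and the layout mirrors the step names used in this article rather than anything scikit-learn exposes in this form.

```python
def set_param(layout, name, value):
    """Follow a 'step__substep__param' path and set the final parameter."""
    *path, attribute = name.split("__")
    target = layout
    for step_name in path:
        target = target[step_name]  # KeyError if a step name is misspelled
    if attribute not in target:
        raise KeyError(attribute)
    target[attribute] = value


# Hypothetical mirror of mnb_pipeline: step names map to their parameters.
pipeline_layout = {
    "tfidf_pipeline": {
        "doc_extractor": {"verbose": False},
        "tfidf_vectorizer": {"max_df": 1.0, "binary": False, "norm": "l2"},
    },
    "mnb": {"alpha": 1.0, "fit_prior": True},
}

set_param(pipeline_layout, "tfidf_pipeline__tfidf_vectorizer__norm", "l1")
set_param(pipeline_layout, "mnb__alpha", 0.5)

print(pipeline_layout["tfidf_pipeline"]["tfidf_vectorizer"]["norm"])
print(pipeline_layout["mnb"]["alpha"])
```

A prefix that does not match a step name exactly, for example tfidf_pip instead of tfidf_pipeline, has nowhere to go in this lookup, which is why the names in your grid have to match the names in your pipeline character for character.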

OK, you said there are advantages due to automation

Sure. Let's say we want to train another model, a LogisticRegression classifier. We know that we want to explore different combinations of the penalty, C and max_iter hyper-parameters. Plus, you have heard that in your particular scenario it is always better to keep the binary hyper-parameter set to True and not to worry about the norm hyper-parameter while building your matrix of TF-IDF features. Here you go:

import numpy as np
from sklearn.linear_model import LogisticRegression

y_train = [item['cuisine'] for item in train]

lr_pipeline = Pipeline([
    ('tfidf_pipeline', tfidf_pipeline),
    # solver='saga' is an addition to the original snippet: the default
    # 'lbfgs' solver in recent scikit-learn rejects penalty='l1'.
    ('lr', LogisticRegression(solver='saga'))
])

grid_params = {
    'lr__penalty': ['l1', 'l2'],
    'lr__C': [1, 5, 10],
    'lr__max_iter': [20, 50, 100],
    'tfidf_pipeline__tfidf_vectorizer__max_df': np.linspace(0.1, 1, 10),
    'tfidf_pipeline__tfidf_vectorizer__binary': [True],
}

clf = GridSearchCV(lr_pipeline, grid_params)
clf.fit(train, y_train)

print("Best Score: ", clf.best_score_)
print("Best Params: ", clf.best_params_)

With a bit of copy-paste and two minutes spent choosing hyper-parameters, you are now training a new model with full-stack hyper-parameter tuning. Can it get better?

Conclusions

Thank you very much if you got this far. Please let me know what you think. Was it helpful? Have I made a mistake somewhere? Are you going to try it on your next project? Would you like to hear more about any specific part of this article? Have a good day!