Image courtesy of TopTal.

This series, if and when finished, would likely be split into five parts:

Part 1: The Data (collecting, cleaning, feature engineering)

Part 2: Building a model (algorithms, cross-validation, ensembles)

Part 3: Streaming (Spark, Kafka)

Part 4: Production Ready

Part 5: Visualization and Model Management

In this part of the series, we’ll train and evaluate the model that will do the actual classification. We will be using a technique called “blending,” though in recent years the terminology for this technique has been debated with the fervor of Vim vs. Emacs, Firefox vs. Internet Something, etc. For the rest of this article I’ll describe the technique we’ll be using as “blending,” since this series is geared more towards juniority than seniority.

Briefly, I’ll try to distinguish between the terminology as best I can, but as a junior, this really doesn’t matter; it’s the actual result that matters, not what you call it (though at Royal Illogical University it’s apparently the difference between life and death, so maybe follow the links if need be).

Terminology disclaimer for Ensemble, Blending, Stacking, and Stacked Generalization

Stacking and stacked generalization are the same thing. The term was first coined by Wolpert in 1992, who described having a bunch of classifiers that were each “[trained with a different subset] of the learning set and trying to guess the rest of it,” and finally (from Wikipedia) “a logistic regression model is often used as the combiner” for the output of the individual models.

Blending was (as far as I know) first coined by the winning team of the Netflix Prize; simply put, the difference from stacking is that each individual classifier is trained using the same training data instead of different folds. You would still have a “holdout set” of the original data, but each individual classifier trains on the same data. Each trained classifier then predicts on the holdout set, and the combined output of those predictions becomes the training data for the “blender.”
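As a minimal sketch of that split (X and y are assumed to be the features and labels from part 1; the variable names are illustrative, not the series’ actual code):

from sklearn.model_selection import train_test_split

# Blending: one fixed split. Every level-1 classifier sees the same
# training data; the holdout set is reserved for generating the
# blender's training examples.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42)

# (Stacking, by contrast, would use cross-validation folds here so that
# every training example eventually gets an out-of-fold prediction.)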

An ensemble is simply a way of combining the results of multiple models, which is what both of the above techniques do.

Disclaimer: the above two definitions will differ depending on the literature and publication date, but in practice (which this series focuses on), it does not matter. In academia it does matter, because research is advanced by “standing on the shoulders of giants,” while in industry it’s more often the case that you will use what works and “trust your metrics.”

If you wish to know why these techniques are so effective, there is a superb article over at the Kaggle Blog that is very much a recommended read.

Overview of our ensemble

The blender we’re building is a simple logistic regression algorithm that will form the final prediction from the outputs of a few individual classifiers. Most algorithms used in this article are from Scikit-learn.

First, we need to pre-process the data we gathered in part 1 of this series, to get numerical values that the algorithms can operate on. We’ll be using a short pipeline consisting of a stemmer, TF-IDF vectorizer and a dense transformer.

Stemming

Stemming is the process of reducing derived/inflected/conjugated words to their word stems, since more often than not (especially in English), the stems of the words contain enough information for the models to guess the meaning of a sentence from context. If we skipped stemming, our corpora would often have to be very, very large to account for all the variations. If we’re building an NLP classifier to be trained on the complete archive of the Library of Congress, stemming might actually reduce accuracy, since the corpus is big enough. If you don’t have a Cray/Sunway in your closet or work at FAMGA, it’s often worth doing stemming unless you have a reason not to.
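For example, NLTK’s Snowball stemmer (the one we’ll use below) collapses several inflected forms into a single stem:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
for word in ['connection', 'connected', 'connecting']:
    print(stemmer.stem(word))   # all three reduce to 'connect'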

TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency, and is a fancy term for giving a number to the importance of words in your corpora (usually called a Vector Space Model).

If a word appears often in one corpus (e.g. one of the forum posts in a thread in your site’s forum) but not so often in the corpora (i.e. it’s not a word people often use in other posts on the forum), a high numerical score will be given for this word. A high score means this word is likely relevant, since it is not often used elsewhere. E.g., the word “kernel” might often be used in threads talking about Linux, but likely not often used in threads about cars.
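As a small illustration (toy documents, not our forum data), sklearn’s TfidfVectorizer gives the rarer, more distinctive word a higher weight within its document:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'the kernel panicked after the update',   # a "Linux" post
    'the car stalled after the update',       # a "car" post
    'the update broke the radio in the car',
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_
# 'kernel' appears in only one document, so its weight there is higher
# than that of 'update', which appears in all three documents.
print(tfidf[0, vocab['kernel']] > tfidf[0, vocab['update']])   # True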

Simply put, tf-idf can be used as a dimensionality reduction technique by removing superfluous words from the corpora, to speed up model training.

Tf-idf has many other uses as well, but in this series we’ll use it mainly for checking similarity to our spam corpora.

For an introduction to how VSM and TF-IDF work, I refer to Christian Perone’s introduction, both part 1 and part 2.

Dense matrix

The tf-idf vectorizer from sklearn actually returns a sparse matrix, which is often the sane thing to do if you have a massive corpus.

When classifying spam vs. ham though, the necessary corpus (for a single language) is often not that huge (if your domain is not too broad), so the increased memory requirements and training time resulting from the conversion are usually acceptable.

The reason we’ll convert the sparse output matrix to a dense matrix is simply that some of the algorithms we’ll use don’t work well with sparse data and instead require dense input.
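A minimal, self-contained sketch of what that conversion looks like (toy documents, illustrative only):

from sklearn.feature_extraction.text import TfidfVectorizer

sparse = TfidfVectorizer().fit_transform(['spam spam spam', 'ham and eggs'])
print(type(sparse))          # a scipy sparse matrix
dense = sparse.todense()     # what our DenseTransformer below will do
print(dense.shape)           # (2, number of unique terms)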

Pre-processing

We’ll be using English stemming in this series, with the help of the NLTK library. Vectorization (building a Vector Space Model) is a technique for representing text as numerical vectors, which is what CountVectorizer in sklearn does; but, as of writing this article, there still does not seem to be a native way in sklearn to do both TF-IDF and stemming in a single transformer, so we’ll create a custom one to do them both in one go:

from sklearn.feature_extraction.text import TfidfVectorizer


class StemmedTfidfVectorizer(TfidfVectorizer):

    def __init__(self, stemmer, **args):
        super(StemmedTfidfVectorizer, self).__init__(**args)
        self.stemmer = stemmer

    def build_analyzer(self):
        # Wrap the default analyzer so every token is stemmed before TF-IDF weighting
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (self.stemmer.stem(w) for w in analyzer(doc))

The stemmer comes from the initialization:

StemmedTfidfVectorizer(
    nltk.stem.SnowballStemmer('english'),
    min_df=1, stop_words='english', decode_error='ignore')
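A quick sanity check (toy documents, assuming the class above is defined) shows that tokens are stemmed before they reach the vocabulary:

import nltk

vectorizer = StemmedTfidfVectorizer(
    nltk.stem.SnowballStemmer('english'),
    min_df=1, stop_words='english', decode_error='ignore')
vectorizer.fit_transform(['connecting kernels', 'connected kernel'])
print(sorted(vectorizer.vocabulary_))   # ['connect', 'kernel']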

The sparse-to-dense transformation is just boilerplate sklearn code:

from sklearn.base import TransformerMixin


class DenseTransformer(TransformerMixin):

    def transform(self, X, y=None, **fit_params):
        # Convert the sparse TF-IDF output into a dense matrix
        return X.todense()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self

Let’s combine them in a list:

TFIDF = 'tfidf'
DENSE = 'dense'

# Transformers are the ones that first operate on the raw data to
# produce meaningful numerical representations of the
# individual words that the classifiers can operate on.
transformers = [
    (TFIDF, StemmedTfidfVectorizer(
        nltk.stem.SnowballStemmer('english'),
        min_df=1, stop_words='english', decode_error='ignore')),
    (DENSE, DenseTransformer()),
]
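A minimal sketch of wiring these together (assuming sklearn’s Pipeline and a hypothetical train_texts list holding the posts from part 1; the series’ own helper code comes in the next part):

from sklearn.pipeline import Pipeline

# Chain the two transformers: raw text goes in, dense TF-IDF features come out.
preprocessing = Pipeline(transformers)
features = preprocessing.fit_transform(train_texts)

# `features` is what the first-level classifiers below will be trained on.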

The classifiers

The ensemble we’ll use in this article contains the following six classifiers in the first layer. The parameters here come from cross-validation using sklearn’s GridSearchCV (an often quicker way is to use AutoML to search for optimal hyper-parameters, often in logarithmic time compared to sklearn’s linear time).

The hyper-parameters here will not work for your domain; they’re only here for reference! You HAVE to figure out on your own which params work for your data; there’s really no way around it. Books have been written about it, but a shortish intro to it can be read here.
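As a rough illustration of how such a search could look (the grid values and the train_features/train_labels variables are placeholders, not recommendations):

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Search a small grid of alpha values with 5-fold cross-validation,
# scoring on F1 since that's the metric we track for the ensemble.
search = GridSearchCV(
    MultinomialNB(),
    param_grid={'alpha': [0.01, 0.02, 0.1, 1.0]},
    scoring='f1', cv=5, n_jobs=-1)
search.fit(train_features, train_labels)
print(search.best_params_)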

The algorithms and hyper-parameters we’ll be using in this article can be seen below. The variety of algorithms here is more to show how diverse classification algorithms can be than to proclaim they’re a good choice for NLP (see the last part of this series for more advanced choices and theories on cutting-edge techniques).

Each algorithm below performed reasonably well on the ham/spam data set, and its inclusion improved the ensemble’s F1 score a bit (in the end, you have to evaluate all of them on your own domain and your own data set to know whether they work for you or not):

classifiers = [
    (BAYES, MultinomialNB(alpha=0.02)),
    (SVC, LinearSVC(penalty="l1", dual=False, tol=1e-3)),
    (PERCEPTRON, Perceptron(n_iter=50)),
    (LOG, LogisticRegression(
        penalty='l2', solver='sag', C=100,
        max_iter=250, n_jobs=-1)),
    (FOREST, RandomForestClassifier(
        n_estimators=500, max_depth=100,
        verbose=0, n_jobs=-1, oob_score=True)),
    (XGBOOST, XGBClassifier(
        nthread=-1, learning_rate=0.10, n_estimators=17,
        subsample=0.6, colsample_bytree=0.4, max_depth=4,
        silent=1, objective='binary:logistic',
        min_child_weight=4, gamma=6e-5, reg_alpha=1e-6,
        reg_lambda=2e-6)),
]

The second level of classification is what’s called the blending. That is, each of the six classifiers above will classify every example it sees, resulting in six outputs. These six outputs will be the input to the classifier in our blending layer (you could have multiple classifiers in the second level as well, pass their outputs to a third level, and so on, but the amount of data you can effectively use to train the classifiers shrinks something like quadratically with the number of levels you use).

We’ll use a simple logistic regression classifier for our blending layer (which is what is commonly used):

blenders = {
    BLENDER: LogisticRegression(
        C=15, penalty='l2', max_iter=150,
        n_jobs=-1, solver='sag')
}

Nota bene: personally, I have never seen a single practical reason for using more than two levels or more than one blender outside of ML competitions like Kaggle and Numer.ai; the accuracy gained beyond this is negligible compared to the added complexity in model maintenance and test coverage.

In a practical setting, as in industry, you don’t compete on a leaderboard; you instead try to minimize a loss function defined as a combination of hours per week required for maintenance, deployment, teaching others how it works, etc. Even a 1% gain in accuracy often does not warrant the 10% added complexity when your speech recognition model is already at human level.

Using multiple levels could be thought of as the following:
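Roughly, as a sketch (assuming the train/holdout split from the terminology section, and that X_train and X_holdout have already been run through the transformers above):

import numpy as np

# Level 1: every base classifier trains on the same training split (blending).
level1_outputs = []
for name, clf in classifiers:
    clf.fit(X_train, y_train)
    level1_outputs.append(clf.predict(X_holdout))

# Level 2: the six holdout predictions per example become the features
# the blender is trained on.
blend_features = np.column_stack(level1_outputs)
blender = blenders[BLENDER]
blender.fit(blend_features, y_holdout)

# A third level would repeat this pattern on yet another holdout split,
# which is why the usable training data shrinks with every extra level.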

In the next part, we’ll write some boilerplate code and helper methods to combine what we have been discussing in this part.

Please comment if you found something that wasn’t correct in this article; or if you have any other feedback, positive or negative, I’d be happy to hear about it!