[Figure: Confusion matrix for six-way classification (TF-IDF with Naive Bayes classifier)]

A complete NLP classification pipeline in scikit-learn

Go from corpus to classification with this full-on guide for a natural language processing classification pipeline.

What we’ll cover in this story:

Reading a corpus

Basic script structure, including logging , argparse and the if __name__ == '__main__' guard

Train/test split

Prior and posterior class probabilities

Baseline classification

Chain multiple features with FeatureUnion

Show the results in a Pandas DataFrame and a confusion matrix

The most important takeaways of this story are scikit-learn/sklearn's Pipeline , FeatureUnion , TfidfVectorizer and a visualisation of the confusion_matrix using the seaborn package, but more general bites such as the if __name__ == '__main__' guard, argparse , logging , zip and *args will also be covered.

Dataset

We’ll be using a dataset of Amazon reviews and the simple yet effective Naive Bayes classifier for the classification task. trainset.txt contains a corpus of reviews taken from the Johns Hopkins Multi-Domain Sentiment Dataset, converted to the following format in a whitespace-separated file.

music neg 575.txt the cd came as promised and in the condition promised . i 'm very satisfied

As you can see, the reviews are already tokenised with a whitespace tokeniser from the nltk package. Each review is on a single line, preceded by two tags and the identifier of the review:

a tag that specifies one of the six topics: books , camera , dvd , health , music , software .

a tag that indicates the sentiment expressed by the review, as a positive or negative value: pos , neg .

This dataset enables us to perform either a binary classification of sentiment or a multi-class classification of the topic of the review, and we’ll write our script in such a way that the user can specify which classification task to tackle.

Script structure: imports, logging and argparse

We’re setting up our pipeline using argparse and function flags such as use_sentiment , so that we can run both the binary ( pos | neg ) classification task and the multi-class classification task ( books | camera | dvd | health | music | software ) from the command line.

For those of you who are not familiar: argparse is a super-useful package that enables user-friendly command-line interfaces. If required arguments are missing, it shows an error, and it can list all of the arguments that can be used. Each argument value is preceded by its tag, e.g. --input , and whitespace:

$ python3 pipeline.py --input trainset.txt --binary

We’re also adding a verbosity flag --v and using Python’s logging module to output warnings, errors and info messages. After the arguments are parsed with args = parser.parse_args() , you can use their values via args.input and args.verbosity in your script.

Note that type=bool in argparse does not do what you might expect: every value arrives as a str , and any non-empty string (including "False" ) is truthy. To add a proper boolean flag, set action="store_true" , which takes the False boolean as default, and if the flag --binary is included, will automatically result in a True boolean.

We’ll be chaining all of the functions in this story in a main() function that is called by the if __name__ == '__main__' guard. When this file is run from the command line, the Python interpreter reads the source file and sets the __name__ variable to '__main__' . This way we can execute the file as a script, but also make it available as a module to import for other scripts, without automatically executing the statements in main() .
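The skeleton described above can be sketched as follows. This is a minimal sketch, not the author's exact code: the flag names ( --input , --binary , --v ) come from the story, while parse_args() and the defaults are assumptions.

```python
# Minimal sketch of pipeline.py's structure: argparse, logging, and the
# if __name__ == '__main__' guard. parse_args() is a hypothetical helper.
import argparse
import logging

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="NLP classification pipeline")
    parser.add_argument("--input", default="trainset.txt",
                        help="path to the corpus file")
    # action="store_true": defaults to False, flips to True when --binary is given
    parser.add_argument("--binary", action="store_true",
                        help="binary sentiment task instead of six-way topics")
    parser.add_argument("--v", dest="verbosity", type=int, default=1,
                        help="verbosity level")
    return parser.parse_args(argv)

def main():
    args = parse_args()
    logging.basicConfig(level=logging.DEBUG if args.verbosity > 1 else logging.INFO)
    logging.info("reading corpus from %s (binary=%s)", args.input, args.binary)
    # ...the rest of the pipeline is chained here...

if __name__ == '__main__':
    main()
```

Because of the guard, importing this module from another script defines main() without running it.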

Reading the corpus

First we’ll need to read our corpus trainset.txt . This function will make use of the --binary flag coming from our argparse function to determine whether we’re doing a binary or multi-class classification.
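A reader for the line format shown earlier could look like this. The function name read_corpus and the use_sentiment parameter are assumed names; the column layout (topic, sentiment, identifier, tokens) follows the sample line from the dataset.

```python
# Hypothetical corpus reader. Each line looks like:
# "<topic> <pos|neg> <id> <token> <token> ..."
def read_corpus(corpus_file, use_sentiment):
    documents, labels = [], []
    with open(corpus_file, encoding='utf-8') as f:
        for line in f:
            tokens = line.strip().split()
            if not tokens:
                continue
            documents.append(tokens[3:])  # the already-tokenised review
            # column 1 = sentiment tag, column 0 = topic tag
            labels.append(tokens[1] if use_sentiment else tokens[0])
    return documents, labels
```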

Train/test split

Now that we have our reviews in documents and our classes in labels , we’re going to split them into a training set and a test set for our classifier. We’re going to use a split of 80% training and 20% testing, using the slice notation [:] . First we need to shuffle our data to ensure that this slice does not influence the results: classes might be overrepresented in the train or test set, since we don’t know how the documents in our corpus are ordered. They might for instance be ordered alphabetically, which could result in having only the classes books | camera | dvd | health in our training set.

Since we’re creating a list of tuples such as [(doc1, 'neg'), (doc2, 'pos')] , we can use a neat Python idiom, zip(*...) , to iterate through this list and separate the tuples into a list of documents [doc1, doc2] and a list of labels ['neg', 'pos'] .

Note: although this function may seem a bit verbose, I included it because it is good to see what happens under the hood here. You can also use sklearn’s train_test_split function, which does essentially the same, or use k-fold cross-validation : splitting the dataset into train and test sets k times and averaging the k scores, to ensure that the splits influence the results as little as possible.
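The shuffle-and-slice split with zip(*...) can be sketched as below; the function name and the fixed seed are my own additions, not the author's code.

```python
import random

def train_test_split_80_20(documents, labels, seed=42):
    """Shuffle parallel lists of documents and labels, then slice 80/20."""
    paired = list(zip(documents, labels))   # [(doc1, 'neg'), (doc2, 'pos'), ...]
    random.Random(seed).shuffle(paired)     # shuffle to avoid ordering bias
    split = int(0.8 * len(paired))
    # zip(*...) transposes the list of tuples back into two sequences
    X_train, y_train = zip(*paired[:split])
    X_test, y_test = zip(*paired[split:])
    return list(X_train), list(y_train), list(X_test), list(y_test)
```

Seeding the shuffle keeps the split reproducible between runs.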

Prior and posterior class probabilities

For the classification task at hand we’ll be using a Naive Bayes classifier, which applies Bayes’ theorem: starting from the prior class probabilities, it incorporates the features fed to the classifier (such as tf-idf values or counts) to compute a posterior probability distribution over the classes, which should be more representative of the data.

To make sense of the posterior probabilities it is useful to compare them to the prior distribution. We will thus first calculate the prior probability of each class over all documents in our corpus, or, put the other way around: the probability that a given document in our corpus has a certain class. The posterior probabilities can later be computed with classifier.predict_proba .
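The prior is just the relative frequency of each class; a small sketch (the function name is an assumption):

```python
from collections import Counter

def prior_probabilities(labels):
    """P(class) = count(class) / total number of documents."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}
```

For a balanced binary corpus this yields 0.5 per class; for a balanced six-class corpus, roughly 0.167 per class.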

Baseline classification

To have something to compare our Naive Bayes classifier’s evaluation metric (accuracy) and confusion matrix against, we’re going to create a very simple baseline: using the random package, we randomly assign each document a label out of the set of possible labels. We could also create a baseline that takes the prior probabilities of each class into account.
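A uniform random baseline can be sketched in a few lines; the function name and the seed parameter are my own choices.

```python
import random

def random_baseline(y_test, seed=0):
    """Assign each test document a label drawn uniformly from the label set."""
    label_set = sorted(set(y_test))
    rng = random.Random(seed)
    return [rng.choice(label_set) for _ in y_test]
```

With six equally likely labels, this guesses right about 1/6 of the time, matching the baseline accuracies reported later.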

Multiple features with FeatureUnion

For this classification task we’re going to add three features that are included in the classifier:

Count vectoriser with POS-tags appended to each token TF-IDF vectoriser (which weighs term frequency by inverse document frequency) An example of feature engineering where the document length is included, via a pipeline that maps feature-value dictionaries to vectors with DictVectorizer .

Now this code is a bit complex, but it is merely an example of how multiple features can be appended in one FeatureUnion pipeline, even nesting entire pipelines, as done in (3.).

I’ve added a flag for each of the features in the function feature_union() , so that you’re able to turn features on and off accordingly.
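A sketch of what such a FeatureUnion with per-feature flags can look like. This is not the author's code: build_classifier , identity and doc_length_features are assumed names, and for brevity the count vectoriser here uses plain tokens rather than tokens with POS-tags appended (which would require e.g. nltk.pos_tag ).

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.naive_bayes import MultinomialNB

def identity(x):
    return x  # our documents are already tokenised lists

def doc_length_features(docs):
    # one feature-value dict per document, for DictVectorizer
    return [{'length': len(doc)} for doc in docs]

def build_classifier(use_counts=True, use_tfidf=True, use_length=True):
    features = []
    if use_counts:
        features.append(('counts', CountVectorizer(preprocessor=identity,
                                                   tokenizer=identity)))
    if use_tfidf:
        features.append(('tfidf', TfidfVectorizer(preprocessor=identity,
                                                  tokenizer=identity)))
    if use_length:
        # a nested pipeline: extract dicts, then map them to vectors
        features.append(('length', Pipeline([
            ('extract', FunctionTransformer(doc_length_features)),
            ('vectorise', DictVectorizer()),
        ])))
    return Pipeline([('features', FeatureUnion(features)),
                     ('clf', MultinomialNB())])
```

Passing identity as both preprocessor and tokenizer is the usual trick to make the vectorisers accept pre-tokenised input.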

Showing the results

Now we’re also going to have to show our results. We can use sklearn’s classification_report , accuracy_score and confusion_matrix , with a bit of beauty from the seaborn package for that last one.

tabular_results() creates a Pandas DataFrame with the tokenised sentences, actual labels, predicted labels and the prior/posterior probabilities. class_report() shows the accuracy scores for the classifier and has a flag show_matrix for showing the beautiful visualisation of the confusion matrix using the vis() function.
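A sketch of what class_report() with its show_matrix flag can look like; the exact signature is an assumption, and returning the accuracy is my own addition.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def class_report(y_true, y_pred, labels, show_matrix=False):
    """Print accuracy and a per-class report; optionally plot the confusion matrix."""
    acc = accuracy_score(y_true, y_pred)
    print('accuracy:', acc)
    print(classification_report(y_true, y_pred, labels=labels))
    if show_matrix:
        # lazy imports: plotting dependencies are only needed for the heatmap
        import pandas as pd
        import seaborn as sns
        import matplotlib.pyplot as plt
        cm = confusion_matrix(y_true, y_pred, labels=labels)
        sns.heatmap(pd.DataFrame(cm, index=labels, columns=labels),
                    annot=True, fmt='d', cmap='Blues')
        plt.ylabel('actual')
        plt.xlabel('predicted')
        plt.show()
    return acc
```

Wrapping the confusion matrix in a labelled DataFrame is what gives the seaborn heatmap its readable class names on both axes.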

Put the pipeline together!

We read our corpus > split the data in train/test > compute prior probabilities > create a FeatureUnion of our three features > fit the classifier to the data > make predictions > compute posterior probabilities > create a DataFrame > report results for baseline > report results for Naive Bayes.

Accuracy scores for our baseline hover around 0.16/0.17 for the six-class classification and around 0.5 for the binary classification, which is logical, since random guessing scores roughly 1/number of classes. Naive Bayes reaches an accuracy of 0.685 with all three features combined, and its highest score, 0.901, when using only tf-idf vectors. This shows that more feature engineering does not always yield better results!