It has become very easy to do Machine Learning: you install an ML library like scikit-learn or xgboost, choose an estimator, feed it some training data, and get a model which can be used for predictions.

Ok, but what's next? How would you know if it works well? Cross-validation! Good! How would you know that you haven't messed up the cross-validation? Are there data leaks? If the quality is not good enough, how do you improve it? Are there data preprocessing errors or other software bugs? ML systems are notably good at hiding bugs - they can adapt, so often a bug causes only a small quality drop while the system as a whole still works. Should we put the model into production? Can the model be trusted to do reasonable things? Are there pitfalls? How do you convince others that the system works in a reasonable way?

There is no silver bullet; I don't know the true answer to these questions. But understanding how the model "thinks" - what its decisions are based on - should be a big part of the answer.

AI-powered robots haven't conquered us (yet), so let's start with a 19th century Machine Learning method: Linear Regression.

Machine Learning in 1800s

As a concrete example, let's say we want to estimate pizza price using Linear Regression. We think that pizza radius, the number of salami slices and the number of tomato slices could affect the price (i.e. we've defined 3 features: radius, salami count, tomato count). So we walk around our XIX century town, visit every pizzeria, order a coffee and take notes on the pizzas being sold: price, radius, salami, tomato. After a few gallons of coffee we can derive a formula, based on the notes:

price = 1.5 ✕ radius + 0.4 ✕ salami + 0.1 ✕ tomato

Coefficients 1.5, 0.4 and 0.1 are selected such that the price is not too far off for the pizzas we've seen. What we did is a Linear Regression: the result is computed as

$latex y = w_0 x_0 + w_1 x_1 + ... + w_n x_n &s=1$

- a weighted sum of inputs. $latex w_0, w_1, ..., w_n&s=1 $ are regression parameters (weights, coefficients) which we adjust based on training data; $latex x_0, x_1, ..., x_n&s=1$ are input variables (e.g. pizza radius or a number of salami pieces). Formula can be also written in a vector form: $latex y = x^T w &s=1$
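As a minimal sketch, the pizza formula can be written as a weighted sum in plain Python. The names `WEIGHTS` and `predict_price` are made up for this illustration, and the example pizza is invented:

```python
# Coefficients from the pizza formula above: one weight per feature.
WEIGHTS = {"radius": 1.5, "salami": 0.4, "tomato": 0.1}


def predict_price(features):
    """Return the weighted sum of feature values: y = w_0*x_0 + ... + w_n*x_n."""
    return sum(WEIGHTS[name] * value for name, value in features.items())


# A hypothetical pizza: radius 10, 4 salami slices, 5 tomato slices.
pizza = {"radius": 10, "salami": 4, "tomato": 5}
price = predict_price(pizza)
print(round(price, 2))
```

Nothing here depends on how the weights were found; the function only needs the final coefficients.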

Most people agree that $latex price = 1.5 \times radius + 0.4 \times salami + 0.1 \times tomato&s=1$ is pretty understandable. Looking at coefficients of a Linear Regression can be enough for humans to see what's going on. For example, we can see that in some sense salami is more important than tomatoes - salami count is multiplied by 0.4, while tomato count is multiplied only by 0.1; this can be useful information. We can also see how much adding a piece of salami or increasing radius by 1cm affects the price.

There are caveats though. If scales of features are not the same then comparing coefficients can be misleading - maybe there are 25 tomato slices on average, and only 2 salami slices on average (weird 19th century pizzas!), and so tomatoes contribute much more to the price than salami, despite their coefficient being lower. It is also obvious that radius and salami coefficients can't be compared directly. Another caveat is that if features are not independent (e.g. there is always an extra tomato per salami slice), interpreting coefficients gets trickier.
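The scale caveat can be checked with a few lines of arithmetic. This is only a sketch: the averages are the ones from the paragraph above, and the variable names are invented for the illustration.

```python
# Coefficients from the pizza formula.
coefficients = {"salami": 0.4, "tomato": 0.1}
# Average slice counts from the example in the text.
average_counts = {"salami": 2, "tomato": 25}

# Average contribution of a feature to the price: coefficient * typical value.
average_contribution = {
    name: coefficients[name] * average_counts[name] for name in coefficients
}

for name, value in average_contribution.items():
    print(name, round(value, 2))
```

Tomatoes have the smaller coefficient, yet on an average pizza they add more to the price than salami does.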

One more observation is that to explain the behavior we didn't care how the model was trained (how we came up with radius/salami/tomato coefficients), we only needed to know the final formula (algorithm) used at prediction time. It means that we can look at Ridge or Lasso or Elastic Net regression the same way, as they are the same at prediction time.

That said, understanding the training process can be important for understanding the behavior of the system. For example, if two features are correlated, Ridge regression tends to keep both features, but set lower coefficients for them, while Lasso may eliminate one of the features (set its weight to zero) and use a high coefficient for the remaining feature. It means that e.g. in Ridge regression you're more likely to miss an important feature if you look at top coefficients, and in Lasso a feature can get a zero weight even if it is almost as important as the top feature.

So, there are two lessons. First, looking at coefficients is still helpful, at least as a sanity check. Second, it is good to understand what you're looking at, because there are caveats.

It is no longer 19th century: we don't have to walk around beautiful Italian towns, drink coffee and eat pizzas to collect a dataset, we can now go to the Internet. Likewise, for Linear Regression we can use libraries like numpy or scikit-learn instead of visiting a library, armed with quill and paper.

Linear Regression in 2010s

Let's apply Linear Regression to an example task: predict house pricing based on attributes like town crime rate, pollution, house location, etc. We'll use "Boston Housing" dataset; it is available in scikit-learn:

    # Note: load_boston was removed in scikit-learn 1.2;
    # this snippet assumes an older release.
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LinearRegression

    boston = load_boston()
    reg = LinearRegression()
    reg.fit(boston.data, boston.target)

Internally LinearRegression produces a formula similar to our pizza's formula. To check its coefficients we can look at reg.coef_ attribute:

    >>> reg.coef_
    array([-1.07170557e-01,  4.63952195e-02,  2.08602395e-02,  2.68856140e+00,
           -1.77957587e+01,  3.80475246e+00,  7.51061703e-04, -1.47575880e+00,
            3.05655038e-01, -1.23293463e-02, -9.53463555e-01,  9.39251272e-03,
           -5.25466633e-01])

Linear regression is supposed to be readable, but result above is not; what is unclear is which coefficient corresponds to which feature. So we need to combine these coefficients with feature names; this is easy:

    def get_formula(reg, feature_names):
        # One "+coef*name" term per feature, joined by spaces.
        return " ".join([
            "{:+.4f}*{}".format(coef, name)
            for coef, name in zip(reg.coef_, feature_names)
        ])

    >>> print(get_formula(reg, boston.feature_names))
    -0.1072*CRIM +0.0464*ZN +0.0209*INDUS +2.6886*CHAS -17.7958*NOX +3.8048*RM +0.0008*AGE -1.4758*DIS +0.3057*RAD -0.0123*TAX -0.9535*PTRATIO +0.0094*B -0.5255*LSTAT

The result is still scary, but at least we can check if a feature contributes positively or negatively to a price. For example, CRIM (crime level) is a negative factor, while CHAS (if a river is nearby) is a positive factor. It is not possible to compare coefficients directly because scales of features are different; we may normalize data to make scales comparable using e.g. preprocessing utilities from scikit-learn - try it yourselves.

To make inspecting coefficients easier we created the eli5 Python library. It can do much more than that, but it started from a snippet similar to the one above, which we were copy-pasting across projects.

It shows the same coefficients, but there is also a "<BIAS>" feature. We forgot about it when writing the ``get_formula`` snippet: LinearRegression by default creates a feature which is 1 for all examples (it is called "bias" or "intercept"); its weight is in the ``reg.intercept_`` attribute.
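A formula helper that does not forget the bias could look roughly like this. It is a sketch only: `format_formula` is a hypothetical name, the two coefficients are copied from the output shown earlier, and the intercept value 30.0 is made up for the illustration.

```python
def format_formula(coefs, feature_names, intercept):
    """Format a linear model as text, including the bias (intercept) term."""
    terms = [
        "{:+.4f}*{}".format(coef, name)
        for coef, name in zip(coefs, feature_names)
    ]
    # The bias is the weight of a feature which is 1 for all examples.
    terms.append("{:+.4f}*<BIAS>".format(intercept))
    return " ".join(terms)


# Two coefficients from the earlier output; the intercept is invented.
formula = format_formula([-0.1072, 2.6886], ["CRIM", "CHAS"], 30.0)
print(formula)
```

With a fitted scikit-learn model one would pass ``reg.coef_``, the feature names and ``reg.intercept_`` instead of the literal values.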

``eli5`` knows where to get coefficients from for a wide range of ML models from several ML frameworks. It provides utilities for common tasks (e.g. you can check only top features or filter them by name), can output to IPython, html, JSON or plain text. ELI5 can also remind you about caveats of the interpretation method - for example, we can get this for our Linear Regression:

So, the lesson here is that machine learning libraries like scikit-learn expose coefficients of trained ML models; it is possible to inspect them, and eli5 library makes it easier.

Text Classification using Linear Models

Let's say we know nothing about Machine Learning, and want to classify documents into several classes - for example, as documents about computer graphics or documents about medicine. If you give this task to someone smart without any ML experience, they may propose to solve it this way:

1) Find some keywords specific for categories. For example, 'computer', 'graphics', 'photoshop' and '3D' for computer graphics, and 'kidney', 'treatment', 'pill' for medicine.
2) Count how many of the keywords from each set are in the document.
3) Category which gets more keywords wins.

We can write it this way: $latex y = computer + graphics + photoshop - kidney - treatment - pill$; if $latex y > 0 &s=1$ then text is about computer graphics; if it is less than zero then we have a medical document.
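The keyword-counting idea is short enough to sketch in plain Python. The function names are invented for this illustration, tokenization is a naive whitespace split, and the "undecided" answer for a tie is an assumption added here, not part of the recipe above.

```python
CG_KEYWORDS = {"computer", "graphics", "photoshop", "3d"}
MEDICINE_KEYWORDS = {"kidney", "treatment", "pill"}


def keyword_score(text):
    """y = (number of CG keywords present) - (number of medical keywords present)."""
    words = set(text.lower().split())
    return len(words & CG_KEYWORDS) - len(words & MEDICINE_KEYWORDS)


def classify(text):
    y = keyword_score(text)
    if y > 0:
        return "computer graphics"
    if y < 0:
        return "medicine"
    return "undecided"  # tie or no keywords at all (assumption of this sketch)


print(classify("I edited the 3D graphics in photoshop"))
print(classify("take one pill after the kidney treatment"))
```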

A smart person may also notice that some keywords can be more important than others - if there is 'photoshop' in the text then the text is very likely to be about CG (computer graphics), but the word 'pen' can be only a small indicator. So to improve the quality one can assign each word a weight, e.g.:

$latex y = 1.0 \times computer + 1.5 \times graphics + 2.0 \times photoshop - 5.0 \times kidney - 0.5 \times treatment - 0.5 \times pill &s=1$

Many smart people are lazy, so they likely won't be fond of the idea of adjusting all these coefficients by hand. A better idea could be to take documents about CG and documents about medicine, then write a program to find the best coefficients automatically.

It already starts to look suspiciously similar to the pizza Linear Regression formula, doesn't it? The difference is that we are not interested in the ``y`` value per se, we want to know if it is greater than zero or not.

Such a "y" function is often called a "decision function": we compute "y", check if it is greater or less than zero, and make a yes/no decision. In fact, this is a very common approach: for example, at prediction time a linear Support Vector Machine (SVM) works exactly like our "y" function. Congratulations - now you know how linear SVMs work at prediction time! If you look at the coefficients of a linear SVM classifier applied to text classification using "bag-of-words" features (similar to what we've done), you'll be looking at the same kind of weights as in our example. There is a weight (coefficient) per word; to make a prediction, a linear SVM computes a weighted sum of the tokens present in a document, just like our "y" function, and then compares the result to 0.

We may also notice that a larger absolute value of "y" (positive or negative) means that we're more certain a document is about CG or about medicine (it has more relevant keywords), while ``y`` close to zero means we either don't have enough information, or keywords cancel each other.

Let's say we calculated "y" and got the value "-2.5". What does $latex y=-2.5&s=1$ mean? To make the "y" value easier to interpret it'd be nice for it to be in the range from 0 to 1 - this way we can think about it as a probability. For example, when the keyword sum is a very large negative number, "y" could be close to 0 (the probability of a document being a CG document is close to 0), when there is no information "y" could be 0.5, and when the sum is a large positive number "y" could be close to 1.0.

To implement this idea one can use a function which transforms original, unbounded scores, to (0, 1) range: $latex y = f(1.0 \times computer + ... - 5.0 \times kidney - ...)&s=1$

So we need a function which takes a value in arbitrary range, and returns a number from 0 to 1. There are many options, but if we take "Logistic function" as such function

$latex f(x) = \frac{1}{1+e^{-x}}&s=3$

then we get a Machine Learning model called Logistic Regression. Congratulations - you now know how Logistic Regression works at prediction time!
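Here is a small sketch of that prediction step, using the hand-picked keyword weights from the formula above. `WEIGHTS`, `logistic` and `cg_probability` are names invented for the illustration, and a word is counted once if it is present in the text.

```python
import math

# Hand-picked weights from the weighted keyword formula above.
WEIGHTS = {
    "computer": 1.0, "graphics": 1.5, "photoshop": 2.0,
    "kidney": -5.0, "treatment": -0.5, "pill": -0.5,
}


def logistic(x):
    """Map an unbounded score to the (0, 1) range."""
    # Fine for a sketch; very large negative x would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-x))


def cg_probability(text):
    """Weighted sum of the keywords present, squashed by the logistic function."""
    words = set(text.lower().split())
    score = sum(weight for word, weight in WEIGHTS.items() if word in words)
    return logistic(score)


print(cg_probability("photoshop graphics tutorial"))  # close to 1
print(cg_probability("kidney treatment"))             # close to 0
print(cg_probability("hello world"))                  # no information: 0.5
```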

Note that at prediction time Logistic Regression and Linear SVM do exactly the same thing if you only need yes/no labels and don't need probabilities. But they still differ in how weights are selected during training, i.e. for the same training data you'll get different weights (and so different predictions). Logistic Regression chooses the best weights for good probability approximation, while Linear SVM chooses weights such that decisions are separated by a larger margin; it is common to get a tiny bit higher yes/no accuracy from a linear SVM, but linear SVMs as-is won't give you a probability score.

Now that you know how Logistic Regression and linear SVMs work and what their coefficients mean, it is time to apply them to a text classification task and check how they are making their predictions. These simple linear models are surprisingly strong baselines for text classification, and they are easy to inspect and debug; if you have a text classification problem it is a good idea to try text-based features and a linear model first, even if you want to go fancy later.

Scikit-Learn docs have a great tutorial on text processing using bag-of-words features and simple ML models. The task in the tutorial is almost the same as in our example: classify a text message as a message about computer graphics, medicine, atheism or Christianity. This tutorial uses 4 possible classes, not two. We only discussed how to classify a text document into two classes (CG vs medicine), but don't worry.

A common way to do multi-class classification (and the way which is used by default in most of scikit-learn) is to train a separate 2-class classifier for each class. So under the hood there will be 4 classifiers: CG / not CG, medicine / not medicine, atheism / not atheism, Christianity / not Christianity. Then, at prediction time, all four classifiers are employed; to get a final answer the highest-scoring prediction among all classifiers is used.
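The final step, picking the highest-scoring classifier, can be sketched as follows. The scores are made-up decision function values for one document, and `predict_class` is a hypothetical helper name.

```python
def predict_class(scores):
    """Return the class whose 2-class classifier gave the highest score."""
    return max(scores, key=scores.get)


# Invented "y" values from the four per-class classifiers for one document.
scores = {
    "comp.graphics": -0.3,
    "sci.med": 1.2,
    "alt.atheism": -1.1,
    "soc.religion.christian": -0.8,
}
predicted = predict_class(scores)
print(predicted)
```

Note that the winner does not need a positive score; it only needs to score higher than the other three.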

It means that instead of inspecting a single classifier we'll be inspecting 4 different classifiers which work together to get us an answer.

Looking into Text Classifier

First, let's load the data, as in the tutorial:

    from sklearn.datasets import fetch_20newsgroups

    categories = ['alt.atheism', 'soc.religion.christian',
                  'comp.graphics', 'sci.med']
    twenty_train = fetch_20newsgroups(
        subset='train',
        categories=categories,
        shuffle=True,
        random_state=42
    )
    twenty_test = fetch_20newsgroups(
        subset='test',
        categories=categories,
        shuffle=True,
        random_state=42
    )

The final model shown in the tutorial is a linear SVM trained on TF*IDF bag-of-words features using the SGD training algorithm. We already know how a linear SVM works at prediction time, and we don't care about the training algorithm.

TF*IDF bag-of-words features are very similar to the "bag-of-words" features we used before - there is still a coefficient per word. The difference is that instead of counting words or simply checking if a word is in a document, a more complex approach is used: word counts are now normalized according to document length, and the result is downscaled for words that occur in many documents (very common words like "he" or "to" are likely to be irrelevant).
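To make the idea concrete, here is a rough TF*IDF sketch in plain Python. It uses one common smoothed IDF variant with L2 length normalization and a naive whitespace tokenizer; scikit-learn's TfidfVectorizer has its own tokenization and options, so treat the exact numbers as illustrative. The function name `tfidf` and the three toy documents are invented.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # In how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw count, downscaled for words which occur in many documents.
        weights = {
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
        # Normalize by vector length, so long documents don't dominate.
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        vectors.append({word: v / norm for word, v in weights.items()})
    return vectors


docs = ["he likes graphics", "he takes a pill", "he said to"]
vectors = tfidf(docs)
print(vectors[0])
```

In the first document "he" and "graphics" both occur once, but "he" gets a lower weight because it appears in every document.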

    from sklearn.linear_model import SGDClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline

    vec = TfidfVectorizer()
    # n_iter is the parameter name in scikit-learn 0.18.1;
    # newer releases use max_iter instead.
    clf = SGDClassifier(loss='hinge', penalty='l2',
                        alpha=1e-3, n_iter=5, random_state=42)
    text_clf = Pipeline([
        ('vect', vec),
        ('clf', clf),
    ]).fit(twenty_train.data, twenty_train.target)
    print(text_clf.score(twenty_test.data, twenty_test.target))

The quality of this simple pipeline is quite good (0.913 accuracy). But let's check how this classifier works internally, what coefficients it learned:

Here we have many more parameters than in previous examples - a parameter per word per class; there are 4 classes and 20K+ words, so looking at all parameters isn't feasible. Instead of displaying everything eli5 shows only the parameters with the largest absolute values - these parameters are usually more important (of course, there are caveats).

We can see that a lot of these words make sense - "atheism" is a relevant word for atheism-related messages, "doctor" is a good indicator that a text is a medical text, etc. But, at the same time, some of the words are surprising: why do "keith" and "mathew" indicate a text about atheism, and "pitt" indicates a medical text? It doesn't sound right, something is going on here.

Let's find this mysterious Mathew in the training data:

    text = [d for d in twenty_train.data if 'mathew' in d.lower()][0]
    print(text)

    From: mathew
    Subject: Re: ( I am almost sure that Zyklon-B is immediate and painless method of
    > death. If not, insert soem other form. )
    >
    > And, ethnic and minority groups have been killed, mutilated and
    > exterminated through out history, so I guess it was not unusual.
    >
    > So, you would agree that the holocost would be allowed under the US
    > Constitution? [ in so far, the punishment. I doubt they recieved what would
    > be considered a "fair" trial by US standards.

    Don't be so sure. Look what happened to Japanese citizens in the US during
    World War II. If you're prepared to say "Let's round these people up and
    stick them in a concentration camp without trial", it's only a short step to
    gassing them without trial. After all, it seems that the Nazis originally
    only intended to imprison the Jews; the Final Solution was dreamt up partly
    because they couldn't afford to run the camps because of the devastation
    caused by Goering's Total War. Those who weren't gassed generally died of
    malnutrition or disease.

    mathew

Aha, we have messages as training examples, and some guy named Mathew wrote some of them. His name is in the message header (From: mathew...), and in the message footer. So instead of focusing on message content, our ML system found an easier way to classify messages: just remember person names and email addresses of notable message authors. It may depend on the task, but most likely this is not what we wanted the model to learn. Likely we wanted to classify message content, not message authors.

It also means that our accuracy scores are likely too optimistic. There are messages mentioning Mathew both in the training and testing parts, so the model can use the message author name to get score points. A model which thinks "Oh, this is my good old friend Mathew! He only talks about atheism, I don't care much about what he's saying" can still get some accuracy points, even if it does nothing useful for us.

A lesson learned: by inspecting model parameters it is sometimes possible to check if the model is solving the problem we think it is solving.

It doesn't make sense to try more advanced models or tune parameters of the current model at this point: it looks like there is a problem in task specification, and evaluation setup is also not correct for the task we're solving (assuming we're interested in message texts).

So it could give us at least two ideas: 1) probably we could get a better train/test split for the data if messages by the same author (or mentioning the same author, e.g. via replying) only appear either in the train or in the test part, but not in both; 2) to train a useful classifier on this data it could make sense to remove message headers, footers, quoting, email addresses, to make the model focus on message content - such a model could be more useful on unseen data.
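The first idea, an author-aware split, could be sketched like this. It is an illustration only: `split_by_author` is a hypothetical helper, the messages are invented, and assigning authors to buckets by hashing their names is just one possible approach. scikit-learn also ships group-aware splitters (e.g. GroupKFold) which may be a better fit in practice.

```python
import hashlib


def split_by_author(messages, test_fraction=0.25):
    """Split (author, text) pairs so that no author is in both parts."""
    train, test = [], []
    for author, text in messages:
        # A stable hash of the author name decides the part, so all
        # messages by the same author always end up on the same side.
        digest = hashlib.md5(author.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        if bucket < test_fraction * 100:
            test.append((author, text))
        else:
            train.append((author, text))
    return train, test


# Invented messages; real code would extract authors from message headers.
messages = [
    ("mathew", "message 1"), ("keith", "message 2"),
    ("mathew", "message 3"), ("pitt", "message 4"),
    ("keith", "message 5"),
]
train, test = split_by_author(messages)
print(sorted({author for author, _ in train}))
print(sorted({author for author, _ in test}))
```

The actual test share only approximates `test_fraction`, because whole authors are moved at once.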

But does the model really only care about Mathew in the example? Until now, we were checking model coefficients; it allows us to get some general feeling of how the model works. But this method has a downside: it is not obvious why a decision was made on a concrete example.

A related downside is that coefficients depend on feature scales; if features use different scales we can't compare coefficients directly. While indicator bag-of-words features (1 if a word is in a document and 0 otherwise) use the same scale, with TF*IDF features input values are different for different words. It means that for TF*IDF a coefficient with the top weight is not necessarily the most important one, as in the input data the word weight could be low because of the IDF multiplier, and a high coefficient just compensates for this.

We only looked at coefficients for words, but we haven't checked which words are in the document, or which values the coefficients are multiplied by. Previously we were looking at something like $latex y = 2.0 \times atheism + 1.9 \times keith + 1.4 \times mathew + ...$ (for all possible words). For a concrete example, however, the values of "mathew", "from" and other words are known - they can be raw word counts in the document, 0/1 indicator values, or TF*IDF weighted counts, as in our example. The list of words is also much smaller, because for most words the value is zero. Multiplying each known value by its coefficient gives the contribution of that word to the decision.
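As a rough sketch of that computation: for a single document, each word contributes coefficient × feature value to "y". The coefficients below come from the example formula above, while the feature values are invented TF*IDF weights and `explain` is a hypothetical helper name.

```python
def explain(coefficients, feature_values):
    """Return (word, contribution) pairs, largest absolute contribution first."""
    contributions = {
        # Words the model has no coefficient for contribute nothing.
        word: coefficients.get(word, 0.0) * value
        for word, value in feature_values.items()
    }
    return sorted(contributions.items(),
                  key=lambda item: abs(item[1]), reverse=True)


# Coefficients from the example formula.
coefficients = {"atheism": 2.0, "keith": 1.9, "mathew": 1.4}
# Invented TF*IDF values for the words present in one document.
feature_values = {"mathew": 0.5, "atheism": 0.1, "the": 0.3}

ranked = explain(coefficients, feature_values)
for word, contribution in ranked:
    print(word, round(contribution, 2))
```

Here "mathew" outranks "atheism" despite its lower coefficient, because its value in this particular document is higher.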

ELI5 provides a helper to do that computation; even better, it knows how to work with scikit-learn text processing utilities, so instead of showing a table with contribution values it can show these word contributions by highlighting them in text:

Green highlighting means positive contribution, red means negative.

It seems the classifier still uses words from the message text, but names like "Mathew", email addresses, etc. look more important to the classifier. So yeah, even without the author name the classifier likely makes a correct decision for this example, but it focuses mostly on the wrong parts of the message.

Let's try one of the ideas - to make the classifier more useful remove message headers, footers and emails from the training data. We would have to write some code for it, but for this particular dataset something similar is already implemented in scikit-learn:

    twenty_train = fetch_20newsgroups(
        subset='train',
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=['headers', 'footers', 'quotes'],
    )
    twenty_test = fetch_20newsgroups(
        subset='test',
        categories=categories,
        shuffle=True,
        random_state=42,
        remove=['headers', 'footers', 'quotes'],
    )

After re-training of the original pipeline, accuracy becomes much worse (0.796 instead of 0.913). There are two main reasons for that:

1) Model is no longer able to use author names, emails, etc.; it must learn how to distinguish messages based only on text content, which is a harder (and arguably a more realistic) task.
2) We've removed some useful information as well, e.g. message subject or text of quoted messages. We should try to bring this information back, but we need to be very careful with evaluation: for example, messages from the test set shouldn't quote messages from train set, and vice versa.

Let's check weights of the updated model:

Preprocessing helped - all (or most) of author names are gone, and feature list makes more sense now.

Stop Words

Some of the features still look strange though - why is "of" the most negative word for computer graphics documents? It doesn't make sense. "Of" is just a common word which appears in many documents. Probably, all other things equal, a document is less likely to be a computer graphics document, and the model learned to use a common, "background" word "of" to encode this information.

A classical approach for improving text classification quality is to remove "stop words" from text - generic words like "of", "it", "was", etc., which shouldn't be specific to a particular topic. The idea is to make it easier for the model to learn something useful. In our example we'd like the model to focus more on the topic-specific words and use a special "bias" feature instead of relying on these "background" words.

There are stop words lists for many languages; scikit-learn has a list of such words for English built-in:

    >>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    >>> print(ENGLISH_STOP_WORDS)
    frozenset({'must', 'across', 'afterwards', 'back', 'besides', 'itself',
               'noone', 'along', 'some', 'them', 'why', 'de', 'on', 'am',
               'three', 'such', 'were', 'fill', 'if', 'ten',
               <...snip...>
               'becomes', 'all', 'detail', 'except', 'is', 'show', 'cannot',
               'this', 'side', 'last', 'well', 'mine', 'wherein', 'bottom',
               'least', 'others', 'a', 'inc', 'within', 'after', 'done',
               'might', 'everyone', 'name', 'none', 'up', 'was', 'below',
               'they', 'therein', 'found', 'thin'})

There is a TfidfVectorizer argument to use this stop words list; let's try it:

    vec = TfidfVectorizer(stop_words='english')
    clf = SGDClassifier(loss='hinge', penalty='l2',
                        alpha=1e-3, n_iter=5, random_state=42)
    text_clf = Pipeline([
        ('vect', vec),
        ('clf', clf),
    ]).fit(twenty_train.data, twenty_train.target)
    print(text_clf.score(twenty_test.data, twenty_test.target))

Nice, the accuracy improved from 0.796 to 0.819. If we check model weights using "eli5.show_weights" we'll see that the word "of" is no longer in the table. Let's also check it on a concrete example:

Mmm, it looks like many "background" words are no longer highlighted, but some of them still are. For example, "don" in "don't" is green, and "weren" in "weren't" is also green. It looks suspicious, and indeed - we've spotted an issue with scikit-learn 0.18.1 and earlier: stop words list doesn't play well with the default scikit-learn tokenizer. Tokenizer splits contractions (words like "don't") into two parts, but stop words list doesn't include first parts of these contractions.

Let's add such tokens to the stop words list:

    stop_words = ENGLISH_STOP_WORDS.copy() | {
        'weren', 'don', 'isn', 'couldn', 'wasn'
    }
    vec = TfidfVectorizer(stop_words=stop_words)
    clf = SGDClassifier(loss='hinge', penalty='l2',
                        alpha=1e-3, n_iter=5, random_state=42)
    text_clf = Pipeline([
        ('vect', vec),
        ('clf', clf),
    ]).fit(twenty_train.data, twenty_train.target)
    text_clf.score(twenty_test.data, twenty_test.target)

Accuracy improved a tiny bit - 0.820 instead of 0.819.

A lesson learned: by looking at model weights and prediction explanations it is possible to spot preprocessing bugs. This particular bug was there in scikit-learn for many years, but it was only reported recently, while it was easy for us to find this bug just by looking at "eli5.explain_prediction" result. If you're a reader from the future then maybe this issue is already fixed; examples use scikit-learn 0.18.1.

We may also notice that last parts of contractions ("t" in "don't") are not highlighted, unlike first parts ("don"). But "t" is not in the stop words list, just like "don". What's going on? The reason is that default scikit-learn tokenizer removes all single-letter tokens. This piece of information is not mentioned in scikit-learn docs explicitly, but the gotcha becomes visible if we inspect the prediction result. By looking at such explanations you may get a better understanding of how a library works.
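Both tokenizer gotchas can be reproduced with a regular expression. The sketch below assumes the default token pattern of scikit-learn's text vectorizers is `(?u)\b\w\w+\b` (tokens of two or more word characters); check the `token_pattern` argument of your scikit-learn version before relying on it. The `tokenize` helper is made up for the illustration.

```python
import re

# Assumed default token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and extract tokens, similar to the default analyzer."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("Don't worry, they weren't a problem")
print(tokens)
```

The apostrophe is not a word character, so "don't" is cut into "don" and "t"; the single-letter tokens "t" and "a" are then dropped because the pattern requires at least two characters.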

Another lesson is that even with bugs the pipeline worked overall; there was no indication something is wrong, but after fixing the issue we've got a small quality improvement. Systems based on Machine Learning are notably hard to debug; one of the reasons is that they often can adapt to such software bugs - usually it just costs us a small quality drop. Any additional debugging and testing instrument is helpful: unit tests, checking of the invariants which should hold, gradient checking, etc.; looking at model weights and inspecting model predictions is one of these instruments.

N-grams

So far we've only used individual words as features. There are other ways to extract features from text. One common way is to use "n-grams" - all subsequences of a given length. For example, in a sentence "The quick brown fox" word 2-grams (word bigrams) would be "The quick", "quick brown" and "brown fox". So instead of having a parameter per individual word we could have a parameter per such bigram. Or, more commonly, we may have parameters both for individual words and n-grams. It allows us to "catch" short phrases - often only a phrase has a meaning, not the individual words it consists of.

It is also possible to use "char n-grams" - instead of splitting text into words one can use a "sliding window" of a given length. "The quick brown fox" can be converted to char 5-grams as "The q", "he qu", "e qui", etc. This approach can be used when one wants to make a classifier more robust to word variations and typos, and to make better use of related words.
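Both kinds of n-grams are easy to sketch in plain Python. The helper names are invented, and the sketch skips the lowercasing and whitespace handling a real vectorizer would do:

```python
def word_ngrams(text, n):
    """All runs of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """A sliding window of n characters, moved one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "The quick brown fox"
print(word_ngrams(sentence, 2))
print(char_ngrams(sentence, 5)[:3])
```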

scikit-learn provides a way to extract these n-gram features; let's check how it works, and what the model learns. The code looks almost the same as before; the only change is added "ngram_range" and "analyzer='char'" TfidfVectorizer arguments:

    stop_words = ENGLISH_STOP_WORDS.copy() | {
        'weren', 'don', 'isn', 'couldn', 'wasn'
    }
    vec = TfidfVectorizer(stop_words=stop_words,
                          ngram_range=(3, 5), analyzer='char')
    clf = SGDClassifier(loss='hinge', penalty='l2',
                        alpha=1e-3, n_iter=5, random_state=42)
    text_clf = Pipeline([
        ('vect', vec),
        ('clf', clf),
    ]).fit(twenty_train.data, twenty_train.target)
    text_clf.score(twenty_test.data, twenty_test.target)

Score became worse (0.792), so probably word-based approach works better. This is what coefficients look like:

You can see word chunks, but overall parameters are less readable and inspectable. Prediction:

N-grams overlap; individual characters are highlighted according to the weights of all n-grams they belong to. It is now clearer which parts of the text are important, though char n-grams seem to make it all a bit noisier.

By the way, haven't we removed stop words already? Why are they still highlighted? We're passing the stop_words argument to TfidfVectorizer as before, but it seems this argument does nothing now. And it indeed does nothing - scikit-learn ignores stop words when using char n-grams; this is documented, but still easy to miss.

So maybe char n-grams are not that much worse than words for this data - accuracy of a word-based model without stop words removal is similar (0.796 instead of 0.792). It could be the case that after removing stop words and tuning SGDClassifier parameters (the optimal ones are likely different for char-based features) we can get a similar or better quality. We still don't know if this is true, but at least after the inspection we've got some starting points.

Why Look Inside?

In this article we followed scikit-learn text processing tutorial and built upon it, but in addition to using common sense and checking validation scores we looked inside the classifier using eli5 library. As a result, we:

- found an issue with training data and an issue with task specification - model was using author names and emails instead of classifying messages based only on their content;
- found a bug in scikit-learn - its default stop words list doesn't contain the first parts of contractions (tokens like "don" or "weren");
- found a possible bug in our code - "stop_words" argument of TfidfVectorizer does nothing if "analyzer='char'" is used, which is easy to miss;
- got better understanding of how the whole text processing pipeline works.

scikit-learn docs and tutorials are top-notch; they can easily be the best among all ML software library docs and tutorials, and our findings don't change that. Such problems are common in the real world: small processing bugs, misunderstandings; there were similar data issues in every single real-world project I've worked on. Of course, you can't detect and fix all problems by looking inside models and their predictions, but with eli5 at least you have better chances of spotting such problems.

We've been using these techniques for many projects: model inspection is a part of data science work in our team, be it classification tasks, Named Entity Recognition or adaptive crawlers based on Reinforcement Learning. Explanations are not only useful for developers, they are helpful for users of your system as well - users get a better understanding of how a system works, and can either trust it more, or become aware of its limitations - see this blog post from our friends Hyperion Gray for a practical example.

The eli5 library is not limited to linear classifiers and text data; it supports several ML frameworks (scikit-learn, xgboost, LightGBM, etc.) and implements several model explanation methods, both model-specific and model-agnostic. The library is improving; we're trying to make the most proven explanation methods available in eli5. But even with all its features it barely scratches the surface; there is a lot of research going on, and it is exciting to see how this field develops.