One of the most crucial aspects of machine learning is understanding the mathematics & statistics behind it. In my journey to becoming a data scientist, I wanted to master not only the theoretical aspects of math & stats but also understand how I could apply them to my area of work.

There is an ever-increasing number of machine learning algorithms, and this post focuses on one of my favorites: the Naive Bayes algorithm. I’m going to break this exploration into two parts. The first part broadly covers the Naive Bayes algorithm and how it can be applied to text classification. The second part focuses on building a REST API from the model we create in Part I. So stay tuned and enjoy!

The objectives of this blog post are to:

Understand Bayesian methods and how they work.

Perform text classification with the Naive Bayes algorithm.

Evaluate the model’s performance.

What are Bayesian Methods?

Since the algorithm we’re going to use in this post is Naive Bayes, it makes sense to talk about the Bayesian methods underlying the algorithm itself.

Bayesian methods are methods of statistical inference built on the idea that an individual’s prior beliefs about an event can be updated when new evidence or data becomes available.

Bayesian statistics provides a firm mathematical procedure for combining prior beliefs with new evidence to produce updated (posterior) beliefs. This is in sharp contrast to frequentist statistics, which treats a probability as the long-run frequency of a random event over many repeated trials.

Now that we’ve got a basic introduction to Bayesian statistics, let’s see how we can use this to derive the popular Bayes’ theorem. If you already know how to do this, feel free to skip this part.

Bayes’ Theorem

Bayes’ theorem is a powerful theorem that’s changed the way we think about probabilities. As complex as it may sound, it’s simply based on conditional probabilities.

From conditional probability, we know that the probability of A given B, mathematically denoted by P(A|B), is equal to the probability of A’s intersection with B divided by the probability of B:

P(A|B) = P(A ∩ B) / P(B) (equation 1)

Similarly:

P(B|A) = P(B ∩ A) / P(A) (equation 2)

From equation 1 we can derive:

P(A ∩ B) = P(A|B) × P(B) (equation 3)

Similarly, from equation 2 we can derive:

P(B ∩ A) = P(B|A) × P(A) (equation 4)

Since P(A ∩ B) = P(B ∩ A), the right-hand sides of equation 3 and equation 4 are equal, so P(A|B) × P(B) = P(B|A) × P(A). Dividing both sides by P(B) yields Bayes’ theorem:

P(A|B) = P(B|A) × P(A) / P(B)
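To make the derivation concrete, here is a minimal sketch that checks Bayes’ theorem numerically. The counts are made up purely for illustration (they are not from any real dataset), and the variable names are my own. It computes P(A|B) directly from the definition of conditional probability and again through Bayes’ theorem, and the two values should agree up to floating-point rounding.

```python
import math

# Hypothetical counts for two events A and B over 100 trials (illustrative only).
n_total = 100
n_a = 30          # trials where A occurred
n_b = 40          # trials where B occurred
n_a_and_b = 12    # trials where both A and B occurred

p_a = n_a / n_total
p_b = n_b / n_total
p_a_and_b = n_a_and_b / n_total

# Conditional probabilities straight from the definition (equations 1 and 2).
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Bayes' theorem: recover P(A|B) from P(B|A), P(A), and P(B).
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print(p_a_given_b, p_a_given_b_bayes)
print(math.isclose(p_a_given_b, p_a_given_b_bayes))
```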

Now that we’ve looked at the underlying Bayes’ Theorem, let’s briefly discuss how this works in the Naive Bayes algorithm specifically.

At this point, you might be asking, why the name Naive Bayes? What actually makes Naive Bayes…Naive? A Naive Bayes classifier works by asking “Given a set of features, which class does a measurement belong to (say, class 1 or class 0)?”

Naive Bayes is naive because it assumes that the features in a given measurement are independent of one another given the class, which is rarely true in practice. This assumption is strong but extremely useful. It keeps the model simple, which helps it work reasonably well with little data or with data that may be partly mislabeled. Additional advantages of the Naive Bayes algorithm are that it’s very fast and scalable.

From Bayes’ theorem, we can replace A and B with X and y, representing the feature matrix and the response vector respectively, and this would yield:

P(X|y) = P(y|X)*P(X)/P(y)

Which can be re-arranged and extended to:

P(y=k |X) = P(X|y = k)*P(y = k)/P(X)

This can be interpreted as follows: the probability that the target belongs to class k given the features X equals the probability of observing X given class k, times the prior probability of class k, divided by the overall probability of observing X.

And of course, we know in real life there would be multiple features, so we can extend Bayes’ theorem “naively”, assuming the features are independent of one another given the class, into:

P(y=k|X1..Xn) = P(X1|y=k)*P(X2|y=k)*P(X3|y=k)*..*P(Xn|y=k)*P(y=k)/P(X1..Xn)
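The formula above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the class labels, priors, and per-word likelihoods are invented numbers, and `naive_bayes_posterior` is a hypothetical helper, not part of any library. The denominator is computed by summing the numerators over all classes, which is one common way to obtain the probability of the observed features.

```python
import math

# Hypothetical class priors P(y=k) and per-word likelihoods P(x_i | y=k).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.05},
    "ham": {"free": 0.02, "meeting": 0.20},
}


def naive_bayes_posterior(words, priors, likelihoods):
    """Return P(y=k | words) for every class k under the naive assumption."""
    unnormalized = {}
    for label, prior in priors.items():
        # Numerator: P(y=k) times the product of P(x_i | y=k) over the words.
        unnormalized[label] = prior * math.prod(
            likelihoods[label][w] for w in words
        )
    # Denominator: P(x_1..x_n), the sum of the numerators over all classes.
    evidence = sum(unnormalized.values())
    return {label: score / evidence for label, score in unnormalized.items()}


posterior = naive_bayes_posterior(["free", "meeting"], priors, likelihoods)
predicted = max(posterior, key=posterior.get)  # class with highest posterior
print(posterior, predicted)
```

Because the denominator is the same for every class, a classifier that only needs the most likely class can skip it and compare the numerators directly. Real implementations also tend to sum log-probabilities rather than multiply raw probabilities, to avoid numerical underflow when there are many features.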

The above should serve as enough of an intro to the Naive Bayes algorithm for us to get started. Now it’s time to work with our hands-on example.