In this post, we’re going to dive deep into one of the simplest and most popular machine learning classification algorithms: the Naive Bayes algorithm, which is based on Bayes Theorem for calculating conditional probabilities.

Before we jump into the Naive Bayes classifier itself, we need to understand the fundamentals of Bayes Theorem, on which it is based.

Bayes Theorem

Given a feature vector $X = (x_1, x_2, \ldots, x_n)$ and a class variable $C_k$, Bayes Theorem states that:

$$P(C_k \mid X) = \frac{P(X \mid C_k)\,P(C_k)}{P(X)}, \qquad k = 1, 2, \ldots, K$$

We call $P(C_k \mid X)$ the posterior probability, $P(X \mid C_k)$ the likelihood, $P(C_k)$ the prior probability of the class, and $P(X)$ the prior probability of the predictor (the evidence). We’re interested in calculating the posterior probability from the likelihood and the prior probabilities.
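To make the theorem concrete, here is a small numeric sketch in Python for a hypothetical spam-filtering example; all of the prior and likelihood values below are made-up numbers chosen purely for illustration:

```python
# Hypothetical numbers: 30% of emails are spam (prior), and the word
# "free" appears in 60% of spam and 10% of non-spam emails (likelihoods).
p_spam = 0.3                    # P(C_spam), class prior
p_not_spam = 0.7                # P(C_not_spam), class prior
p_free_given_spam = 0.6         # P(X | C_spam), likelihood
p_free_given_not_spam = 0.1     # P(X | C_not_spam), likelihood

# P(X): prior probability of the predictor, via total probability
p_free = p_free_given_spam * p_spam + p_free_given_not_spam * p_not_spam

# Bayes Theorem: P(C_spam | X) = P(X | C_spam) * P(C_spam) / P(X)
posterior_spam = p_free_given_spam * p_spam / p_free
print(posterior_spam)  # ≈ 0.72
```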

Using the chain rule, the likelihood $P(X \mid C_k)$ can be decomposed as:

$$P(X \mid C_k) = P(x_1, x_2, \ldots, x_n \mid C_k) = P(x_1 \mid x_2, \ldots, x_n, C_k)\, P(x_2 \mid x_3, \ldots, x_n, C_k) \cdots P(x_{n-1} \mid x_n, C_k)\, P(x_n \mid C_k)$$

The term “Naive” and feature independence

Estimating every term in this expansion quickly becomes intractable, because each conditional probability depends on all of the remaining features and must be estimated for every combination of their values. Fortunately, the naive conditional independence assumption sidesteps this: we assume each feature is conditionally independent of every other feature, given the class.

In machine-learning terms, this means we treat the features as independent of one another, with no feature’s value affecting another’s. This rarely happens in real life: features typically do depend on one another’s occurrence or value. The Naive Bayes classifier simply ignores these dependencies, and is hence given the term “naive”.

Thus, by conditional independence, each term in the expansion collapses to $P(x_i \mid x_{i+1}, \ldots, x_n, C_k) = P(x_i \mid C_k)$, and the likelihood factorizes as:

$$P(X \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)$$
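To see the factorization in action, here is a tiny sketch with hypothetical per-feature likelihoods for a single class; under the independence assumption the joint likelihood is just their product:

```python
import numpy as np

# Hypothetical per-feature likelihoods P(x_i | C_k) for one class C_k
per_feature = np.array([0.6, 0.8, 0.3])

# Naive independence: P(X | C_k) is the product of the per-feature terms
joint_likelihood = np.prod(per_feature)
print(joint_likelihood)  # ≈ 0.144
```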

And the posterior probability can then be written as:

$$P(C_k \mid X) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$

since the denominator $P(X)$ remains constant for all classes.
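Putting the pieces together, below is a minimal Bernoulli-style Naive Bayes sketch on a made-up binary dataset. The data, the Laplace smoothing (added here to avoid zero probabilities, and not discussed above), and the `predict` helper are all illustrative assumptions rather than a reference implementation:

```python
import numpy as np

# Made-up toy dataset: rows are samples, columns are binary features
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])
y = np.array([1, 1, 1, 0, 0])  # class labels

classes = np.unique(y)

# Priors P(C_k): relative class frequencies
priors = {k: np.mean(y == k) for k in classes}

# Likelihoods P(x_i = 1 | C_k), with Laplace smoothing (an assumption
# made here to avoid zero probabilities for unseen feature values)
likelihoods = {k: (X[y == k].sum(axis=0) + 1) / (np.sum(y == k) + 2)
               for k in classes}

def predict(x):
    """Return the class maximizing P(C_k) * prod_i P(x_i | C_k)."""
    scores = {}
    for k in classes:
        p = likelihoods[k]
        # Use P(x_i = 1 | C_k) where x_i == 1, else 1 - P(x_i = 1 | C_k)
        feature_probs = np.where(x == 1, p, 1 - p)
        scores[k] = priors[k] * np.prod(feature_probs)
    return max(scores, key=scores.get)

print(predict(np.array([1, 0, 1])))  # predicts class 1
```

In practice, implementations usually sum log-probabilities instead of multiplying raw probabilities, since a product of many small terms underflows floating-point precision as the number of features grows.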