Naive Bayes is a machine learning algorithm for classification problems. It is based on Bayes’ probability theorem. It is primarily used for text classification which involves high dimensional training data sets. A few examples are spam filtration, sentimental analysis, and classifying news articles.

It is not only known for its simplicity, but also for its effectiveness. It is fast to build models and make predictions with Naive Bayes algorithm. Naive Bayes is the first algorithm that should be considered for solving text classification problem. Hence, you should learn this algorithm thoroughly.

Table of Contents

Basics of Naive Bayes The mathematics of the Naive Bayes Variations of Naive Bayes Advantages and Disadvantages Python and R implementation Applications of Naive Bayes

What is Naive Bayes algorithm?

Naive Bayes algorithm is the algorithm that learns the probability of an object with certain features belonging to a particular group/class. In short, it is a probabilistic classifier. You must be wondering why is it called so?

The Naive Bayes algorithm is called “naive” because it makes the assumption that the occurrence of a certain feature is independent of the occurrence of other features.

For instance, if you are trying to identify a fruit based on its color, shape, and taste, then an orange colored, spherical, and tangy fruit would most likely be an orange. Even if these features depend on each other or on the presence of the other features, all of these properties individually contribute to the probability that this fruit is an orange and that is why it is known as “naive.”

As for the “Bayes” part, it refers to the statistician and philosopher, Thomas Bayes and the theorem named after him, Bayes’ theorem, which is the base for Naive Bayes Algorithm.

The Mathematics of the Naive Bayes Algorithm

As already said, the basis of Naive Bayes algorithm is Bayes’ theorem or alternatively known as Bayes’ rule or Bayes’ law. It gives us a method to calculate the conditional probability, i.e., the probability of an event based on previous knowledge available on the events. More formally, Bayes’ Theorem is stated as the following equation:

\(\displaystyle P(A|B)=\frac{P(B|A)P(A)}{P(B)}\)

Let us understand the statement first and then we will look at the proof of the statement. The components of the above statement are:

\(P(A|B)\): Probability (conditional probability) of occurrence of event \(A\) given the event \(B\) is true

\(P(A)\) and \(P(B)\): Probabilities of the occurrence of event \(A\) and \(B\) respectively

\(P(B|A)\): Probability of the occurrence of event \(B\) given the event \(A\) is true

The terminology in the Bayesian method of probability (more commonly used) is as follows:

\(A\) is called the proposition and \(B\) is called the evidence.

and \(B\) is called the \(P(A)\) is called the prior probability of proposition and \(P(B)\) is called the prior probability of evidence.

probability of proposition and \(P(B)\) is called the probability of evidence. \(P(A|B)\) is called the posterior.

\(P(B|A)\) is the likelihood.

This sums the Bayes’ theorem as

\(\mbox{Posterior}=\frac{\mbox{(Likelihood)}.\mbox{(Proposition prior probability)}}{\mbox{Evidence prior probability}}\)

Let us take an example to better understand Bayes’ theorem.

Suppose you have to draw a single card from a standard deck of 52 cards. Now the probability that the card is a Queen is \(P\left(\mbox{Queen}\right)=\frac{4}{52}=\frac{1}{13}\). If you are given evidence that the card that you have picked is a face card, the posterior probability \(P(\mbox{Queen}|\mbox{Face})\) can be calculated using Bayes’ Theorem as follows:

\(\displaystyle P(\mbox{Queen}|\mbox{Face})=\frac{P(\mbox{Face}|\mbox{Queen})}{P(\mbox{Face})}.P(\mbox{Queen})\)

Now \(P(\mbox{Face}|\mbox{Queen})=1\) because given the card is Queen, it is definitely a face card. We have already calculated \(P\left(\mbox{Queen}\right)\). The only value left to calculate is \(P\left(\mbox{Face}\right)\), which is equal to \(\frac{3}{13}\) as there are three face cards for every suit in a deck. Therefore,

\(\displaystyle P(\mbox{Queen}|\mbox{Face})=\frac{1}{13}.\frac{13}{3}=\frac{1}{3}\)

Derivation of Bayes’ Theorem

For a joint probability distribution of two events \(A\) and \(B\), \(P(A\cap B)\), the conditional probability,

\(\displaystyle P(A|B)=\frac{P(A\cap B)}{P(B)}\)

Similarly,

\(\displaystyle P(B|A)=\frac{P(B\cap A)}{P(A)}\)

Therefore,

\(\displaystyle P(B|A).P(A)=P(A|B).P(B)\implies P(A|B)=\frac{P(B|A).P(A)}{P(B)}\)

Bayes’ Theorem for Naive Bayes Algorithm

In a machine learning classification problem, there are multiple features and classes, say, \(C_1, C_2, \ldots, C_k \). The main aim in the Naive Bayes algorithm is to calculate the conditional probability of an object with a feature vector \(x_1, x_2,\ldots, x_n\) belongs to a particular class \(C_i\),

\(\displaystyle P(C_i|x_1, x_2,\ldots, x_n)=\frac{P(x_1, x_2,\ldots, x_n|C_i).P(C_i)}{P(x_1, x_2,\ldots, x_n)}\) for \(1\leq i\leq k\)

Now, the numerator of the fraction on right-hand side of the equation above is \(\displaystyle P(x_1, x_2,\ldots, x_n|C_i).P(C_i)=P(x_1, x_2,\ldots, x_n, C_i)\)

The conditional probability term, \(P(x_j|x_{j+1},\ldots, x_n, C_i)\) becomes \(P(x_j|C_i)\) because of the assumption that features are independent.

From the calculation above and the independence assumption, the Bayes theorem boils down to the following easy expression:

\(\displaystyle P(C_i|x_1, x_2,\ldots, x_n)=\left(\prod_{j=1}^{j=n}P(x_j|C_i)\right).\frac{P(C_i)}{P(x_1, x_2,\ldots, x_n)}\) for \(1\leq i\leq k\)

The expression \(P(x_1, x_2,\ldots, x_n)\) is constant for all the classes, we can simply say that

\(\displaystyle P(C_i|x_1, x_2,\ldots, x_n)\propto\left(\prod_{j=1}^{j=n}P(x_j|C_i)\right).P(C_i)\) for \(1\leq i\leq k\)

How does the Naive Bayes Algorithm work?

So far, we learned what the Naive Bayes algorithm is, how the Bayes theorem is related to it, and what the expression of the Bayes’ theorem for this algorithm is. Let us take a simple example to understand the functionality of the algorithm. Suppose, we have a training data set of 1200 fruits. The features in the data set are these: is the fruit yellow or not, is the fruit long or not, and is the fruit sweet or not. There are three different classes: mango, banana, and others.

Step 1: Create a frequency table for all the features against the different classes.

Name Yellow Sweet Long Total Mango 350 450 0 650 Banana 400 300 350 400 Others 50 100 50 150 Total 800 850 400 1200

What can we conclude from the above table?

Out of 1200 fruits, 650 are mangoes, 400 are bananas, and 150 are others.

350 of the total 650 mangoes are yellow and the rest are not and so on.

800 fruits are yellow, 850 are sweet and 400 are long from a total of 1200 fruits.

Let’s say you are given with a fruit which is yellow, sweet, and long and you have to check the class to which it belongs.

Step 2: Draw the likelihood table for the features against the classes.

Name Yellow Sweet Long Total Mango 350/800=P(Mango|Yellow) 450/850 0/400 650/1200=P(Mango) Banana 400/800 300/850 350/400 400/1200 Others 50/800 100/850 50/400 150/1200 Total 800=P(Yellow) 850 400 1200

Step 3: Calculate the conditional probabilities for all the classes, i.e., the following in our example:

Step 4: Calculate \(\displaystyle\max_{i}{P(C_i|x_1, x_2,\ldots, x_n)}\). In our example, the maximum probability is for the class banana, therefore, the fruit which is long, sweet and yellow is a banana by Naive Bayes Algorithm.

In a nutshell, we say that a new element will belong to the class which will have the maximum conditional probability described above.

Variations of the Naive Bayes algorithm

There are multiple variations of the Naive Bayes algorithm depending on the distribution of \(P(x_j|C_i)\). Three of the commonly used variations are

Gaussian: The Gaussian Naive Bayes algorithm assumes distribution of features to be Gaussian or normal, i.e.,

\(\displaystyle P(x_j|C_i)=\frac{1}{\sqrt{2\pi\sigma_{C_i}^2}}\exp{\left(-\frac{(x_j-\mu_{C_j})^2}{2\sigma_{C_i}^2}\right)}\)

Read more about it here. Multinomial: The Multinomial Naive Bayes algorithm is used when the data is distributed multinomially, i.e., multiple occurrences matter a lot. You can read more here. Bernoulli: The Bernoulli algorithm is used when the features in the data set are binary-valued. It is helpful in spam filtration and adult content detection techniques. For more details, click here.

Pros and Cons of Naive Bayes algorithm

Every coin has two sides. So does the Naive Bayes algorithm. It has advantages as well as disadvantages, and they are listed below:

Pros

It is a relatively easy algorithm to build and understand.

It is faster to predict classes using this algorithm than many other classification algorithms.

It can be easily trained using a small data set.

Cons

If a given class and a feature have 0 frequency, then the conditional probability estimate for that category will come out as 0. This problem is known as the “Zero Conditional Probability Problem.” This is a problem because it wipes out all the information in other probabilities too. There are several sample correction techniques to fix this problem such as “Laplacian Correction.”

Another disadvantage is the very strong assumption of independence class features that it makes. It is near to impossible to find such data sets in real life.

Naive Bayes with Python and R

Let us see how we can build the basic model using the Naive Bayes algorithm in R and in Python.

R Code

To start training a Naive Bayes classifier in R, we need to load the e1071 package.

library(e1071)

To split the data set into training and test data we will use the caTools package.

library(caTools)

The predefined function used for the implementation of Naive Bayes in R is called naiveBayes(). There are only a few parameters that are of use:

naiveBayes(formula, data, laplace = 0, subset, na.action = na.pass)

formula : The traditional formula \(Y\sim X_1+X_2+\ldots+X_n\)

: The traditional formula \(Y\sim X_1+X_2+\ldots+X_n\) data : The data frame containing numeric or factor variables

: The data frame containing numeric or factor variables laplace : Provides a smoothing effect

: Provides a smoothing effect subset : Helps in using only a selection subset of the data based on some Boolean filter

: Helps in using only a selection subset of the data based on some Boolean filter na.action : Helps in determining what is to be done when a missing value in the data set is encountered

Let us take the example of the iris data set.

> library(e1071) > library(caTools) > data(iris) > iris$spl=sample.split(iris,SplitRatio=0.7) # By using the sample.split() we are creating a vector with values TRUE and FALSE and by setting the SplitRatio to 0.7, we are splitting the original Iris dataset of 150 rows to 70% training and 30% testing data. > train=subset(iris, iris$spl==TRUE)#the subset of iris dataset for which spl==TRUE > test=subset(iris, iris$spl==FALSE) > nB_model <- naiveBayes(train[,1:4], train[,5]) > table(predict(nB_model, test[,-5]), test[,5]) #returns the confusion matrix setosa versicolor virginica setosa 17 0 0 versicolor 0 17 2 virginica 0 0 14

Python Code

We will use the Python library scikit-learn to build the Naive Bayes algorithm.

>>> from sklearn.naive_bayes import GaussianNB >>> from sklearn.naive_bayes import MultinomialNB >>> from sklearn import datasets >>> from sklearn.metrics import confusion_matrix >>> from sklearn.model_selection import train_test_split >>> iris = datasets.load_iris() >>> X = iris.data >>> y = iris.target # Split the data into a training set and a test set >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) >>> gnb = GaussianNB() >>> mnb = MultinomialNB() >>> y_pred_gnb = gnb.fit(X_train, y_train).predict(X_test) >>> cnf_matrix_gnb = confusion_matrix(y_test, y_pred_gnb) >>> print(cnf_matrix_gnb) [[16 0 0] [ 0 18 0] [ 0 0 11]] >>> y_pred_mnb = mnb.fit(X_train, y_train).predict(X_test) >>> cnf_matrix_mnb = confusion_matrix(y_test, y_pred_mnb) >>> print(cnf_matrix_mnb) [[16 0 0] [ 0 0 18] [ 0 0 11]]

Applications

The Naive Bayes algorithm is used in multiple real-life scenarios such as

Text classification: It is used as a probabilistic learning method for text classification. The Naive Bayes classifier is one of the most successful known algorithms when it comes to the classification of text documents, i.e., whether a text document belongs to one or more categories (classes). Spam filtration: It is an example of text classification. This has become a popular mechanism to distinguish spam email from legitimate email. Several modern email services implement Bayesian spam filtering.

Many server-side email filters, such as DSPAM, SpamBayes, SpamAssassin, Bogofilter, and ASSP, use this technique. Sentiment Analysis: It can be used to analyze the tone of tweets, comments, and reviews—whether they are negative, positive or neutral. Recommendation System: The Naive Bayes algorithm in combination with collaborative filtering is used to build hybrid recommendation systems which help in predicting if a user would like a given resource or not.

Conclusion

This article is a simple explanation of the Naive Bayes Classification algorithm, with an easy-to-understand example and a few technicalities.

Despite all the complicated math, the implementation of the Naive Bayes algorithm involves simply counting the number of objects with specific features and classes. Once these numbers are obtained, it is very simple to calculate probabilities and arrive at a conclusion.

Hope you are now familiar with this machine learning concept you most like would have heard of before.