
Comments are one of the most toxic things on the internet, but there are also some positive ones, right? So how about creating a sentiment analyzer based on text classification using artificial intelligence?

Let’s check it out.

The question we’re going to try to answer is: does a particular comment express a positive sentiment or a negative one?

Maybe you are not familiar with AI and machine learning concepts, so let me explain, in an extremely summarized way, how it works.

Machine learning is relatively similar to our own learning process.

How can you recognize if an animal is a cat or a dog?

I know it seems weird to think about, but it’ll make sense. The first time you saw that animal, someone told you: this is a cat.


Then your brain associated some characteristics of that animal, even without you noticing it.

After being exposed to different types of cats, you’ll be able to recognize any cat, including abstractions like a drawn cat.


Therefore, it’s pretty fair to say that our experiences shape the way we see things. An artificial intelligence model can also have those experiences, but instead of lived events, we’re going to use data.

So, our first step is to collect data about comments. Fortunately, there are several open data sets provided by companies. We’re going to use three different data sets: IMDb, Amazon, and Yelp.

You can access them in my GitHub or in my Azure Notebooks project, which you can follow along with.

Now it’s time to create the environment to develop your solution. Azure Notebooks will work fine for this task.

After creating your account, you’ll be able to create a new project.

Azure Notebooks projects work like a GitHub repository in some respects: they can be public or private, and there’s a README.md file with the exact same purpose.

The new project window is pretty self-explanatory and looks like the image below:

Now you are in the project environment, where you can create new folders, files, and notebooks. As I said before, it’s like a repository, so you can see the Star and Clone buttons.

Let’s create a folder called Data and upload our data sets, like in the image below:

Planning an AI algorithm solution

In order to solve this problem, we’ll implement a classification algorithm, using a Python lib called Scikit-learn to make it easier. This lib contains implementations of several AI algorithms, including some for classification.

You can split AI solutions in three different steps:

1. Pre-processing;
2. Training;
3. Evaluation.

The first step is about collecting, cleaning, and organizing your data. Fortunately, the selected data sets are already pretty clean.

The second step is about using your cleaned data sets to train your model through examples. By doing that, you’ll generate a classifier model.

This model contains a predict function which can receive any new comment and classify it as positive or negative.

The last step is when you validate whether your model is good enough to solve the proposed problem. There’s no step-by-step recipe for validation, because it depends entirely on the problem at hand.

Let’s code our algorithm

Before all the steps mentioned above, you need to create the notebook itself: just click the add button and select the first option, Notebook.

The New Notebook window is also pretty simple; for us, it’s just a matter of choosing a name and selecting Python 3.6 as the notebook language.

Let’s start with the imports needed to create this solution:
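The original import cell isn’t reproduced here; a minimal set of imports covering everything used in this article (assuming scikit-learn, pandas, and matplotlib are installed) would look like this:

```python
# Vectorizer and Naive Bayes classifier from scikit-learn,
# plus pandas and matplotlib for the visualizations later on.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
import pandas as pd
import matplotlib.pyplot as plt
```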

In Azure Notebooks you can mix markdown headers and code to keep things organized:

Now we can code a function that will open all data sets and merge them into one. It’s pretty simple with Python:

Notice that the root variable contains the folder name of the data sets, which is Data in my case, but feel free to use any folder name you want.

Another interesting thing here is the use of the line-break escape ( \n ) to split our data into an array.

Another cool thing about using a notebook to store your code is that you can interact with it; for instance, we can already call the get_all_data function and see the result:

Speaking of the get_all_data result, let’s check some data of our data sets:

Wow... Loved this place. 1

Wasted two hours. 0

It’s pretty fair to assume that the first one is a positive comment about a place, maybe a restaurant, while the second is a negative one about a movie that wasted two hours.

But let’s pay attention to the data structure: there’s a text comment, followed by a tab ( \t ) and a number that represents the actual classification of each comment.

We’ll split each line by the tab character ( \t ); this way we create an array where position 0 contains the text and position 1 contains the classification. It’ll make our work easier later.
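The splitting cell isn’t shown; a sketch of the idea, with the helper name preprocessing_data chosen here for illustration, might be:

```python
def preprocessing_data(data):
    processing_data = []
    for single_data in data:
        # Keep only well-formed "sentence<TAB>label" lines,
        # splitting each one into [text, classification]
        parts = single_data.split("\t")
        if len(parts) == 2 and parts[1] != "":
            processing_data.append(parts)
    return processing_data
```

For example, an input line like "Wow... Loved this place.\t1" becomes ["Wow... Loved this place.", "1"], while empty lines are discarded.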

As I said before, we can run each function individually, so let’s check out the result:

The last task to finish the first step is to split the data set into two different sets, one for each of the next process: training and evaluation.

We need a ratio to split the data; usually it varies between 70–80% for training and 30–20% for evaluation. In this particular case, we’ll use 75% for training, so let’s split it:
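That split can be sketched in a few lines (split_data is a name of my choosing):

```python
def split_data(data):
    # Ratio of data used for training; tweak this value to experiment
    training_ratio = 0.75
    total = len(data)
    training_data = data[: int(total * training_ratio)]
    evaluation_data = data[int(total * training_ratio):]
    return training_data, evaluation_data
```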

The code above splits the data based on the training_ratio variable, so after we finish the experiment you can come back here, change this value, and see how it performs.

Now there are two different data sets and we are almost ready for the next step: the training process.

Before doing that, let’s simplify the whole first step into a single function:

A little step back to understand the theory

Before we implement the training process, it’s important to understand how it works under the hood, right? Without that, it’d just be a magic solution implemented by an external lib.

We need to identify a sentiment based on text. How can we do that? The answer is: term frequency.

In order to implement it, we’ll first need to create a list of all the words known by our algorithm. Let’s use a smaller version of our data set and imagine that all the words known by our model are:

hello, this, is, a, good, list, for, test

Now we need to input some text containing these words. Let’s try:

this is a good test

The process here is pretty simple: we create a new list by replacing each known word with the number of times it appears in the input, like in the image below:
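That counting idea can be reproduced in a couple of lines of Python (this is just the concept, not the library implementation we’ll use later):

```python
vocabulary = ["hello", "this", "is", "a", "good", "list", "for", "test"]
sentence = "this is a good test"

# Replace each known word with the number of times it appears in the input
counts = [sentence.split().count(word) for word in vocabulary]
print(counts)  # [0, 1, 1, 1, 1, 0, 0, 1]
```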

This works like a conversion from text to tangible math values, which are far easier to work with. Let’s run a test with the word good.

This word sounds like a positive one, right? But how can our algorithm know that?

It’s just simple math: our model counts how many times the word good appears in positive sentences and divides that number by how many times the word appears at all. That ratio is the positive score for this word.

This process is called vectorization.

To calculate the negative score, you can simply use 1 - Positive Score.
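As a toy example with made-up counts (the real counts would be measured from the labelled training sentences):

```python
# Hypothetical counts for the word "good"; real values would come
# from counting its occurrences in the labelled training data
times_in_positive_sentences = 9
times_in_all_sentences = 10

positive_score = times_in_positive_sentences / times_in_all_sentences
negative_score = 1 - positive_score
print(positive_score)  # 0.9
```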

After these two scores are calculated, the Naive Bayes algorithm will use them to calculate the sentence score.

This algorithm evaluates each word separately, without any context; that’s why it has naive in its name.

That’s enough theory for now, let’s get back to the code!

Coding the training process

We don’t need to worry about all the details mentioned above; we can assume that we’ll receive a vectorizer capable of converting our text to numeric values.

So let’s split our training data set into two different lists, one for the sentences and another for the results.

Then let’s vectorize the sentences using the fit_transform function, and finally return the result of a BernoulliNB().fit function call.
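Putting those steps together, a sketch of the training step (the function name training_step is my choice) could be:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

def training_step(data, vectorizer):
    # Separate the sentences from their expected results
    training_text = [sample[0] for sample in data]
    training_result = [sample[1] for sample in data]
    # Vectorize the text and fit the Naive Bayes classifier on it
    training_text = vectorizer.fit_transform(training_text)
    return BernoulliNB().fit(training_text, training_result)
```

Here, vectorizer would typically be a CountVectorizer(binary=True), since BernoulliNB works on binary features.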

BernoulliNB (NB for Naive Bayes) will generate our classification model. Now we can use the model’s predict function to try to predict the sentiment of any text. Let’s try an example:

Let’s see how it works in our Notebook:

We’ve tested “I love this movie!”, a positive sentence, and the prediction result was ‘1’, which, as we’ve seen, represents a positive sentiment, so the result was predicted correctly.

It’s working, but let’s improve the result visualization with two different functions:
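Those two cells aren’t shown here; one plausible sketch (the names analyse_text and print_result are my own) is:

```python
def analyse_text(classifier, vectorizer, text):
    # Return the text together with the model's prediction for it
    return text, classifier.predict(vectorizer.transform([text]))

def print_result(result):
    # Translate the '1'/'0' labels into readable words
    text, analysis_result = result
    print_text = "Positive" if analysis_result[0] == "1" else "Negative"
    print(text, ":", print_text)
```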

Now it’s a lot easier to visualize our tests:

Cool, the work is done, right? — Wrong!

We need to evaluate how our model performs. The most intuitive way to do that is to calculate the percentage of right answers, so let’s do it.

Evaluation

It’s time to use the previously created evaluation data set. Our first validation consists of checking our predictions against the actual answers and counting the right ones.

Let me explain: we already know the right answer for each sentence in the evaluation data set. Do you remember the data structure? A sentence and the classification result, separated by a \t .

So, we’re going to iterate through all the data, using our model to predict the sentiment of each sentence, and then compare the predicted result against the actual result in the data set.
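A sketch of that accuracy check (simple_evaluation is a name I picked for illustration):

```python
def simple_evaluation(evaluation_data, classifier, vectorizer):
    evaluation_text = [sample[0] for sample in evaluation_data]
    actual_results = [sample[1] for sample in evaluation_data]
    # Predict every sentence at once, then count the matches
    predictions = classifier.predict(vectorizer.transform(evaluation_text))
    corrects = sum(
        1 for predicted, actual in zip(predictions, actual_results)
        if predicted == actual
    )
    # Percentage of right answers
    return corrects * 100 / len(evaluation_data)
```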

Let’s check the result:

82% looks like a good result, but how can we get more information about our model?

The Confusion Matrix

Fortunately, there are many more techniques for getting information about prediction models. We’re going to use the confusion matrix.

We can generate this matrix ourselves or by using a function from sklearn.metrics . We’re going to use the lib implementation, because there’s no secret to generating a confusion matrix; it’s actually quite easy.

But before we start, what exactly is a confusion matrix?

The concept is pretty simple: we store in a single matrix every combination of expected results versus predicted results.

In our particular case, this matrix will contain four different values:

Negative predictions when the actual result is negative;

Negative predictions when the actual result is positive;

Positive predictions when the actual result is negative;

Positive predictions when the actual result is positive;

Two of these values indicate right predictions and the other two indicate wrong ones. This matrix is extremely useful because, in several cases, producing a false negative is not as harmful as producing a false positive, or vice versa.

So, whenever some kinds of mistakes are more acceptable than others, you’ll need to check where your model is making its mistakes.

Let’s create the confusion matrix:
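With sklearn.metrics this is a single call; the label lists here are a small illustrative sample, not the article’s data:

```python
from sklearn.metrics import confusion_matrix

# Small illustrative sample: actual labels vs. predicted labels
actual_results = ["0", "0", "1", "1", "1"]
predictions = ["0", "1", "1", "1", "0"]

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(actual_results, predictions)
print(matrix)
```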

Now, we can use pandas Data Frame to improve the visualization:
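A sketch of that wrapping, using the counts reported later in this article:

```python
import pandas as pd

# Confusion matrix values reported at the end of the article
matrix = [[322, 64],
          [71, 293]]

frame = pd.DataFrame(
    matrix,
    index=["Actual Negative", "Actual Positive"],
    columns=["Predicted Negative", "Predicted Positive"],
)
print(frame)
```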

This code generates a pretty cool matrix, like the image below:

But that’s not cool enough; we can use the famous matplotlib.pyplot lib to create an even better visualization.

This code is not so important, because it isn’t directly related to our solution; it’s just a way to visualize our data, so don’t worry about it.

Let’s check what the generated visualization looks like:

Notice that we can find the total number of negative and positive results by summing the rows, respectively.

There are a total of 386 negative sentences and a total of 364 positive ones, totaling 750 sentences, which is 25% of the 3,000 total sentences.

Of these 386 negative sentences, 322 were classified correctly, while 64 (about 16.5%) were classified as positive, creating some false positives. On the other hand, 293 positive sentences were classified correctly, while 71 (about 19.5%) were classified as negative, producing false negatives.

Now you are able to make informed decisions about restructuring and improving your model.