This post describes the implementation of sentiment analysis of tweets using Python and the natural language toolkit NLTK. The post also describes the internals of NLTK related to this implementation.

Background

The purpose of the implementation is to be able to automatically classify a tweet as a positive or negative tweet sentiment wise.

The classifier needs to be trained and to do that, we need a list of manually classified tweets. Let’s start with 5 positive tweets and 5 negative tweets.

Positive tweets:

I love this car.

This view is amazing.

I feel great this morning.

I am so excited about the concert.

He is my best friend.

Negative tweets:

I do not like this car.

This view is horrible.

I feel tired this morning.

I am not looking forward to the concert.

He is my enemy.

In the full implementation, I use about 600 positive tweets and 600 negative tweets to train the classifier. I store those tweets in a Redis DB. Even with those numbers, it is quite a small sample and you should use a much larger set if you want good results.

Next is a test set so we can assess the exactitude of the trained classifier.

Test tweets:

I feel happy this morning. positive.

Larry is my friend. positive.

I do not like that man. negative.

My house is not great. negative.

Your song is annoying. negative.

Implementation

The following list contains the positive tweets:

pos_tweets = [('I love this car', 'positive'), ('This view is amazing', 'positive'), ('I feel great this morning', 'positive'), ('I am so excited about the concert', 'positive'), ('He is my best friend', 'positive')]

The following list contains the negative tweets:

neg_tweets = [('I do not like this car', 'negative'), ('This view is horrible', 'negative'), ('I feel tired this morning', 'negative'), ('I am not looking forward to the concert', 'negative'), ('He is my enemy', 'negative')]

We take both of those lists and create a single list of tuples each containing two elements. First element is an array containing the words and second element is the type of sentiment. We get rid of the words smaller than 2 characters and we use lowercase for everything.

tweets = [] for (words, sentiment) in pos_tweets + neg_tweets: words_filtered = [e.lower() for e in words.split() if len(e) >= 3] tweets.append((words_filtered, sentiment))

The list of tweets now looks like this:

tweets = [ (['love', 'this', 'car'], 'positive'), (['this', 'view', 'amazing'], 'positive'), (['feel', 'great', 'this', 'morning'], 'positive'), (['excited', 'about', 'the', 'concert'], 'positive'), (['best', 'friend'], 'positive'), (['not', 'like', 'this', 'car'], 'negative'), (['this', 'view', 'horrible'], 'negative'), (['feel', 'tired', 'this', 'morning'], 'negative'), (['not', 'looking', 'forward', 'the', 'concert'], 'negative'), (['enemy'], 'negative')]

Finally, the list with the test tweets:

test_tweets = [ (['feel', 'happy', 'this', 'morning'], 'positive'), (['larry', 'friend'], 'positive'), (['not', 'like', 'that', 'man'], 'negative'), (['house', 'not', 'great'], 'negative'), (['your', 'song', 'annoying'], 'negative')]

Classifier

The list of word features need to be extracted from the tweets. It is a list with every distinct words ordered by frequency of appearance. We use the following function to get the list plus the two helper functions.

word_features = get_word_features(get_words_in_tweets(tweets))

def get_words_in_tweets(tweets): all_words = [] for (words, sentiment) in tweets: all_words.extend(words) return all_words

def get_word_features(wordlist): wordlist = nltk.FreqDist(wordlist) word_features = wordlist.keys() return word_features

If we take a pick inside the function get_word_features, the variable ‘wordlist’ contains:

<FreqDist: 'this': 6, 'car': 2, 'concert': 2, 'feel': 2, 'morning': 2, 'not': 2, 'the': 2, 'view': 2, 'about': 1, 'amazing': 1, ... >

We end up with the following list of word features:

word_features = [ 'this', 'car', 'concert', 'feel', 'morning', 'not', 'the', 'view', 'about', 'amazing', ... ]

As you can see, ‘this’ is the most used word in our tweets, followed by ‘car’, followed by ‘concert’…

To create a classifier, we need to decide what features are relevant. To do that, we first need a feature extractor. The one we are going to use returns a dictionary indicating what words are contained in the input passed. Here, the input is the tweet. We use the word features list defined above along with the input to create the dictionary.

def extract_features(document): document_words = set(document) features = {} for word in word_features: features['contains(%s)' % word] = (word in document_words) return features

As an example, let’s call the feature extractor with the document [‘love’, ‘this’, ‘car’] which is the first positive tweet. We obtain the following dictionary which indicates that the document contains the words: ‘love’, ‘this’ and ‘car’.

{'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False, 'contains(about)': False, 'contains(horrible)': False, 'contains(like)': False, 'contains(this)': True, 'contains(friend)': False, 'contains(concert)': False, 'contains(feel)': False, 'contains(love)': True, 'contains(looking)': False, 'contains(tired)': False, 'contains(forward)': False, 'contains(car)': True, 'contains(the)': False, 'contains(amazing)': False, 'contains(enemy)': False, 'contains(great)': False}

With our feature extractor, we can apply the features to our classifier using the method apply_features. We pass the feature extractor along with the tweets list defined above.

training_set = nltk.classify.apply_features(extract_features, tweets)

The variable ‘training_set’ contains the labeled feature sets. It is a list of tuples which each tuple containing the feature dictionary and the sentiment string for each tweet. The sentiment string is also called ‘label’.

[({'contains(not)': False, ... 'contains(this)': True, ... 'contains(love)': True, ... 'contains(car)': True, ... 'contains(great)': False}, 'positive'), ({'contains(not)': False, 'contains(view)': True, ... 'contains(this)': True, ... 'contains(amazing)': True, ... 'contains(enemy)': False, 'contains(great)': False}, 'positive'), ...]

Now that we have our training set, we can train our classifier.

classifier = nltk.NaiveBayesClassifier.train(training_set)

Here is a summary of what we just saw:

The Naive Bayes classifier uses the prior probability of each label which is the frequency of each label in the training set, and the contribution from each feature. In our case, the frequency of each label is the same for ‘positive’ and ‘negative’. The word ‘amazing’ appears in 1 of 5 of the positive tweets and none of the negative tweets. This means that the likelihood of the ‘positive’ label will be multiplied by 0.2 when this word is seen as part of the input.

Let’s take a look inside the classifier train method in the source code of the NLTK library. ‘label_probdist’ is the prior probability of each label and ‘feature_probdist’ is the feature/value probability dictionary. Those two probability objects are used to create the classifier.

def train(labeled_featuresets, estimator=ELEProbDist): ... # Create the P(label) distribution label_probdist = estimator(label_freqdist) ... # Create the P(fval|label, fname) distribution feature_probdist = {} ... return NaiveBayesClassifier(label_probdist, feature_probdist)

In our case, the probability of each label is 0.5 as we can see below. label_probdist is of type ELEProbDist.

print label_probdist.prob('positive') 0.5 print label_probdist.prob('negative') 0.5

The feature/value probability dictionary associates expected likelihood estimate to a feature and label. We can see that the probability for the input to be negative is about 0.077 when the input contains the word ‘best’.

print feature_probdist {('negative', 'contains(view)'): <ELEProbDist based on 5 samples>, ('positive', 'contains(excited)'): <ELEProbDist based on 5 samples>, ('negative', 'contains(best)'): <ELEProbDist based on 5 samples>, ...} print feature_probdist[('negative', 'contains(best)')].prob(True) 0.076923076923076927

We can display the most informative features for our classifier using the method show_most_informative_features. Here, we see that if the input does not contain the word ‘not’ then the positive ration is 1.6.

print classifier.show_most_informative_features(32) Most Informative Features contains(not) = False positi : negati = 1.6 : 1.0 contains(tired) = False positi : negati = 1.2 : 1.0 contains(excited) = False negati : positi = 1.2 : 1.0 contains(great) = False negati : positi = 1.2 : 1.0 contains(looking) = False positi : negati = 1.2 : 1.0 contains(like) = False positi : negati = 1.2 : 1.0 contains(love) = False negati : positi = 1.2 : 1.0 contains(amazing) = False negati : positi = 1.2 : 1.0 contains(enemy) = False positi : negati = 1.2 : 1.0 contains(about) = False negati : positi = 1.2 : 1.0 contains(best) = False negati : positi = 1.2 : 1.0 contains(forward) = False positi : negati = 1.2 : 1.0 contains(friend) = False negati : positi = 1.2 : 1.0 contains(horrible) = False positi : negati = 1.2 : 1.0 ...

Classify

Now that we have our classifier initialized, we can try to classify a tweet and see what the sentiment type output is. Our classifier is able to detect that this tweet has a positive sentiment because of the word ‘friend’ which is associated to the positive tweet ‘He is my best friend’.

tweet = 'Larry is my friend' print classifier.classify(extract_features(tweet.split())) positive

Let’s take a look at how the classify method works internally in the NLTK library. What we pass to the classify method is the feature set of the tweet we want to analyze. The feature set dictionary indicates that the tweet contains the word ‘friend’.

print extract_features(tweet.split()) {'contains(not)': False, 'contains(view)': False, 'contains(best)': False, 'contains(excited)': False, 'contains(morning)': False, 'contains(about)': False, 'contains(horrible)': False, 'contains(like)': False, 'contains(this)': False, 'contains(friend)': True, 'contains(concert)': False, 'contains(feel)': False, 'contains(love)': False, 'contains(looking)': False, 'contains(tired)': False, 'contains(forward)': False, 'contains(car)': False, 'contains(the)': False, 'contains(amazing)': False, 'contains(enemy)': False, 'contains(great)': False}

def classify(self, featureset): # Discard any feature names that we've never seen before. # Find the log probability of each label, given the features. # Then add in the log probability of features given labels. # Generate a probability distribution dictionary using the dict logprod # Return the sample with the greatest probability from the probability # distribution dictionary

Let’s go through that method using our example. The parameter passed to the method classify is the feature set dictionary we saw above. The first step is to discard any feature names that are not know by the classifier. This step does nothing in our case so the feature set stays the same.

Next step is to find the log probability for each label. The probability of each label (‘positive’ and ‘negative’) is 0.5. The log probability is the log base 2 of that which is -1. We end up with logprod containing the following:

{'positive': -1.0, 'negative': -1.0}

The log probability of features given labels is then added to logprod. This means that for each label, we go through the items in the feature set and we add the log probability of each item to logprod[label]. For example, we have the feature name ‘friend’ and the feature value True. Its log probability for the label ‘positive’ in our classifier is -2.12. This value is added to logprod[‘positive’]. We end up with the following logprod dictionary.

{'positive': -5.4785441837188511, 'negative': -14.784261334886439}

The probability distribution dictionary of type DictionaryProbDist is generated:

DictionaryProbDist(logprob, normalize=True, log=True)

The label with the greatest probability is returned which is ‘positive’. Our classifier finds out that this tweets has a positive sentiment based on the training we did.

Another example is the tweet ‘My house is not great’. The word ‘great’ weights more on the positive side but the word ‘not’ is part of two negative tweets in our training set so the output from the classifier is ‘negative’. Of course, the following tweet: ‘The movie is not bad’ would return ‘negative’ even if it is ‘positive’. Again, a large and well chosen sample will help with the accuracy of the classifier.

Taking the following test tweet ‘Your song is annoying’. The classifier thinks it is positive. The reason is that we don’t have any information on the feature name ‘annoying’. Larger the training sample tweets is, better the classifier will be.

tweet = 'Your song is annoying' print classifier.classify(extract_features(tweet.split())) positive

There is an accuracy method we can use to check the quality of our classifier by using our test tweets. We get 0.8 in our case which is high because we picked our test tweets for this article. The key is to have a very large number of manually classified positive and negative tweets.

Voilà. Don’t hesitate to post a comment if you have any feedback.