There are many ways to use Natural Language Processing, also known as NLP. In this blog we will talk about count vectorizers and how they can be useful when building models.

Let's start with what NLP is. It is a way to turn words into numerical values so that we can analyze them and build predictive models on that data.

I like to show by example, so today we will read in a dataset from sklearn containing emails grouped into categories. We will read in only four of the categories and practice using NLP on them. Then we will build a model to determine which category an email falls into.

First, we need to import some of our libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Getting that SKLearn dataset
from sklearn.datasets import fetch_20newsgroups

This gives us a few libraries to use and imports the dataset from sklearn with our email information.

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

# Setting our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

In this code block we pick the four categories we want, then build our training set and our testing set. We pass our four categories in through the categories parameter and remove the headers, footers, and quotes so the model learns from the message text alone.

data_train.keys()          # tells us what keys are in our dataset
len(data_train['data'])    # length of the training documents
len(data_train['target'])  # length of the training labels

We are making sure that our data and target are of equal length; otherwise we will get errors when modeling.
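As a quick sketch of that sanity check (using a made-up stand-in for the fetched dataset, since fetch_20newsgroups downloads from the network):

```python
# Toy stand-in for the fetched dataset: parallel 'data' and 'target' lists.
toy_train = {
    'data': ["the shuttle launched", "rendering 3d graphics", "on atheism"],
    'target': [0, 1, 2],
}

# The documents and their labels must line up one-to-one,
# or fitting a model later will fail.
assert len(toy_train['data']) == len(toy_train['target'])
print(len(toy_train['data']))  # 3
```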

from sklearn.feature_extraction.text import CountVectorizer

# Setting the vectorizer just like we would set a model
cvec = CountVectorizer(stop_words='english')

# Fitting the vectorizer on our training data
cvec.fit(data_train['data'])

We set cvec equal to a CountVectorizer so we can easily call it later. Additionally I passed in stop_words='english'. This drops common English words such as 'the', 'a', 'and', etc., which is useful so we don't base our model on words that carry no real predictive meaning. The vectorizer then needs to be fit on data_train['data'].
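To see what stop_words='english' does, here is a small sketch on a made-up two-sentence corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat and the dog", "a dog chased the cat"]

# Without stop-word removal, filler words like 'the' and 'and' become features.
plain = CountVectorizer()
plain.fit(docs)

# With stop_words='english', those common words are dropped from the vocabulary.
filtered = CountVectorizer(stop_words='english')
filtered.fit(docs)

print(sorted(plain.vocabulary_))     # includes 'the' and 'and'
print(sorted(filtered.vocabulary_))  # only the content words survive
```

Only the meaningful words ('cat', 'chased', 'dog') remain as features after filtering.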

X_train = pd.DataFrame(cvec.transform(data_train['data']).todense(),
                       columns=cvec.get_feature_names())

This step transforms the training data. The .todense() call turns the sparse count matrix into a dense one (which converts easily to a pandas DataFrame), and columns=cvec.get_feature_names() labels each column with the word it counts.

# Which words appear the most?
word_counts = X_train.sum(axis=0)
word_counts.sort_values(ascending=False).head(20)

I then built a word counter that sums each column of the transformed X_train to count the words up; sort_values(ascending=False) orders the words from the highest count down to the lowest.

names = data_train['target_names']

names now holds the four category names we initially read into our dataset.

X_test = pd.DataFrame(cvec.transform(data_test['data']).todense(),
                      columns=cvec.get_feature_names())

We then need to transform our X_test like we did above with our X_train.

y_test = data_test['target']

and set our y_test so we can measure how accurate our model is.

from sklearn.linear_model import LogisticRegression

y_train = data_train['target']  # training labels, matching y_test above

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

Our final step was to fit our model on X_train and y_train and then test it on X_test and y_test. lr.score(X_test, y_test) tells us how well the model performed; in this case it was about 75% accurate.
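The whole pipeline can be sketched end to end on a made-up two-category corpus (the ~75% accuracy above comes from the real newsgroup data, not this toy):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented corpus standing in for the newsgroup emails.
train_docs = ["the shuttle orbit launch", "render pixel graphics",
              "orbit launch rocket", "pixel shader graphics"]
y_train = [0, 1, 0, 1]  # 0 = sci.space, 1 = comp.graphics

# Vectorize the text, then fit the classifier on the counts.
cvec = CountVectorizer(stop_words='english')
X_train = cvec.fit_transform(train_docs)

lr = LogisticRegression()
lr.fit(X_train, y_train)

# Score held-out (also invented) documents, transformed with the same vectorizer.
test_docs = ["rocket launch", "graphics render"]
X_test = cvec.transform(test_docs)
print(lr.score(X_test, [0, 1]))
```

The same three steps — fit the vectorizer, transform both splits, fit and score the model — are exactly what the post walks through on the full dataset.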