Everybody loves movies, right? In this article, we'll cover some basic concepts in NLP and show how to implement a simple model that compares the similarity between movie summaries using the scikit-learn and NLTK libraries.

The content of this article is based on this DataCamp Project, and the data used consists of plot summaries of 100 movies from Wikipedia and IMDb.

Preparing Data

When dealing with text data we need to make some special adjustments to our dataset, such as tokenization, stemming and vectorization. Let's see what each of these steps does.

Tokenization

Given a sequence of characters, tokenization is the process of breaking it into basic semantic units with a useful meaning. These pieces are called tokens. For example, the sentence “How’s your day?” can be split into the tokens “How”, “’s”, “your”, “day” and “?”, each one with a specific meaning.
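A minimal sketch of this with NLTK's word_tokenize (it assumes the punkt tokenizer models have been downloaded, which only needs to happen once):

import nltk
nltk.download('punkt', quiet=True)  # tokenizer models, downloaded once

print(nltk.word_tokenize("How's your day?"))
# ['How', "'s", 'your', 'day', '?']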

Stemming

Consider the sentences “We need more computational power” and “We need computers to be more powerful”. They both have similar meanings, but the main words are written differently (computational/computer, power/powerful). We can reduce the inflectional forms of these words to a root or base form, such as “comput” and “power”, which carry the same meaning. This way, the size of the vocabulary is reduced, making it easier for the model to train on a limited dataset.
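A quick sketch with NLTK's Snowball stemmer, showing how those word pairs collapse to the same roots:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
print([stemmer.stem(w) for w in ["computational", "computer", "power", "powerful"]])
# ['comput', 'comput', 'power', 'power']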

Vectorization

Computers can’t process anything but numbers. In order to perform any operations on text, we first need to transform the text into numbers. This process is called vectorization, and as the name suggests, the results are organized in vectors. There are multiple ways of vectorizing text. Here we’ll be using Bag of Words and TF-IDF.

Bag of Words (BoW): Each sentence is represented as a vector with a fixed size equal to the number of words in the vocabulary, with each position representing one specific word. The value at each position is the number of occurrences of that specific word in the sentence.

Term Frequency-Inverse Document Frequency (TF-IDF): When we need to compare the similarity between two documents, it’s helpful to assign a measure of importance to each word in the documents, so we can focus on specific parts. TF-IDF computes this “importance” as a product of two terms: the TF term measures the frequency of the word in the document, while the IDF term measures how rare documents containing that word are. The basic idea is: if a word appears a lot in very few documents, it must be important; if it appears a lot in many documents, it must not be important (“the”, “of”, “a”, for example).
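To make both representations concrete, here's a minimal sketch on two toy sentences (the sentences and exact weights are just illustrative of the mechanics):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat on the mat", "the dog sat on the log"]

counts = CountVectorizer().fit_transform(docs)    # Bag of Words: raw word counts
tfidf = TfidfTransformer().fit_transform(counts)  # counts reweighted by TF-IDF

print(counts.toarray())          # one row per sentence, one column per vocabulary word
print(tfidf.toarray().round(2))  # distinctive words like "cat" get boosted relative to shared ones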

The Code

We are now ready to implement these concepts. First we’ll import some libraries: Pandas to manipulate our data, scikit-learn to create our pipeline, and NLTK for the tokenizer, stopwords and stemming.

The CSV file is loaded into a Pandas DataFrame. The movies dataset has two plot columns: wiki_plot and imdb_plot. We’ll concatenate them both into a new column called plot so we have a single column with more information.

import numpy as np
import pandas as pd
import re
import nltk

movies_df = pd.read_csv('datasets/movies.csv')
movies_df['plot'] = movies_df['wiki_plot'].astype(str) + "\n" + movies_df['imdb_plot'].astype(str)

Next, we define a function called normalize that will tokenize, stem, and filter out special characters in each word of each document in our dataset.

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english", ignore_stopwords=False)

def normalize(X):
    normalized = []
    for x in X:
        words = nltk.word_tokenize(x)
        normalized.append(' '.join([stemmer.stem(word) for word in words if re.match('[a-zA-Z]+', word)]))
    return normalized
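As an illustrative example of what normalize does (assuming the NLTK tokenizer data is available):

print(normalize(["The computers are powerful!"]))
# ['the comput are power'] — punctuation dropped, words stemmed and lowercased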

Next, we define a pipeline with three steps:

1. Apply the normalize function

2. Vectorize all the documents using Bag of Words (this step also removes stopwords)

3. Transform the Bag of Words counts into a TF-IDF matrix

We then call fit_transform on our pipeline, passing the plot column of all movies as the argument. This method runs each step sequentially, transforming the data along the way, and returns the result of the last step: a (n_movies, n_words) matrix with the corresponding TF-IDF vector for each movie.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import FunctionTransformer

pipe = Pipeline([
    ('normalize', FunctionTransformer(normalize, validate=False)),
    ('counter_vectorizer', CountVectorizer(
        max_df=0.8, max_features=200000,
        min_df=0.2, stop_words='english',
        ngram_range=(1,3)
    )),
    ('tfidf_transform', TfidfTransformer())
])

tfidf_matrix = pipe.fit_transform([x for x in movies_df['plot']])
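As a quick, illustrative sanity check, the matrix should have one row per movie:

print(tfidf_matrix.shape)  # (100, n_terms) — the column count depends on the extracted vocabulary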

Since the sentences are now converted into vectors, we can compute the cosine similarity between them to represent the “distance” between those vectors. The cosine similarity is calculated from the tfidf_matrix and returns a matrix with dimensions (n_movies, n_movies) containing the similarity between every pair of movies.

from sklearn.metrics.pairwise import cosine_similarity

similarity_distance = 1 - cosine_similarity(tfidf_matrix)
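Under the hood, cosine similarity is just the normalized dot product between two vectors; here's a minimal NumPy sketch of the formula:

import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707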

Using the similarity distance, it is possible to create a dendrogram:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

mergings = linkage(similarity_distance, method='complete')
dendrogram_ = dendrogram(mergings,
                         labels=[x for x in movies_df["title"]],
                         leaf_rotation=90,
                         leaf_font_size=16,
)

fig = plt.gcf()
_ = [lbl.set_color('r') for lbl in plt.gca().get_xmajorticklabels()]
fig.set_size_inches(108, 21)

plt.show()

Dendrogram of similar movies

We can even create a function to search for the movie most similar to a given one. Here we use NumPy's argsort method to find the second most similar movie in the matrix. This is because the minimum distance of a movie is to itself, which can be seen on the main diagonal of similarity_distance (all the values there are zero).

def find_similar(title):
    index = movies_df[movies_df['title'] == title].index[0]
    vector = similarity_distance[index, :]
    most_similar = movies_df.iloc[np.argsort(vector)[1], 1]
    return most_similar

print(find_similar('Good Will Hunting'))  # prints "The Graduate"
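As a hypothetical extension (not part of the original project), the same argsort trick can return the k closest movies instead of just one:

def find_top_k(title, k=3):
    # hypothetical helper: skip position 0 (the movie itself) and take the next k
    index = movies_df[movies_df['title'] == title].index[0]
    order = np.argsort(similarity_distance[index, :])
    return movies_df['title'].iloc[order[1:k + 1]].tolist()

print(find_top_k('Good Will Hunting', k=3))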

Conclusion

Training a machine learning model for basic tasks in NLP is simple. First, data is tokenized and filtered so that it can be represented with units called tokens. We can also reduce words to their root form, so the vocabulary shrinks, and then we vectorize our dataset using an algorithm that depends on the problem we are trying to solve. After that, we can train a machine learning model, generate more text, visualize our data, and so on.

See you next time!