With everyone self-quarantined these days due to the coronavirus, most of us are cooped up at home watching Netflix to help pass the time. While browsing the endless list of movies, or right after finishing one, you’ve probably seen the Netflix feature that recommends other movies you may also like. How does Netflix know how to make these recommendations for you? In this project, I will create a movie recommender system in Python to show you how companies like Netflix (as well as Amazon, Spotify, and YouTube) are able to make intelligent recommendations for their products and services.

“If you liked the Inception movie, you’ll like these five other movies as well!” –The Netflix Movie Recommender System

Recommender systems are something I’ve always thought were so cool, and in fact they are one of the reasons I wanted to pursue a career in data science. It seemed like such an intelligent feature, and I wanted to understand the science and the algorithm behind how a recommender system works. The same basic idea is behind Amazon’s product recommendations, Spotify’s song recommendations and Discover Weekly playlists, and YouTube’s homepage and video recommendations. For these companies, recommender systems not only encourage users to keep using their products, they also get users hooked and generate more revenue. For Netflix, the recommender system encourages users to keep watching after they’ve finished a movie, which inevitably leads to binge watching for hours on end. How many times have you said “just one more movie (or TV show)” and before you know it, it’s already 3am and you have to get up for work in a few hours? This not only keeps you from wanting to cancel your Netflix subscription, it also encourages you to tell your friends about the latest Netflix movie (or documentary or TV series) you watched, which encourages them to go on Netflix as well, and the vicious cycle goes on. We all know that the most powerful form of marketing is word of mouth.

So how does it all work?

Netflix Recommender System: “If you liked Inception, you’ll also like Blade Runner, 2012, Birdbox, Altered Carbon, and more!”

The Data Set

For this Recommender System, I will be using Python along with a data set from https://grouplens.org/datasets/movielens/ which has about 100,000 real-life movie reviews from 944 users rating movies across 1,664 movie titles. We will use this data to create a simplified version of Netflix’s recommender system.

This data set includes the following four fields:

User ID: User ID of the individual who rated the movie

Movie: The name of the movie

Rating: The user’s rating of the movie (1 = worst to 5 = best)

Timestamp: Date/Time at which the user submitted the review

Below are the python libraries we’ll be importing for this recommender system:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

I’ve already downloaded the data and saved it in the same folder as my Jupyter notebook. We have one main data file (u.data) with the user_id, item_id (movie ID, not title), rating, and timestamp. We also have a second file (“Movie_Id_Titles”) with the movie ID and the movie title, which we will merge into the dataframe to pull the movie title into our data set. Here are the first five rows of our data set.

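column_names = ['user_id', 'item_id', 'rating', 'timestamp']
# u.data is tab-separated and has no header row, so we supply the column names
df = pd.read_csv('u.data', sep='\t', names=column_names)
# Movie_Id_Titles maps each item_id to a movie title
movie_titles = pd.read_csv("Movie_Id_Titles")
df = pd.merge(df, movie_titles, on='item_id')
df.head()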
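One thing worth knowing about the merge step: pd.merge uses an inner join by default, so any rating whose item_id has no matching title is silently dropped. Here is a minimal sketch of how you could check for that. The two small dataframes are made-up stand-ins for u.data and Movie_Id_Titles, not the real files.

```python
import pandas as pd

# Toy stand-ins for u.data and Movie_Id_Titles (values are invented for illustration).
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [50, 172, 50, 999],  # item 999 has no title in toy_titles
    "rating": [5, 4, 4, 3],
})
toy_titles = pd.DataFrame({
    "item_id": [50, 172],
    "title": ["Star Wars (1977)", "Empire Strikes Back, The (1980)"],
})

# Default inner join: the rating for item 999 disappears.
inner = pd.merge(toy_ratings, toy_titles, on="item_id")

# Left join keeps every rating; indicator=True adds a "_merge" column
# that marks rows with no matching title as "left_only".
left = pd.merge(toy_ratings, toy_titles, on="item_id", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(inner), len(left), len(unmatched))
```

If unmatched comes back empty on the real files, the inner join loses nothing and the simpler call is fine.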

Before we create our Recommender System, the first step is to do some exploratory data analysis so that we can understand and familiarize ourselves with the data.

Exploratory Data Analysis

Let’s take a look at the highest-rated movies based on the average movie rating.

df.groupby('title')['rating'].mean().sort_values(ascending=False).head()

Above are the top five movies from the data set. However, there is something wrong with the data here. For a movie to have an average rating of 5.0, that means every single person who watched the movie rated it a 5, which is unlikely even for the most critically acclaimed movies. The movies we see above are probably movies that were only rated a few times. We need to pull in the number of reviews to paint a more complete picture of the rating. Below are the top five movies based solely on the number of reviews.

df.groupby('title')['rating'].count().sort_values(ascending=False).head()

Here, we see that the movie with the highest number of ratings is Star Wars (1977) with 584 ratings (the max in this data set). Let’s combine the number of reviews and the average rating for each movie into the same table, and look at the top 25 movies again, sorted by average rating.

# Average rating per movie
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
# Number of ratings per movie, aligned on the title index
ratings['num of ratings'] = df.groupby('title')['rating'].count()
ratings.sort_values(by='rating', ascending=False).head(25)

As expected, we see that the movies with a perfect rating of 5.0 only had a few ratings (<3 in the green box), while the more popular movies with many more ratings (>100 in the yellow box) have average ratings no greater than 4.5. This will be good to know as we think about how to accurately pull the best recommendations for a movie.
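To make the point concrete, here is a small sketch on invented data showing how ranking by average alone puts a barely-rated movie on top, and how a minimum-count filter changes the picture. The titles "A", "B", "C" and the threshold min_ratings are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Invented ratings: "B" has a single 5-star rating, "A" has four mixed ratings.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "A", "B", "C", "C"],
    "rating": [5, 4, 4, 3, 5, 2, 3],
})

# Mean and count in one pass using named aggregation.
summary = toy.groupby("title")["rating"].agg(mean_rating="mean", num_ratings="count")

# Sorting by mean alone puts the once-rated movie first.
by_mean = summary.sort_values("mean_rating", ascending=False)

# Keeping only movies with enough ratings gives a more trustworthy ranking.
min_ratings = 3  # hypothetical threshold for this toy data
trusted = summary[summary["num_ratings"] >= min_ratings].sort_values(
    "mean_rating", ascending=False
)

print(by_mean)
print(trusted)
```

The right threshold depends on the data set, so treat min_ratings as a knob to experiment with rather than a fixed rule.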

Next, let’s see what the distribution for the number of ratings looks like by plotting a histogram with 100 bins using matplotlib.pyplot.

plt.figure(figsize=(12,6))
plt.title('Histogram: Number of Ratings')
plt.xlabel('Number of Ratings for Movie')
plt.ylabel('Number of Ratings in Bucket')
ratings['num of ratings'].hist(bins=100)

From the histogram, we see that most of the movies don’t have many ratings (<50 ratings on the x-axis). In fact, many movies have only 0 to 3 ratings. This makes sense as most people would watch the more famous blockbuster movies, so those movies would have a lot more ratings, as we see in the small bars further to the right along the x-axis.

Next, let’s take a look at the histogram for the average movie ratings.

plt.figure(figsize=(12,6))
plt.title('Histogram: Average Rating')
plt.xlabel('Average Rating for Movie')
plt.ylabel('Number of Movies')
ratings['rating'].hist(bins=50)

Notice the peaks at the whole numbers (1, 2, 3, 4, 5). These are likely due to movies that were only rated a few times. Movies with many reviews are less likely to have averages that are exactly 1.0, 2.0, etc. The remaining movies fall mostly between 2.5 and 4.0, taper off toward 4.5, and follow a roughly normal distribution. There are a few outliers between 1 and 2.5, and these are likely to be the movies that are simply bad.

Next, let’s make a Joint Plot to see the relationship, or correlation, between the average ratings and the number of ratings for these movies. Correlation is an important concept for recommender systems. It describes a statistical relationship, whether causal or not, between two variables, and the measure we use here is the Pearson correlation coefficient (Pearson’s r). If two variables are positively correlated, then one tends to increase as the other increases. A correlation of 1.0 means a perfect positive linear relationship, while a correlation of 0 means there is no linear relationship between the two variables. A correlation can also be negative, meaning that as one variable increases, the other decreases, but that will not be relevant to our recommender system today.
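If you want to see what Pearson’s r actually computes, here is a short sketch that works it out by hand (the covariance of the two variables divided by the product of their spreads) and checks it against NumPy. The helper name pearson_r and the three small lists are made up for this illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]  # rises in step with a: perfect positive correlation
c = [5, 4, 3, 2, 1]   # falls as a rises: perfect negative correlation

print(pearson_r(a, b), pearson_r(a, c))
print(np.corrcoef(a, b)[0, 1])  # NumPy's built-in result for comparison
```

pandas uses this same coefficient by default in its corr() and corrwith() methods.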

Let’s create that Joint Plot.

sns.jointplot(x='rating',y='num of ratings',data=ratings,alpha=0.5)

From the Joint Plot above, we see that there tends to be a positive relationship between the number of ratings and the average rating. Generally speaking, the more ratings a movie has (y-axis), the higher its average rating (x-axis). This makes sense because if a movie is good, then more people are likely to watch it and leave a positive review. Notice how this trend only goes up to about 4.5, the maximum average for movies with many ratings (>50). This makes sense, as it is extremely unlikely for every single person who watches a famous movie to rate it a 5. If even one person fails to give a 5-star rating, then the average cannot be a perfect 5.

Okay, now that we’ve explored the data, let’s get into how to actually create our recommender system using this idea of correlation.

Building the Recommender System

To create this recommender system, we create a pivot table with the Movie Titles as the columns, the UserID as the rows, and the Rating as the values. Below are the first five rows of the pivot table.

moviemat = df.pivot_table(index='user_id', columns='title', values='rating')
moviemat.head()

If you look at the table, you’ll see that most of the entries are null (NaN). This makes sense because most people have not seen most of the 1,664 movies in this data set.
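Here is a tiny sketch of what the pivot does, using invented ratings from three users on three movies, so you can see where the NaN values come from and measure how sparse the table is. The names toy, toy_mat, and sparsity are hypothetical.

```python
import pandas as pd

# Invented long-format ratings: one row per (user, movie) rating.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "title":   ["A", "B", "A", "B", "C"],
    "rating":  [5, 3, 4, 2, 5],
})

# One row per user, one column per movie; unrated combinations become NaN.
# pivot_table averages duplicates by default, so a user who rated the same
# title twice would get the mean of those ratings.
toy_mat = toy.pivot_table(index="user_id", columns="title", values="rating")

# Fraction of user/movie cells that have no rating.
sparsity = toy_mat.isna().to_numpy().mean()

print(toy_mat)
print(f"{sparsity:.0%} of cells are empty")
```

Running the same sparsity line on moviemat would tell you how empty the real user-by-movie table is.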

Let’s take a look at our data again and select two movies that we would like to get recommendations for. We sort by the number of ratings to find two popular movies that have a high number of ratings.

ratings.sort_values('num of ratings',ascending=False).head(10)

From the table above, let’s select Star Wars (1977) and Liar Liar (1997). The first is a fantasy/action movie from the 1970s, and the second is a comedy from the 1990s. Let’s take the user ratings for these two movies from the pivot table. If you recall, each movie is a single column of user ratings in the pivot table.

starwars_user_ratings = moviemat['Star Wars (1977)']
liarliar_user_ratings = moviemat['Liar Liar (1997)']

Next, we will use the corrwith() method to get the correlations between movies. This will give us the correlation between Star Wars/Liar Liar and every other movie in the data set. We will remove the nulls (NaN) as they give us no value, and also add the number of ratings to the table. Outputs are seen below, sorted by alphabetical order (by default).

similar_to_liarliar = moviemat.corrwith(liarliar_user_ratings)
corr_liarliar = pd.DataFrame(similar_to_liarliar, columns=['Correlation'])
corr_liarliar.dropna(inplace=True)
corr_liarliar = corr_liarliar.join(ratings['num of ratings'])
# corr_liarliar[corr_liarliar['num of ratings']>100].sort_values('Correlation',ascending=False)
corr_liarliar.head(10)

similar_to_starwars = moviemat.corrwith(starwars_user_ratings)
corr_starwars = pd.DataFrame(similar_to_starwars, columns=['Correlation'])
corr_starwars.dropna(inplace=True)
corr_starwars = corr_starwars.join(ratings['num of ratings'])
# corr_starwars[corr_starwars['num of ratings']>100].sort_values('Correlation',ascending=False)
corr_starwars.head(10)
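To see how corrwith() behaves with missing ratings, here is a sketch on a made-up user-by-movie table. For each column, the correlation is computed only over the users who rated both movies, which is why a movie with very few overlapping ratings can end up with a misleadingly perfect score. The column names and values are invented.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Invented ratings: rows are users, columns are movies, NaN means "not rated".
toy_mat = pd.DataFrame(
    {
        "Target":   [5, 4, 3, 2, 1, nan],
        "Similar":  [5, 5, 3, 1, 1, 4],
        "Opposite": [1, 2, 3, 4, 5, nan],
        "Sparse":   [nan, nan, nan, 3, 2, 5],  # shares only two users with Target
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="user_id"),
)

# Correlate every column with the Target column, using overlapping users only.
corr = toy_mat.corrwith(toy_mat["Target"])

print(corr.sort_values(ascending=False))
```

"Sparse" scores as high as "Target" itself even though only two users rated both, because any two points that differ lie on a perfectly straight line.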

Key Concept: The idea of creating these correlation tables is to find the movies that are highly correlated with each other. This means that if someone rates a movie with a high rating, then they are likely to rate the second movie with a high rating as well. In other words, if someone likes the first movie, then they’ll also like the second movie. This is the most important concept and the crux of how a recommender system works!

If two movies’ ratings are highly correlated (Pearson R is close to 1), this means that if someone likes the first movie, they will also like the second movie as well.

Finally, we are at the last step of our recommender system. The tables above are sorted alphabetically, but there is one more filter and sort we should apply to get the most accurate list of positively correlated movies. We should filter out the movies that have a low number of ratings, because correlations computed from only a handful of ratings are unreliable and can come out as an unlikely perfect 1.0, much like the unlikely perfect 5.0 averages we saw during the data exploration phase. Therefore, we will only look at the movies that have more than 100 ratings. Last, we will sort by the correlation in descending order to get our final list of the top 10 recommended movies for Star Wars and Liar Liar.
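The correlate, filter, and sort steps described above can be wrapped into one reusable function. This is only a sketch: the function name recommend and its parameters are my own invention, and unlike the tables in this post it drops the movie itself from the results. It is demonstrated on a made-up table with a small threshold so the effect of the filter is visible.

```python
import numpy as np
import pandas as pd

def recommend(moviemat, title, min_ratings=100, top_n=10):
    """Hypothetical helper: movies most correlated with `title`,
    keeping only movies with more than `min_ratings` ratings."""
    corr = moviemat.corrwith(moviemat[title]).rename("Correlation").to_frame()
    # Non-null entries per column = number of ratings for that movie.
    corr["num of ratings"] = moviemat.count()
    corr = corr.dropna(subset=["Correlation"]).drop(index=title, errors="ignore")
    popular = corr[corr["num of ratings"] > min_ratings]
    return popular.sort_values("Correlation", ascending=False).head(top_n)

nan = np.nan
# Invented user-by-movie ratings (NaN = not rated).
toy_mat = pd.DataFrame({
    "Target":   [5, 4, 3, 2, 1, nan],
    "Similar":  [5, 5, 3, 1, 1, 4],
    "Opposite": [1, 2, 3, 4, 5, nan],
    "Sparse":   [nan, nan, nan, 3, 2, 5],
})

unfiltered = recommend(toy_mat, "Target", min_ratings=0)
filtered = recommend(toy_mat, "Target", min_ratings=3)

print(unfiltered)
print(filtered)
```

Without the filter, the barely-rated "Sparse" movie tops the list; with it, "Similar" comes first. On the real data you would call something like recommend(moviemat, 'Star Wars (1977)').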

STAR WARS (1977) TOP 10 RECOMMENDED MOVIES

corr_starwars[corr_starwars['num of ratings']>100].sort_values('Correlation',ascending=False).head(10)

The table above shows the top 10 movies recommended for the 1977 Star Wars movie. We can ignore the first row, Star Wars (1977) with a correlation of 1.0, as a variable will always be perfectly correlated with itself. We see that The Empire Strikes Back (1980) and Return of the Jedi (1983) are #2 and #3 on the list with high correlations (>0.67). This makes a lot of sense as they are sequels in the same Star Wars saga. Raiders of the Lost Ark (1981) takes the #4 spot, which makes sense as it is also a popular Indiana Jones fantasy/action movie from the 80s. As we go down the list, the number of ratings goes down, as does the correlation (there is a big drop in correlation after Raiders of the Lost Ark (1981)). The relevance of the movies to Star Wars tends to go down as well. For example, it wouldn’t make sense to see Austin Powers (comedy, 1997) or Toy Story (children’s animation, 1995) as recommended movies for Star Wars. However, these other famous movies sometimes make the list simply because they are popular. People who like one popular movie are likely to like another (unrelated) popular movie because popular movies tend to have higher ratings. As mentioned, this recommender system is only a simplified version of the complex Netflix recommender system, and possible ways to account for this are discussed in the conclusion section of this page.

Let’s take a quick look at the top 10 movies for Liar Liar.

LIAR LIAR (1997) TOP 10 RECOMMENDED MOVIES

corr_liarliar[corr_liarliar['num of ratings']>100].sort_values('Correlation',ascending=False).head(10)

Above, we see the top 10 recommended movies for Liar Liar. The first choice is Batman Forever, which could be a good recommendation, but The Mask (1994) would be an even better one as it is also a 1990s comedy starring Jim Carrey. As you’ll see, the correlations are relatively low to begin with (<0.52), but this could be because of the small amount of data that we have. We could also play around with the number-of-ratings filter (current threshold = 100 ratings) to adjust our criteria, for example by setting the threshold to 200 instead of 100. In the next section we will talk about ways to improve and tweak our recommender system algorithm.
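One possible tweak, which is not part of the notebook above, is to filter on how many users rated both movies instead of how many ratings a movie has in total, since the correlation is only computed over that overlap. Here is a sketch on made-up data. The function name co_rating_counts is hypothetical.

```python
import numpy as np
import pandas as pd

def co_rating_counts(moviemat, title):
    """Hypothetical helper: for each movie, count the users who rated
    both that movie and `title`."""
    rated_target = moviemat[title].notna()
    # Keep only users who rated the target, then count non-null ratings per movie.
    return moviemat.loc[rated_target].notna().sum().rename("num co-ratings")

nan = np.nan
# Invented user-by-movie ratings (NaN = not rated).
toy_mat = pd.DataFrame({
    "Target":   [5, 4, 3, 2, 1, nan],
    "Similar":  [5, 5, 3, 1, 1, 4],
    "Opposite": [1, 2, 3, 4, 5, nan],
    "Sparse":   [nan, nan, nan, 3, 2, 5],
})

counts = co_rating_counts(toy_mat, "Target")
print(counts)
```

"Sparse" has three ratings in total but only two users in common with "Target", so a filter on counts would catch it even when a filter on total ratings does not.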

Conclusion

Hopefully by now, you understand the idea behind how recommender systems like Netflix’s work. They take the ratings for a movie and find the correlation between that movie and every other movie in the system. Then they sort the movies by correlation in descending order (1 = highest correlation, 0 = no correlation) and filter by a minimum number of ratings to get a robust list of recommended movies.

From our final lists for Star Wars and Liar Liar, we saw that the algorithm works, but it wasn’t perfect, and that’s okay as we only made a simplified version of the recommender system and didn’t have all the data that Netflix has. Even if we did, my laptop would not be able to process the petabytes of data in the Netflix database. The real-life Netflix algorithm has far more data to work with and a lot more complex filtering algorithms. For example, it could filter movie recommendations by similar genres, sub-genres, famous actors/actresses in the movie, which other movies the user has already seen and liked, which movies are currently trending amongst other users with similar tastes, demographics, age, gender, ethnicity, language, and so much more. Since our data set was pretty small (only 100,000 ratings over 1,664 movies), we didn’t have as much data to feed into and filter by in our recommender system. A recommender system is only as good as the data that’s in it. Also, we only used correlation and number of ratings as our criteria, which is a simplification. If we had more data, such as user information and other movie details, we could create a more accurate and complex system. With the massive amount of data and computing power of their infrastructure, Netflix is able to pull in instantaneous and highly accurate movie recommendations for any movie in their enormous database. Think about all the data and intelligent data processing that happens behind the scenes whenever you load a Netflix screen, all customized and updating in real time to improve your customer experience and reel you into their services. Impressive stuff! Quite a revolutionary feature when you think about how we used to select movies at Blockbuster by looking at DVD covers up and down the movie aisles.

Although every recommender system is different, they all follow the same idea, and that idea is also used by Amazon, Spotify and YouTube. On Amazon, people who liked the Apple iPhone also like the Apple iPad. Since their ratings are highly correlated and the items are similar, they would be recommended for each other. On Spotify, if you like the song “Levels” by Avicii, you’ll also like “Save the World” by Swedish House Mafia, so you will likely see these songs pop up for each other in recommended playlists and song radios. On YouTube, if you liked a lot of videos about chicken wing recipes, you could get recommendations for the latest “Hot Ones” episode.

It’s really cool what can be done with data nowadays, and the power it has if people know how to use it. I’m really glad I finally learned how the recommender system works and was able to share this cool algorithm with all of you.

Cheers,

Scion