In this tutorial, we will dive into recommendation systems.

You might not know what recommendation systems are, but you see them everywhere on the internet.

Every time you shop on Amazon and see related products...

Or when Netflix recommends something interesting to watch...

The purpose of a recommendation system is to predict a rating that a user will give to an item that they have not yet rated.

This rating is produced by analyzing either item characteristics or other user/item ratings (or both) to provide personalized recommendations to users.

There are 2 main approaches to recommendation systems:

Content Filtering. Recommendations depend on item characteristics.

Collaborative Filtering. Recommendations depend on user-item ratings.

In this tutorial we will work with the MovieLens dataset. This dataset contains user-generated movie ratings from the website MovieLens (https://movielens.org/).

It contains multiple files, but in this tutorial we will only use movies.dat and ratings.dat.

First we will download the dataset:

wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
unzip ml-1m.zip
cd ml-1m/

Content Filtering

Here are the first rows of the movies.dat file. The file follows the format:

movieid::movietitle::movie genre(s)

head movies.dat
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller

Genres are separated by a pipe (|).

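We now load the movies file:

import pandas as pd
import numpy as np
# '::' is a multi-character separator, so we ask for the Python parsing engine.
# The ml-1m files are not UTF-8 encoded; latin-1 decodes the accented titles.
movies_df = pd.read_table('movies.dat', header=None, sep='::', engine='python', encoding='latin-1', names=['movie_id', 'movie_title', 'movie_genre'])
movies_df.head()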

Out[]:

   movie_id               movie_title                   movie_genre
0         1          Toy Story (1995)   Animation|Children's|Comedy
1         2            Jumanji (1995)  Adventure|Children's|Fantasy
2         3   Grumpier Old Men (1995)                Comedy|Romance
3         4  Waiting to Exhale (1995)                  Comedy|Drama

In order to work with the movie_genre column, we need to transform it into what are called "dummy variables".

This is a way to convert a categorical variable (e.g. Animation, Comedy, Romance...) into multiple columns (one column named Action, one named Comedy, etc).

For each movie, these dummy columns have a value of 0, except for the genres the movie belongs to, which have a value of 1.
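To make the transformation concrete, here is a minimal sketch on a made-up two-movie frame. The names toy_df and toy_dummies and the movie titles are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical two-movie frame, used only to illustrate the transformation.
toy_df = pd.DataFrame({
    'movie_title': ['Movie A', 'Movie B'],
    'movie_genre': ['Comedy|Romance', 'Action'],
})

# One 0/1 column per distinct genre found in the pipe-separated strings.
toy_dummies = toy_df.movie_genre.str.get_dummies(sep='|')
print(toy_dummies)
```

Each row ends up with a 1 in the columns of the genres listed for that movie and a 0 everywhere else.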

# Convert the movie genres to a set of dummy variables
movies_df = pd.concat([movies_df, movies_df.movie_genre.str.get_dummies(sep='|')], axis=1)
movies_df.head()

Out[]:

   movie_id                         movie_title                   movie_genre  Action  Adventure  Animation  Children's  Comedy  Crime  Documentary  ...
0         1                    Toy Story (1995)   Animation|Children's|Comedy       0          0          1           1       1      0            0  ...
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy       0          1          0           1       0      0            0  ...
2         3             Grumpier Old Men (1995)                Comedy|Romance       0          0          0           0       1      0            0  ...
3         4            Waiting to Exhale (1995)                  Comedy|Drama       0          0          0           0       1      0            0  ...
4         5  Father of the Bride Part II (1995)                        Comedy       0          0          0           0       1      0            0  ...

So, for example, the movie with an id of 1, Toy Story, belongs to the genres Animation, Children's and Comedy, and thus the columns Animation, Children's and Comedy have a value of 1.

movie_categories = movies_df.columns[3:]
movies_df.loc[0]

Out[]:

movie_id       1
movie_title    Toy Story (1995)
movie_genre    Animation|Children's|Comedy
Action         0
Adventure      0
Animation      1
Children's     1
Comedy         1
Crime          0
Documentary    0
Drama          0
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
War            0
Western        0
Name: 0, dtype: object

Content filtering is a simple way to build a recommendation system. Here, items (in this example movies) are mapped to a set of features (genres).

To recommend an item to a user, that user first has to provide their preferences regarding those features.

So in this example, the user has to tell the system how much they like each movie genre.

Right now we have all the movies mapped into genres. We just need to create a user and map that user into those genres.

Let's create a user with a strong preference for action, adventure, fantasy and sci-fi movies.

from collections import OrderedDict
# Start from the genre columns so the preferences keep the same order as movie_categories:
# the scoring code pairs the genre vector and the preference vector by position.
user_preferences = OrderedDict((genre, 0) for genre in movie_categories)
user_preferences['Action'] = 5
user_preferences['Adventure'] = 5
user_preferences['Animation'] = 1
user_preferences["Children's"] = 1
user_preferences['Comedy'] = 3
user_preferences['Crime'] = 2
user_preferences['Documentary'] = 1
user_preferences['Drama'] = 1
user_preferences['Fantasy'] = 5
user_preferences['Film-Noir'] = 1
user_preferences['Horror'] = 2
user_preferences['Musical'] = 1
user_preferences['Mystery'] = 3
user_preferences['Romance'] = 1
user_preferences['Sci-Fi'] = 5
user_preferences['Thriller'] = 3
user_preferences['War'] = 2
user_preferences['Western'] = 1

Once we have users with their movie genre preferences and the movies mapped into genres, computing the score of a movie for a specific user is just a matter of calculating the dot product of the movie's genre vector and the user's preference vector.

# In production you would use np.dot instead of writing your own dot product function.
def dot_product(vector_1, vector_2):
    return sum(i * j for i, j in zip(vector_1, vector_2))
def get_movie_score(movie_features, user_preferences):
    return dot_product(movie_features, user_preferences)
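As a quick check of the remark about np.dot, the sketch below compares a hand-written dot product with NumPy's on a made-up four-genre example. The vectors movie_features and user_weights are hypothetical values chosen for illustration.

```python
import numpy as np

def dot_product(vector_1, vector_2):
    # Sum of element-wise products, pairing the two vectors by position.
    return sum(i * j for i, j in zip(vector_1, vector_2))

# Hypothetical 4-genre example: the movie belongs to the first and last genre.
movie_features = [1, 0, 0, 1]
user_weights = [5, 1, 3, 2]

manual_score = dot_product(movie_features, user_weights)
numpy_score = int(np.dot(movie_features, user_weights))
print(manual_score, numpy_score)
```

Both calls pair the vectors by position, so the two sequences must list the genres in the same order.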

Let's compute the score of the movie 'Toy Story' (a children's animation movie) for the sample user.

toy_story_features = movies_df.loc[0][movie_categories]
toy_story_features

Action         0
Adventure      0
Animation      1
Children's     1
Comedy         1
Crime          0
Documentary    0
Drama          0
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
War            0
Western        0
Name: 0, dtype: object

toy_story_user_predicted_score = dot_product(toy_story_features, user_preferences.values())
toy_story_user_predicted_score

Out[]:

5

So, for this user, Toy Story has a score of 5. That number does not mean much by itself, but it lets us compare how good a recommendation Toy Story is relative to other movies.

Let's calculate the score for Die Hard (a thrilling action movie):

movies_df[movies_df.movie_title.str.contains('Die Hard')]

      movie_id                        movie_title      movie_genre  Action  Adventure  Animation  Children's  Comedy  Crime  Documentary  ...
163        165  Die Hard: With a Vengeance (1995)  Action|Thriller       1          0          0           0       0      0            0  ...
1023      1036                    Die Hard (1988)  Action|Thriller       1          0          0           0       0      0            0  ...
1349      1370                  Die Hard 2 (1990)  Action|Thriller       1          0          0           0       0      0            0  ...

die_hard_id = 1036
die_hard_features = movies_df[movies_df.movie_id == die_hard_id][movie_categories]
die_hard_features.T

Out[]:

             1023
Action          1
Adventure       0
Animation       0
Children's      0
Comedy          0
Crime           0
Documentary     0
Drama           0
Fantasy         0
Film-Noir       0
Horror          0
Musical         0
Mystery         0
Romance         0
Sci-Fi          0
Thriller        1
War             0
Western         0

Note: 1023 is the DataFrame row index for Die Hard, not its movie id in the MovieLens dataset.

die_hard_user_predicted_score = dot_product(die_hard_features.values[0], user_preferences.values())
die_hard_user_predicted_score

Out[]:

8

So we see that Die Hard gets a score of 8 versus 5 for Toy Story, which means Die Hard would be recommended before Toy Story. That makes sense, given that this user's preferences are skewed towards action-packed movies.

Once we know how to calculate the score for one movie, providing movie recommendations for the user is as easy as calculating the score for all the movies and returning those with the highest scores.

def get_movie_recommendations(user_preferences, n_recommendations):
    # Add a column to movies_df with the calculated score of each movie for the given user
    movies_df['score'] = movies_df[movie_categories].apply(get_movie_score, args=(list(user_preferences.values()),), axis=1)
    return movies_df.sort_values(by=['score'], ascending=False)['movie_title'][:n_recommendations]
get_movie_recommendations(user_preferences, 10)

Out[]:

2253    Soldier (1998)
257     Star Wars: Episode IV - A New Hope (1977)
2036    Tron (1982)
1197    Army of Darkness (1993)
2559    Star Wars: Episode I - The Phantom Menace (1999)
1985    Honey, I Shrunk the Kids (1989)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
1111    Abyss, The (1989)
1848    Armageddon (1998)
2847    Total Recall (1990)
Name: movie_title, dtype: object

So the system recommends heavy action and sci-fi movies. Neat!
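Scoring every movie is a matrix-vector product, so it can also be written without a row-by-row loop. The following is a minimal sketch on made-up data, not the code used to produce the output above; toy_movies, toy_preferences and the movie titles are hypothetical.

```python
import pandas as pd

# Hypothetical genre matrix: one row per movie, one 0/1 column per genre.
toy_movies = pd.DataFrame(
    {'Action': [1, 0, 1], 'Comedy': [0, 1, 1], 'Drama': [0, 0, 0]},
    index=['Movie A', 'Movie B', 'Movie C'],
)

# Preferences deliberately listed in a different order than the columns.
toy_preferences = pd.Series({'Comedy': 3, 'Drama': 1, 'Action': 5})

# DataFrame.dot matches the Series index against the column names,
# so each weight is applied to the genre it was given for.
toy_scores = toy_movies.dot(toy_preferences)
top_two = toy_scores.sort_values(ascending=False).head(2)
print(top_two)
```

Matching weights to genres by name rather than by position removes the need to keep the two vectors in the same order.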

Content filtering makes recommending to a new user very easy: users just have to express their preferences once. However, content filtering has some drawbacks:

Each item needs to be mapped into the feature space. That means that any time a new item gets added, someone has to manually categorize that item.

Recommendations are limited in scope. Items can only be described by the features that were defined in advance, so they can't be categorized by new features.

So content filtering may be too simple an option nowadays, which leads us to:

Collaborative Filtering

Collaborative filtering is another way of predicting user-item scores. This time though, we will use the existing user-item scores to predict the missing ones.

The assumption is that users get value from recommendations based on other users with similar tastes.

For this example we will use the ratings.dat file. This file follows the format:

userid::movieid::rating::timestamp

head ratings.dat
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368

The MovieLens dataset provides us with a file that includes over 1 million movie ratings.

# '::' is a multi-character separator, so we ask for the Python parsing engine
ratings_df = pd.read_table('ratings.dat', header=None, sep='::', engine='python', names=['user_id', 'movie_id', 'rating', 'timestamp'])
# We don't care about the time the rating was given
del ratings_df['timestamp']
# Add movie_title next to movie_id for legibility
ratings_df = pd.merge(ratings_df, movies_df, on='movie_id')[['user_id', 'movie_title', 'movie_id', 'rating']]
ratings_df.head()

Out[]:

   user_id                             movie_title  movie_id  rating
0        1  One Flew Over the Cuckoo's Nest (1975)      1193       5
1        2  One Flew Over the Cuckoo's Nest (1975)      1193       5
2       12  One Flew Over the Cuckoo's Nest (1975)      1193       4
3       15  One Flew Over the Cuckoo's Nest (1975)      1193       4
4       17  One Flew Over the Cuckoo's Nest (1975)      1193       5



The dataset is a matrix of users and movie ratings, so we convert the ratings_df to a matrix with a user per row and a movie per column.

ratings_mtx_df = ratings_df.pivot_table(values='rating', index='user_id', columns='movie_title')
ratings_mtx_df.fillna(0, inplace=True)
movie_index = ratings_mtx_df.columns
ratings_mtx_df.head()

Out[]:

movie_title  $1,000,000 Duck (1971)  'Night Mother (1986)  'Til There Was You (1997)  ...
user_id
1                                 0                     0                          0  ...
2                                 0                     0                          0  ...
3                                 0                     5                          0  ...
4                                 0                     0                          1  ...
5                                 0                     0                          0  ...

We have a matrix of 6040 users and 3706 movies.
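Most cells of such a matrix are empty, because each user rates only a small fraction of the movies. The sketch below builds a tiny user-by-movie matrix from made-up ratings and counts the missing cells before they are filled with 0. The frame toy_ratings_df and its values are hypothetical, and treating an unrated movie as a rating of 0 is a simplification rather than the only possible choice.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy_ratings_df = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'movie_title': ['Movie A', 'Movie B', 'Movie A', 'Movie C'],
    'rating': [5, 3, 4, 2],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_mtx = toy_ratings_df.pivot_table(values='rating', index='user_id', columns='movie_title')
n_missing = int(toy_mtx.isna().sum().sum())

# Fill the gaps with 0 so the matrix can be passed to numeric routines.
toy_mtx = toy_mtx.fillna(0)
print(toy_mtx.shape, n_missing)
```

Here 4 ratings spread over a 3 x 3 matrix leave 5 cells without a rating.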

To compute similarities between movies, one way is to find the correlation between movies and then use that correlation to find similar movies to those the users have liked.

An easy way of doing this in Python is to use the numpy.corrcoef function, which calculates the Pearson product-moment correlation coefficient (PMCC) between each pair of items.

The PMCC takes a value between -1 and 1 and measures the linear correlation (positive or negative) between two variables.
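For intuition, the coefficient can be computed by hand as the sum of products of the centered values divided by the square root of the product of their sums of squares. The sketch below does that for two made-up rating vectors and compares the result with np.corrcoef; the helper pearson and the vectors x and y are hypothetical names used only here.

```python
import numpy as np

# Hypothetical ratings given to two movies by the same four users.
x = np.array([5.0, 3.0, 4.0, 1.0])
y = np.array([4.0, 2.0, 5.0, 1.0])

def pearson(a, b):
    # Center both vectors, then normalize the sum of products.
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

r_manual = pearson(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

The two values should agree up to floating-point error, and a vector compared with its own negation gives a coefficient of -1.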

A correlation matrix is a matrix of m x m shape, where element Mij represents the correlation between item i and item j.

corr_matrix = np.corrcoef(ratings_mtx_df.T)
corr_matrix.shape

Out[]:

(3706, 3706)

Note: We use the transposed ratings matrix to calculate the correlation matrix so it gives back the correlation between movies (rows). If we used the ratings matrix without transposing it, np.corrcoef would return the correlation between users.
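The effect of the transposition is easy to see on a small array. In this sketch, toy_ratings is a made-up matrix of 4 users by 3 movies; np.corrcoef treats each row of its input as one variable.

```python
import numpy as np

# Hypothetical ratings: 4 users (rows) x 3 movies (columns).
toy_ratings = np.array([
    [5, 4, 1],
    [4, 5, 2],
    [1, 2, 5],
    [2, 1, 4],
])

movie_corr = np.corrcoef(toy_ratings.T)  # rows of the input are movies: 3 x 3 result
user_corr = np.corrcoef(toy_ratings)     # rows of the input are users: 4 x 4 result
print(movie_corr.shape, user_corr.shape)
```

In this toy data the first two movies are liked by the same users and correlate positively, while the first and third movies correlate negatively.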

Now, if we want to find movies similar to a specific movie, it's just a matter of returning those movies that have a high correlation coefficient with that one.

favoured_movie_title = 'Toy Story (1995)'
favoured_movie_index = list(movie_index).index(favoured_movie_title)
P = corr_matrix[favoured_movie_index]
# Only return those movies with a high correlation with Toy Story
list(movie_index[(P > 0.4) & (P < 1.0)])

Out[]:

['Aladdin (1992)', "Bug's Life, A (1998)", 'Groundhog Day (1993)', 'Lion King, The (1994)', 'Toy Story 2 (1999)']

Now, to provide recommendations to a user, we take the list of movies that user has rated. Then we sum the correlations of those movies with all the other ones and return a list of the other movies sorted by their total correlation with the user's movies.

def get_movie_similarity(movie_title):
    '''Returns the correlation vector for a movie'''
    movie_idx = list(movie_index).index(movie_title)
    return corr_matrix[movie_idx]
def get_movie_recommendations(user_movies):
    '''Given a list of movie titles, returns all the other movies sorted by their total correlation with them'''
    movie_similarities = np.zeros(corr_matrix.shape[0])
    for movie_title in user_movies:
        movie_similarities = movie_similarities + get_movie_similarity(movie_title)
    similarities_df = pd.DataFrame({
        'movie_title': movie_index,
        'sum_similarity': movie_similarities
    })
    # Drop the movies the user has already rated
    similarities_df = similarities_df[~(similarities_df.movie_title.isin(user_movies))]
    similarities_df = similarities_df.sort_values(by=['sum_similarity'], ascending=False)
    return similarities_df
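To see the summing step in isolation, here is a minimal sketch on a hand-written 4 x 4 correlation matrix. The names toy_titles, toy_corr and recommend, and all the correlation values, are hypothetical.

```python
import numpy as np
import pandas as pd

toy_titles = pd.Index(['Movie A', 'Movie B', 'Movie C', 'Movie D'])

# Hypothetical symmetric correlation matrix between the four movies.
toy_corr = np.array([
    [1.0, 0.8, -0.2, 0.1],
    [0.8, 1.0, 0.0, 0.3],
    [-0.2, 0.0, 1.0, 0.6],
    [0.1, 0.3, 0.6, 1.0],
])

def recommend(user_titles):
    # Add up the correlation rows of every movie the user has rated.
    total = np.zeros(len(toy_titles))
    for title in user_titles:
        total += toy_corr[toy_titles.get_loc(title)]
    scores = pd.Series(total, index=toy_titles)
    # Remove the movies the user already knows, then rank the rest.
    return scores.drop(user_titles).sort_values(ascending=False)

toy_recs = recommend(['Movie A', 'Movie B'])
print(toy_recs)
```

For a user who rated Movie A and Movie B, Movie D (0.1 + 0.3) ranks above Movie C (-0.2 + 0.0). Note that this sum ignores the ratings themselves: a movie the user disliked contributes as much as one they loved.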

For example, let's select a user with a preference for kids' movies and some action movies.

sample_user = 21
ratings_df[ratings_df.user_id == sample_user].sort_values(by=['rating'], ascending=False)

Out[]:

        user_id  movie_title                                        movie_id  rating
583304       21  Titan A.E. (2000)                                      3745       5
707307       21  Princess Mononoke, The (Mononoke Hime) (1997)          3000       5
70742        21  Star Wars: Episode VI - Return of the Jedi (1983)      1210       5
239644       21  South Park: Bigger, Longer and Uncut (1999)            2700       5
487530       21  Mad Max Beyond Thunderdome (1985)                      3704       4
707652       21  Little Nemo: Adventures in Slumberland (1992)          2800       4
708015       21  Stop! Or My Mom Will Shoot (1992)                      3268       3
706889       21  Brady Bunch Movie, The (1995)                           585       3
623947       21  Iron Giant, The (1999)                                 2761       3
619784       21  Wild Wild West (1999)                                  2701       3
4211         21  Bug's Life, A (1998)                                   2355       3
368056       21  Akira (1988)                                           1274       3
226126       21  Who Framed Roger Rabbit? (1988)                        2987       3
41633        21  Toy Story (1995)                                          1       3
34978        21  Aladdin (1992)                                          588       3
33432        21  Antz (1998)                                            2294       3
18917        21  Bambi (1942)                                           2018       1
612215       21  Devil's Advocate, The (1997)                           1645       1
617656       21  Prince of Egypt, The (1998)                            2394       1
440983       21  Pinocchio (1940)                                        596       1
707674       21  Messenger: The Story of Joan of Arc, The (1999)        3053       1
708194       21  House Party 2 (1991)                                   3774       1

Now we provide movie recommendations to the sample user by using their list of rated movies as input.

sample_user_movies = ratings_df[ratings_df.user_id == sample_user].movie_title.tolist()
recommendations = get_movie_recommendations(sample_user_movies)
# Get the top 20 recommended movies
recommendations.movie_title.head(20)

Out[]:

1939    Lion King, The (1994)
324     Beauty and the Beast (1991)
1948    Little Mermaid, The (1989)
3055    Snow White and the Seven Dwarfs (1937)
647     Charlotte's Web (1973)
679     Cinderella (1950)
1002    Dumbo (1941)
301     Batman (1989)
3250    Sword in the Stone, The (1963)
303     Batman Returns (1992)
2252    Mulan (1998)
2924    Secret of NIMH, The (1982)
2808    Robin Hood (1973)
3026    Sleeping Beauty (1959)
1781    Jungle Book, The (1967)
260     Back to the Future Part III (1990)
259     Back to the Future Part II (1989)
2558    Peter Pan (1953)
2347    NeverEnding Story, The (1984)
97      Alice in Wonderland (1951)
Name: movie_title, dtype: object

So we see that the system recommends mostly kids' movies and some action movies. Neat!

Collaborative filtering is a widely used recommendation approach nowadays. It is capable of recommending new items without anyone having to manually categorize them. It is also able to find recommendations based on hidden features that an expert wouldn't be able to find (for example, combinations of genres or actors).

However, it has one major drawback. Collaborative filtering cannot recommend items to a new user until they have rated some items. This problem is called the Cold Start Issue.

One way recommender systems overcome this issue is by using a hybrid of content and collaborative filtering. That is, using collaborative filtering where possible and falling back on content filtering when necessary.
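One simple way to combine the two approaches is to switch between them based on how many movies the user has rated. The sketch below is only an illustration of that idea under stated assumptions: hybrid_recommendations, the min_ratings threshold of 5 and the two stand-in recommenders are all hypothetical, and real hybrid systems often blend the scores of both methods rather than switching.

```python
def hybrid_recommendations(user_movies, content_recommender,
                           collaborative_recommender, min_ratings=5):
    """Pick a recommender based on how many movies the user has rated."""
    if len(user_movies) < min_ratings:
        # Cold start: too few ratings to rely on correlations,
        # so fall back on the user's stated genre preferences.
        return content_recommender()
    return collaborative_recommender(user_movies)

# Stand-in recommenders so the sketch runs on its own.
def content_stub():
    return ['content-based pick']

def collaborative_stub(user_movies):
    return ['collaborative pick']

new_user_recs = hybrid_recommendations([], content_stub, collaborative_stub)
active_user_recs = hybrid_recommendations(
    ['Movie %d' % i for i in range(10)], content_stub, collaborative_stub)
print(new_user_recs, active_user_recs)
```

In the tutorial's terms, the two stand-ins would be replaced by the content filtering and collaborative filtering functions defined earlier.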

Further reading

Here are a few interesting readings on Recommendation systems.