Introduction

Recommender systems are a vital tool in a data scientist's toolbox. The aim is simple: given data on customers and the items they've bought, automatically recommend other products they'd like. You can see these systems in action on many websites (Amazon, for example), and they aren't limited to physical products; they can be used for any customer interaction. Recommender systems were introduced in a previous Cambridge Spark tutorial.

In this tutorial, we extend the previous article by showing you how to build recommender systems in Python using cutting-edge algorithms. Many traditional methods for training recommender systems make poor predictions due to a phenomenon known as overfitting. In the context of recommender systems, overfitting means our model will fit the data we already have well, but recommend new products to customers badly (not ideal, given that is its purpose). This happens because so much of the information is missing. Before getting into the code, we'll give more detail on recommender systems, why so much information is missing, and how matrix factorisation addresses the problem.

Recommender Systems and Matrix Factorisation

The data input to a recommender system can be thought of as a large matrix, with the rows corresponding to customers and the columns to items. Let's call this matrix 𝑅. Entry 𝑅ᵢⱼ contains the score that customer 𝑖 has given to product 𝑗. For example, if it's a review this could be a number from 1 to 5, or it might just be 0 or 1, indicating whether a user has bought an item or not. This matrix contains a lot of missing information: it's unlikely a customer has bought every item on Amazon! Recommender systems aim to fill in this missing information by predicting a customer's score for the items where the score is missing, and then recommend the items with the highest predicted scores. A typical example of the matrix 𝑅 with entries that are review values from 1 to 5 is given in the picture below.
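If you're working in Python, a natural way to hold 𝑅 is as a sparse matrix built from (customer, item, score) triplets, so that only the known entries are stored. Here is a minimal sketch; the file name and column names are illustrative assumptions, not part of any particular dataset.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical ratings file with one (customer, item, score) triplet per row.
ratings = pd.read_csv("ratings.csv")  # assumed columns: user_id, item_id, score

# Map raw IDs to consecutive row/column indices.
user_index = {u: i for i, u in enumerate(ratings["user_id"].unique())}
item_index = {m: j for j, m in enumerate(ratings["item_id"].unique())}

rows = ratings["user_id"].map(user_index)
cols = ratings["item_id"].map(item_index)

# Sparse storage: only the observed entries of R are kept in memory.
R = csr_matrix((ratings["score"], (rows, cols)),
               shape=(len(user_index), len(item_index)))
```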

The Netflix Prize was a competition designed to find the best algorithms for recommender systems, run by Netflix using their movie-rating data. During the challenge one type of algorithm stood out for its excellent performance, and it has remained popular ever since: matrix factorisation. This method works by factorising the matrix 𝑅 into two lower dimensional matrices 𝑈 and 𝑉, so that 𝑅 = 𝑈ᵀ𝑉.

Suppose that 𝑅 has dimension 𝑑₁×𝑑₂; then 𝑈 will have dimension 𝐷×𝑑₁ and 𝑉 will have dimension 𝐷×𝑑₂. Here 𝐷 is chosen by the user: it needs to be large enough to encode the nuances of 𝑅, but making it too large slows the method down and can lead to overfitting. A typical size of 𝐷 is 20.
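To make the shapes concrete, here is a small numpy sketch with illustrative toy values for 𝑑₁, 𝑑₂ and 𝐷; a real model would learn 𝑈 and 𝑉 from data rather than drawing them at random.

```python
import numpy as np

d1, d2 = 1000, 1700   # number of customers and items (toy values)
D = 20                # latent dimension, chosen by the user

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(D, d1))  # one D-dimensional column per customer
V = rng.normal(scale=0.1, size=(D, d2))  # one D-dimensional column per item

R_hat = U.T @ V              # predicted scores for every pair, shape (d1, d2)
score_ij = U[:, 0] @ V[:, 5]  # predicted score for customer 0 on item 5
```

Each column of 𝑈 is a 𝐷-dimensional representation of a customer, each column of 𝑉 of an item, and their dot product gives the predicted score.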

While at first glance this factorisation seems quite easy, it is made much more difficult by all the missing data. Imputing the missing entries might work, but it makes the methods very slow. Instead, most popular methods consider only the matrix entries 𝑅ᵢⱼ that are known, and fit the factorisation to minimise the error on those known entries. The problem with doing this is that the predictions will suffer from overfitting. The methods get around this by using regularisation, a common technique for reducing overfitting that penalises large values in 𝑈 and 𝑉. For more details about the workings of the methods, please see the further reading at the end of this post.
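To illustrate the idea, below is a minimal sketch of matrix factorisation trained by stochastic gradient descent over the known entries only, with an L2 regularisation penalty. The function name and hyperparameter values are our own illustrative choices, not the API of any particular library.

```python
import numpy as np

def factorise(observed, d1, d2, D=20, lam=0.1, lr=0.01, epochs=20, seed=0):
    """Fit R ≈ UᵀV by SGD over the observed entries only.

    observed is a list of (i, j, r) triples of known ratings;
    lam is the L2 regularisation strength that damps overfitting.
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(D, d1))
    V = rng.normal(scale=0.1, size=(D, d2))
    for _ in range(epochs):
        for i, j, r in observed:
            u_i = U[:, i].copy()              # cache before updating
            err = r - u_i @ V[:, j]           # error on a known entry
            # Gradient steps with an L2 penalty on the factors.
            U[:, i] += lr * (err * V[:, j] - lam * u_i)
            V[:, j] += lr * (err * u_i - lam * V[:, j])
    return U, V
```

Predicted scores for customer 𝑖 are then given by U[:, i] @ V, and the items with the highest predictions are the ones to recommend.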

The Package