Pearson correlation coefficient

In statistics, the Pearson correlation coefficient is a measure of the linear dependence or correlation between two variables X and Y. Its value lies between +1 and −1 inclusive, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. In a recommender system, we want to figure out how related two people are based on the items they have both ranked. In this setting, the Pearson Correlation Coefficient (PCC) is best understood as a measure of how well the two sets of ratings fit a single straight line (we're not taking dimensions into account). The derivation and the formula itself take some effort to find and understand, but this method has an important advantage: it eliminates the weight of harshness, so a person who rates consistently harsher than another can still come out as highly correlated with them.

The PCC algorithm requires two datasets as input. Those datasets don't come from how each person ranked all the items; they come from the items the two people have both ranked. PCC helps us find the similarity of a pair of users: rather than considering the distance between the rankings of two products, we consider the correlation between the users' ratings.
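That input step can be sketched as follows. The dict-of-dicts layout and the name common_ratings are assumptions for illustration, not the article's actual code:

```python
def common_ratings(ratings, person1, person2):
    # Keep only the items both people have rated; these paired
    # values are the two datasets the PCC formula operates on.
    shared = set(ratings[person1]) & set(ratings[person2])
    xs = [ratings[person1][item] for item in sorted(shared)]
    ys = [ratings[person2][item] for item in sorted(shared)]
    return xs, ys
```

Items ranked by only one of the two people are dropped entirely, so both returned lists always have the same length.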

To clarify the concept of correlation, we include a new dataset and some charts. The dataset contains a few ratings given by some remarkable computer scientists to some CS books.

A sample dataset to illustrate this similarity measure: how some famous computer scientists have rated some famous CS books.

In order to understand how related two people are, we proceed by plotting their preferences, treating each book as a point whose coordinates are determined by the two users' ratings of that item. Once we have that plot, we need to find the best-fit straight line over those points. Finding such a line requires knowledge of linear regression, a topic that's out of the scope of this tutorial. While finding the best-fit straight line is not as trivial as it seems, computing the PCC depends only on the data we already have; the best-fit line serves us only to explain the concept.

The plot shows the 2-dimensional space defined by the ratings of Ullman and Carmack, as well as the best-fit straight line. The positive slope of the line shows a positive correlation between those points, so the PCC for Ullman and Carmack is positive.

The last plot shows a negative correlation between Navarro and Norvig.

If we have one dataset {x[1], x[2], …, x[n]} containing n elements, and another dataset {y[1], y[2], …, y[n]} containing n elements, the formula for the sample PCC is:
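In standard notation, where \bar{x} and \bar{y} denote the sample means of the two datasets:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```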

A little algebraic manipulation yields the following equivalent formula:
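Written out in terms of raw sums, this equivalent form avoids computing the means explicitly and can be evaluated in a single pass over the data:

```latex
r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,
          \sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}
```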

This formula lets us write a program to compute the PCC between two people.

The implementation in Python of the Pearson Correlation Coefficient similarity measure is directly inspired by the formula we just found. The whole code for this toy Recommender System is on Github.
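A sketch of such an implementation (the function name and the dict-of-dicts ratings layout are assumptions; the repository's actual code may differ) translates the single-pass formula directly:

```python
from math import sqrt

def pearson_similarity(ratings, person1, person2):
    # The two datasets: ratings of the items both people have ranked.
    common = [item for item in ratings[person1] if item in ratings[person2]]
    n = len(common)
    if n == 0:
        return 0.0  # no overlap, no measurable correlation
    xs = [ratings[person1][item] for item in common]
    ys = [ratings[person2][item] for item in common]
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Numerator and denominator of the single-pass PCC formula.
    num = n * sum_xy - sum_x * sum_y
    den = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    if den == 0:
        return 0.0  # at least one person's common ratings are constant
    return num / den
```

Note the harshness property in action: if one person's ratings are exactly twice the other's, the result is still a perfect +1.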

Both similarity measures allow us to figure out how similar two people are. The logic behind a recommender system is to measure everyone against a given person and find the people closest to that person; we can do that by taking the group of people for whom the distance is small, or the similarity is high.
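The "find the closest people" step can be sketched as follows. The name top_matches, the signature, and the dict-of-dicts ratings layout are illustrative assumptions; any similarity function with the same shape can be plugged in:

```python
def top_matches(ratings, person, bound, similarity):
    # Score everyone else against `person` and keep the `bound`
    # most similar people, highest similarity first.
    scores = [(similarity(ratings, person, other), other)
              for other in ratings if other != person]
    scores.sort(reverse=True)
    return scores[:bound]
```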

By using this approach, we try to predict the rating our person would give to the products he has not rated yet. One of the most common approaches to this problem is to take the ratings of all the other users and multiply each user's similarity to our person by the rating that user gave to the product. A very popular product, rated by many people, would then carry a greater weight; to normalize this, we divide that weighted sum by the sum of the similarities of all the people who have rated the product. The following function implements this approach.

How to recommend items or retrieve similar people given a specific person and a similarity measure. We're using a higher-order function to test the recommendations produced by any similarity measure function. The whole code for this toy Recommender System is on Github.
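A sketch of that higher-order function, under the same assumed dict-of-dicts layout (the name, signature, and the explicit ratings parameter are illustrative; the repository's code may close over its data instead):

```python
def recommend(ratings, person, bound, similarity):
    """Estimate ratings for the items `person` has not rated yet,
    as a similarity-weighted average of other users' ratings."""
    totals = {}    # item -> sum of (similarity * rating)
    sim_sums = {}  # item -> sum of similarities, for normalization
    for other in ratings:
        if other == person:
            continue
        sim = similarity(ratings, person, other)
        if sim <= 0:
            continue  # ignore people with no (or negative) similarity
        for item, rating in ratings[other].items():
            if item not in ratings[person]:
                totals[item] = totals.get(item, 0.0) + sim * rating
                sim_sums[item] = sim_sums.get(item, 0.0) + sim
    # Divide each weighted sum by the total similarity of that item's
    # raters, then return the `bound` highest estimated ratings.
    scores = sorted(((total / sim_sums[item], item)
                     for item, total in totals.items()), reverse=True)
    return scores[:bound]
```

Because it takes the similarity function as an argument, the same code works with euclidean_similarity or pearson_similarity unchanged.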

Given a person included in the index (data), a bound (that is, the maximum number of items to recommend), and a function to measure the similarity between people (euclidean_similarity or pearson_similarity), this function estimates how the person would rate each item according to how similar people rated it. As an example:

While Algorithms and Concurrency are perfect topics to recommend to Alan Perlis (or at least that is what our algorithm found), we should keep Marvin Minsky far from the Databases item. There is a strange phenomenon here: depending on the similarity measure, Marvin Minsky seems to like a lot, or dislike a little bit, the Programming language theory and Formal methods items. By inspecting the scores variable in the code when you call recommend("Marvin Minsky", 5), you can tell that Robin Milner and John McCarthy are the people closest to Marvin Minsky, even though Robin Milner and John McCarthy are very different from each other; Robin Milner also tends to rate a little harsher than John McCarthy. That insight clearly teaches us that we need to compare both measures against the nature of our data; the choice of bound also affects this kind of strange recommendation.

Data exploration and wrangling are significant factors when implementing a production recommender system: the more data it can process, the better the recommendations we can give our users. While recommender systems theory is much broader than this, recommender systems are a perfect canvas for exploring machine learning and data mining ideas and algorithms, not only because of the nature of the data, but because of the relative ease of visualizing and comparing the results.