Here at Lab 41, we launched the Hermes challenge to look into recommender systems as a means to answer the question, "How can we help analysts better connect the dots across a number of different sources and types of data?" My colleagues have spent a great deal of effort sharing their work in this space already, introducing recommender systems in general, going over what datasets we have used to apply recommender system algorithms, and discussing non-standard performance metrics we can use to compare these recommender system algorithms. This blog post will focus on standard performance metrics that we use to compare the different recommender system algorithms.

Think of a recommender system as fulfilling an information retrieval task—retrieving items in an ordered list for a particular user. The standard practice of quantifying how well the system retrieves this information in the scientific world is via accuracy and precision. Accuracy is determined by how close you are to the correct result while precision is how consistently you receive the same result.

We will discuss in more detail exactly what accuracy and precision mean in the world of recommender systems. For now, understand that we will use accuracy and precision as well as other performance metrics to quantify the prediction error of a recommender system. We will also look into Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) to determine how well a recommender system performs when predicting a non-binary rating, for example, what rating a user will give a movie on a scale from 1 to 5. We will also delve into metrics that address the performance of a recommender system that works only with binary ratings, for example, differentiating a movie the user is likely to watch with bated breath and maximum interest from one they will fall asleep watching. In addition, we will define recall and F-score for supplementary insight into recommendations on binary values. And what better way to learn about these performance metrics than to get your hands dirty implementing and using them in Spark! At the edge of your seat already? Well, I recommend that you read on!

Predictive Performance Metrics



There is a plethora of recommendation algorithms that you can apply to a dataset. We view recommendations as a supervised learning problem, where the task is to predict users' ratings of items, interactions between users and items, or even links between users and items. A model trained on a labeled set of user ratings or interactions can be used to make predictions (i.e. recommendations) on a separate set of inputs similar to the training set.

With as many algorithms to choose from as there are mobile dating apps, how can we be sure that we apply the right performance metrics for comparing algorithms? After all, with recommenders as with online dating, being choosy makes a difference!

The answer is: it depends on the dataset in question. If the algorithm predicts a non-binary value, like what rating a user will give a movie on a scale of 1 to 5, we can use RMSE or MAE. RMSE and MAE calculate the distance between the value predicted by the algorithm and the actual value given by the user. Since accuracy is determined by how close the prediction is to the correct result, a lower error means the prediction is closer to the actual value. In other words, with RMSE or MAE, we can determine how well a user's ratings can be reproduced by the recommender system algorithm.

Root Mean Squared Error and Mean Absolute Error



Root Mean Squared Error (RMSE) can be described as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the actual value, and $n$ is the number of items to be predicted.

Subtracting $y_i$ from $\hat{y}_i$ allows us to quantify how far off the predicted value is from the actual value. By squaring the difference, $(\hat{y}_i - y_i)^2$, we can keep the error positive, no matter whether the predicted or the actual value is higher. Summing up all of the squared differences and dividing by the number of items to be predicted gives us the average of the squared prediction errors. Since squaring the difference changes the scale, we need to bring the scale back down by taking the square root of the entire expression. RMSE, therefore, describes roughly how far off the predicted value is from the actual value on average.
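
As a quick illustration of the steps just described (difference, square, average, square root), here is a minimal plain-Python sketch. The function name `rmse` and the sample ratings are made up for this example and are not part of Hermes.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    # Square each prediction error so that every term is non-negative.
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    # Average the squared errors, then undo the squaring with a square root.
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings on a 1-to-5 scale (illustrative data only).
predicted_ratings = [4.0, 3.0, 5.0, 2.0]
actual_ratings = [5.0, 3.0, 4.0, 4.0]
example_rmse = rmse(predicted_ratings, actual_ratings)
print(example_rmse)
```

Here the errors are -1, 0, 1 and -2, so the squared errors sum to 6 and the result is the square root of 6/4.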

Mean Absolute Error, or MAE, can be described as

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|$$

where, as for RMSE, $\hat{y}_i$ is the predicted value, $y_i$ is the actual value, and $n$ is the number of items to be predicted.

The only difference between RMSE and MAE is in how they keep the error positive. RMSE squares the residual while MAE takes its absolute value. Since it doesn't square the difference, MAE punishes large errors much less severely than RMSE does.
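
To make the difference in emphasis concrete, the sketch below compares two hypothetical sets of predictions with the same total absolute error: one that is slightly off everywhere and one that is perfect except for a single large miss. The helper names and the data are invented for illustration.

```python
from math import sqrt


def mae(predicted, actual):
    """Mean absolute error: average of the absolute prediction errors."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


def rmse(predicted, actual):
    """Root mean squared error: square, average, then take the square root."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared) / len(squared))


# Illustrative ratings only.
actual_ratings = [3.0, 3.0, 3.0, 1.0]
evenly_off = [4.0, 4.0, 4.0, 2.0]    # every prediction is off by 1
one_big_miss = [3.0, 3.0, 3.0, 5.0]  # three exact hits, one prediction off by 4

mae_even, rmse_even = mae(evenly_off, actual_ratings), rmse(evenly_off, actual_ratings)
mae_miss, rmse_miss = mae(one_big_miss, actual_ratings), rmse(one_big_miss, actual_ratings)
print(mae_even, rmse_even)
print(mae_miss, rmse_miss)
```

Both sets of predictions have the same MAE, but the single large miss doubles the RMSE, which is the sense in which RMSE weights large deviations more heavily.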

Implementation of RMSE and MAE in Spark



If you are not familiar with the use of resilient distributed datasets (RDD) in Spark, please check out my blog post on the topic as I will be using RDDs throughout this section.

Let's consider the case of a recommender system trying to predict what rating a user will assign to a movie. We have an RDD called y_predicted that the recommender system algorithm outputs. y_predicted is in the form [(user_id, movie_id, predicted_rating)] . It lists the predicted rating for each user and movie pair. y_actual is the RDD that has the actual rating for each user and movie pair. Its format is [(user_id, movie_id, actual_rating)] . To implement RMSE in Spark, we first have to reformat y_predicted and y_actual so that user_id and movie_id are used as the key.

# Key each record by (user_id, movie_id) so that the two RDDs can be joined.
y_predicted_reformat = y_predicted.map(
    lambda r: ((r[0], r[1]), r[2])  # r = (user_id, movie_id, predicted_rating)
)
y_actual_reformat = y_actual.map(
    lambda r: ((r[0], r[1]), r[2])  # r = (user_id, movie_id, actual_rating)
)

Once you have reformatted both RDDs, you can join them so that predicted_rating and actual_rating sit in the same RDD, and then compute their squared difference.

# Join on the (user_id, movie_id) key, then square each prediction error.
ratings_diff_sq = y_predicted_reformat.join(y_actual_reformat) \
    .map(lambda kv: (kv[1][0] - kv[1][1]) ** 2)  # kv = (key, (predicted_rating, actual_rating))

You determine the average of the prediction errors by adding all the differences together with the reduce function and then dividing by the number of ratings.

from operator import add

sum_ratings_diff_sq = ratings_diff_sq.reduce(add)
num = ratings_diff_sq.count()
average_prediction_error = sum_ratings_diff_sq / float(num)

To scale it back down after squaring the difference, take the square root of the prediction error average.

from math import sqrt
rmse = sqrt(average_prediction_error)
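
If you want to sanity check the Spark result on a small sample collected to the driver, the same steps (key by user and movie, join, square, average, root) can be mirrored in plain Python. This is only a sketch: `rmse_from_records` and the sample records are hypothetical, and it assumes each (user_id, movie_id) pair appears at most once per input.

```python
from math import sqrt


def rmse_from_records(predicted_records, actual_records):
    """Mirror the Spark steps locally on lists of (user_id, movie_id, rating) tuples."""
    # Key both inputs by (user_id, movie_id); assumes each key is unique per input.
    predicted_by_key = {(u, m): r for u, m, r in predicted_records}
    actual_by_key = {(u, m): r for u, m, r in actual_records}
    # Inner join: only pairs present in both inputs contribute.
    shared_keys = predicted_by_key.keys() & actual_by_key.keys()
    if not shared_keys:
        raise ValueError("no (user_id, movie_id) pair appears in both inputs")
    squared = [(predicted_by_key[k] - actual_by_key[k]) ** 2 for k in shared_keys]
    return sqrt(sum(squared) / len(squared))


# Illustrative records; user 2 has a prediction for movie 12 but no actual rating.
predicted_sample = [(1, 10, 4.0), (1, 11, 2.0), (2, 10, 5.0), (2, 12, 3.0)]
actual_sample = [(1, 10, 5.0), (1, 11, 2.0), (2, 10, 3.0)]
local_rmse = rmse_from_records(predicted_sample, actual_sample)
print(local_rmse)
```

The unmatched prediction is dropped by the join, so only three pairs enter the average.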

Enough of all that, glad to see you've made it this far! Now that you know how to compute RMSE in Spark, your homework is to implement MAE. The answer can be found in Hermes's GitHub repo, so go check that out after you are done here.

Although it is good practice to know how to implement RMSE and MAE in Spark, Spark's MLlib has its own implementations of each that you should probably use. The only input required is an RDD containing the predicted and actual rating pair in the format of [(predicted_rating, actual_rating)] .

from pyspark.mllib.evaluation import RegressionMetrics

# Each joined record is ((user_id, movie_id), (predicted_rating, actual_rating)),
# so keeping only the value gives the (predicted_rating, actual_rating) pairs.
predicted_and_actual_ratings = y_predicted_reformat.join(y_actual_reformat) \
    .map(lambda kv: kv[1])

metrics = RegressionMetrics(predicted_and_actual_ratings)
rmse = metrics.rootMeanSquaredError
mae = metrics.meanAbsoluteError

Classification Performance Metrics



RMSE and MAE allow us to compare systems that output non-binary values. If we need to compare recommender systems that output binary values, we can employ accuracy, precision, recall or F-scores.

Before we dive deeper into what each of these metrics entails, we should first go over what a confusion matrix is, as this will make it easier for us to understand the metrics. In a binary classification problem where an event can either occur or not occur, there are only four possible combinations of predicted and actual values.

We can either successfully predict whether or not the event occurs:

predict that the event is likely to occur and it does occur: true positive (TP)

predict that the event is not likely to occur and it does not occur: true negative (TN)

or we can fail to do so:

predict that the event is likely to occur but it does not occur: false positive (FP)

predict that the event is not likely to occur but it does occur: false negative (FN)

When your recommender has run on a dataset, you can tally up each of these cases and put them in a table, often called a confusion matrix.

                                    Actual event occurred    Actual event did not occur
Predict that event will occur       TP                       FP
Predict that event will not occur   FN                       TN
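
The tally itself can be sketched in a few lines of plain Python. The function name `confusion_counts` and the sample labels are assumptions made for this example, with 1.0 meaning the event occurs and 0.0 meaning it does not.

```python
def confusion_counts(predicted, actual):
    """Count TP, FP, FN and TN for binary labels (1.0 = occurs, 0.0 = does not occur)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for p, a in zip(predicted, actual):
        if p == 1.0 and a == 1.0:
            counts["TP"] += 1   # predicted to occur, and it did
        elif p == 1.0 and a == 0.0:
            counts["FP"] += 1   # predicted to occur, but it did not
        elif p == 0.0 and a == 1.0:
            counts["FN"] += 1   # predicted not to occur, but it did
        else:
            counts["TN"] += 1   # predicted not to occur, and it did not
    return counts


# Illustrative labels only.
predicted_labels = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
actual_labels = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
counts = confusion_counts(predicted_labels, actual_labels)
print(counts)
```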

Accuracy, Precision, Recall and F-score



Let's look at a similar case to the one we looked at for RMSE and MAE, but instead try to determine if a recommender system algorithm can assess whether or not a movie is good (assuming a clean binary distinction between good and bad movies). To measure how accurate an algorithm is, we can count the number of correct classifications and divide by the total number of cases. We can express that as

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision helps us assess the likelihood that a positive prediction (movie was good) is indeed positive. In fact, in other fields, precision also goes by the name "positive predictive value." It is defined as the ratio of true positives to positive predictions:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Precision can be easily confused with recall. Recall is the proportion of actually positive cases that are correctly classified as positive. In the binary movie quality example, recall measures the proportion of good movies that the recommender system successfully recommends:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
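
Given the four confusion-matrix counts, accuracy, precision and recall are each a one-line ratio. The sketch below uses hypothetical counts and, as an assumed convention, returns 0.0 when a denominator is zero.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    # Assumed convention: report 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actually positive cases that were predicted positive."""
    # Assumed convention: report 0.0 when there are no positive cases.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for 100 movies.
tp, fp, fn, tn = 30, 10, 20, 40
example_accuracy = accuracy(tp, fp, fn, tn)
example_precision = precision(tp, fp)
example_recall = recall(tp, fn)
print(example_accuracy, example_precision, example_recall)
```

With these counts, 40 movies were recommended as good and 30 of them were, while 50 movies were actually good, so precision and recall differ even though they share the same numerator.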

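Precision and recall tend to be inversely related. As we recommend more items, recall can only stay the same or increase, while precision typically decreases, and vice versa. We can combine precision and recall to provide a single measurement for a recommender system: the F-score. The traditional ("balanced") F-score is the harmonic mean of precision and recall, defined as

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$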
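
The trade-off is easy to see on a ranked list. The sketch below computes precision, recall and the balanced F-score (harmonic mean of the two) for the top k recommendations as k grows. The function name, the movie labels and the choice of cut-offs are all made up for illustration.

```python
def precision_recall_f1_at_k(ranked_items, relevant_items, k):
    """Precision, recall and balanced F-score for the top-k items of a ranked list."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_items) if relevant_items else 0.0
    if precision + recall == 0:
        f1 = 0.0  # assumed convention when both precision and recall are zero
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative ranking: movies in the order the recommender would suggest them.
ranked_movies = ["A", "B", "C", "D", "E", "F"]
good_movies = {"A", "C", "F"}  # hypothetical ground truth
curve = [precision_recall_f1_at_k(ranked_movies, good_movies, k) for k in (1, 3, 6)]
for k, (p, r, f1) in zip((1, 3, 6), curve):
    print(k, round(p, 3), round(r, 3), round(f1, 3))
```

In this toy ranking, recommending more movies raises recall from 1/3 to 1 while precision falls from 1 to 1/2.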

Implementation of Precision, Recall, and F-score in Spark



Since Spark has its own library that computes precision, recall and F-score, we will demonstrate how to compute these metrics in MLlib and leave the manual implementation of each as homework. The only input the library requires is an RDD containing the predicted and actual classification pairs. We will use an RDD called predicted_and_actual_classifications in the format of [(predicted_classification, actual_classification)] . We will also assume that the predicted and actual classifications are each either 0.0 or 1.0, where 0.0 is considered a "bad" movie and 1.0 is considered a "good" movie.

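from pyspark.mllib.evaluation import MulticlassMetrics

# confusionMatrix, precision, recall and fMeasure belong to MulticlassMetrics.
metrics = MulticlassMetrics(predicted_and_actual_classifications)
confusion_matrix = metrics.confusionMatrix().toArray()

# Overall metrics across both labels (newer Spark releases may require a label argument).
precision = metrics.precision()
recall = metrics.recall()
f1 = metrics.fMeasure()

# Per-label metrics: 0.0 is a "bad" movie, 1.0 is a "good" movie.
precision_for_bad_movies = metrics.precision(0.0)
precision_for_good_movies = metrics.precision(1.0)
recall_for_bad_movies = metrics.recall(0.0)
recall_for_good_movies = metrics.recall(1.0)

# fMeasure(label, beta): beta = 1.0 gives the balanced F-score.
f1_for_bad_movies = metrics.fMeasure(0.0, 1.0)
f1_for_good_movies = metrics.fMeasure(1.0, 1.0)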
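
To see what a per-label metric means, it helps to treat the chosen label as the "positive" class and recount. The plain-Python sketch below does this for a handful of hypothetical (predicted_classification, actual_classification) pairs. It is a local illustration with invented names, not the MLlib implementation.

```python
def per_label_metrics(pairs, label):
    """Precision, recall and balanced F-score when `label` is treated as the positive class."""
    tp = sum(1 for p, a in pairs if p == label and a == label)
    fp = sum(1 for p, a in pairs if p == label and a != label)
    fn = sum(1 for p, a in pairs if p != label and a == label)
    # Assumed convention: a ratio with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Illustrative (predicted, actual) pairs; 1.0 = "good" movie, 0.0 = "bad" movie.
sample_pairs = [(1.0, 1.0), (1.0, 0.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
good_precision, good_recall, good_f1 = per_label_metrics(sample_pairs, 1.0)
bad_precision, bad_recall, bad_f1 = per_label_metrics(sample_pairs, 0.0)
print(good_precision, good_recall, good_f1)
print(bad_precision, bad_recall, bad_f1)
```

Note that the same predictions score differently depending on which label is treated as positive, which is why it can be worth reporting both.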

Conclusion



Accuracy and precision allow us to train a recommender system by minimizing the prediction error and then to estimate the quality of the recommender system for other recommendations, where we have no ground truth. However, we must not lose sight of other metrics that can describe a given recommender system, as explained in Anna's blog post on metrics for novelty, diversity, and serendipity in recommendations. I encourage you to take a look! If we focus too heavily on accuracy, precision, and recall, our recommender system model might overfit the training data at the expense of producing quality recommendations. Although we only covered a small subset of performance metrics in this post, we hope you have a better understanding of their use. You can learn more about other performance metrics implemented in Spark by reading either the documentation or the source code. And if you are interested in participating in Hermes, please check out our GitHub page. We hope to see you soon!
