Modeling approach and project trajectory¶

I. Baseline modeling¶

Using the method described in Advances in Collaborative Filtering (link) by Yehuda Koren and Robert Bell, we created a baseline model with user-bias and restaurant-bias terms. According to Koren and Bell, the baseline model can be expressed as:

$b_{ui} = \mu + b_{u} + b_{i}$

where $\mu$ represents the mean of all ratings, and $b_{u}$ and $b_{i}$ represent the observed deviations of user $u$ and item $i$, respectively, from that mean. In order to find the parameters $b_{u}$ and $b_{i}$, we used two approaches.

Approach 1: Use the formulas given in the Koren and Bell paper to find the bias terms¶

$b_{i} = \frac{\sum_{u \in R(i)} (r_{ui} - \mu)}{\lambda_{2} + |R(i)|}$

$b_{u} = \frac{\sum_{i \in R(u)} (r_{ui} - \mu - b_{i})}{\lambda_{3} + |R(u)|}$

where $\lambda_{2}$ and $\lambda_{3}$ are regularization parameters (chosen by cross-validation), $R(i)$ is the set of users who rated item $i$, and $R(u)$ is the set of items rated by user $u$.
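The two formulas above reduce to a pair of group-bys over the ratings table. A minimal sketch in pandas, assuming a ratings DataFrame with hypothetical column names `user_id`, `business_id`, and `stars`:

```python
import pandas as pd

def baseline_biases(ratings, lam2=10.0, lam3=15.0):
    """Estimate item and user biases with the Koren & Bell formulas.

    `ratings` is assumed to have columns user_id, business_id, stars
    (hypothetical names); lam2 and lam3 are the regularization strengths.
    """
    mu = ratings["stars"].mean()

    # b_i = sum_{u in R(i)} (r_ui - mu) / (lambda_2 + |R(i)|)
    dev = ratings["stars"] - mu
    by_item = dev.groupby(ratings["business_id"])
    b_i = by_item.sum() / (lam2 + by_item.count())

    # b_u = sum_{i in R(u)} (r_ui - mu - b_i) / (lambda_3 + |R(u)|)
    resid = dev - ratings["business_id"].map(b_i)
    by_user = resid.groupby(ratings["user_id"])
    b_u = by_user.sum() / (lam3 + by_user.count())

    return mu, b_u, b_i

# Toy example; the predicted rating for (u, i) is mu + b_u[u] + b_i[i].
ratings = pd.DataFrame({"user_id": ["u1", "u1", "u2"],
                        "business_id": ["b1", "b2", "b1"],
                        "stars": [5.0, 3.0, 1.0]})
mu, b_u, b_i = baseline_biases(ratings)
```

Note that $b_{i}$ is computed first and then plugged into the residual used for $b_{u}$, matching the order of the formulas.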

Approach 2: Use Ridge regression to find the bias terms¶

We fit the bias terms with ridge regression, selecting the regularization strength by cross-validation.
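One way to set this up (a sketch under assumptions, not our exact code) is to one-hot encode users and items, so that ridge regression on the mean-centered ratings recovers one bias coefficient per user and per item; the ids and ratings below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import OneHotEncoder

# Hypothetical review records: user id, item id, and rating per review.
users = np.array([["u1"], ["u1"], ["u2"], ["u3"]])
items = np.array([["b1"], ["b2"], ["b1"], ["b2"]])
stars = np.array([5.0, 3.0, 1.0, 4.0])

# One-hot indicator columns for each user and each item: fitting
# (stars - mu) ~ b_u + b_i with an L2 penalty is ridge regression
# on the bias terms.
enc = OneHotEncoder()
X = enc.fit_transform(np.hstack([users, items])).toarray()

mu = stars.mean()
model = RidgeCV(alphas=[0.1, 1.0, 10.0])  # lambda chosen by cross-validation
model.fit(X, stars - mu)

# Predicted rating for a (user, item) pair:
x_new = enc.transform(np.array([["u2", "b2"]])).toarray()
pred = mu + model.predict(x_new)[0]
```

`RidgeCV` picks the penalty by efficient leave-one-out cross-validation over the supplied grid.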

II. Memory-Based Collaborative filtering¶

As noted above in the Literature section, we wanted to take advantage of the wide user base and restaurant rating history in the Yelp database to provide a better recommendation system than the baseline model. To this aim, we implemented collaborative filtering (CF). The first step of CF is to create a user-item matrix, such as the one shown here for users and movies (image credit: CF ref 1).
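In pandas, the user-item matrix can be built with a single pivot; a small sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical review records: one row per (user, restaurant, rating).
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "business_id": ["b1", "b2", "b1", "b3"],
    "stars": [5, 3, 4, 2],
})

# Rows = users, columns = restaurants, entries = ratings; missing
# (user, restaurant) pairs become NaN, here filled with 0.
user_item = reviews.pivot_table(index="user_id",
                                columns="business_id",
                                values="stars").fillna(0)
```

Every downstream similarity computation operates on this matrix (or its transpose, for item-item similarity).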

Using this user-item matrix, we can calculate similarities both between two items and between two users. For item-item similarity, we look at all users who rated both items and ask how similar their ratings were; for user-user similarity, we look at all items commonly rated by the two users and ask how similar their ratings were. This yields item-item and user-user similarity metrics, leading to item-item collaborative filtering and user-item collaborative filtering, respectively. These concepts are nicely illustrated in the following images (image credit: CF ref 2).

Item-Item collaborative filtering will measure similarity by observing all users who have rated both restaurants.

User-Item collaborative filtering will measure similarity by observing all items that are rated by both users.

To calculate the similarity, we used cosine similarity: we treat the ratings as vectors in n-dimensional space and use the cosine of the angle between two vectors as their similarity. Cosine similarity between users $k$ and $a$ is calculated as:

$$s_{u}^{cos}(u_{k},u_{a}) = \frac{u_{k} \cdot u_{a}}{\left\| u_{k} \right\|\left\| u_{a} \right\|} = \frac{\sum_{m} x_{k,m}x_{a,m}}{\sqrt{\sum_{m} x_{k,m}^{2}\sum_{m} x_{a,m}^{2}}}$$

and between items $m$ and $b$ as:

$$s_{i}^{cos}(i_{m},i_{b}) = \frac{i_{m} \cdot i_{b}}{\left\| i_{m} \right\|\left\| i_{b} \right\|} = \frac{\sum_{a} x_{a,m}x_{a,b}}{\sqrt{\sum_{a} x_{a,m}^{2}\sum_{a} x_{a,b}^{2}}}$$

Because Yelp ratings are on a scale of 1-5 (all positive numbers), the cosine similarity ranges between 0 and 1, where a value closer to 1 means a higher similarity.
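Both similarity matrices can be sketched with scikit-learn's `cosine_similarity`; the toy matrix below is illustrative, not from the Yelp set:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix (rows = users, columns = restaurants); 0 = unrated.
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 4.0],
              [1.0, 1.0, 0.0]])

user_sim = cosine_similarity(R)      # user-user similarities
item_sim = cosine_similarity(R.T)    # item-item similarities

# Because all entries are non-negative, every similarity lies in [0, 1],
# with each vector's similarity to itself equal to 1.
```

A weighted average of neighbors' ratings, weighted by these similarities, gives the memory-based CF prediction.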

III. Model-Based Collaborative filtering¶

Because any given user has rated only a small fraction of all restaurants, our dataframe is very sparse (>99.9% of entries are empty), as expected. Given this sparsity, we also implemented model-based CF, which handles sparsity better than memory-based CF. Other drawbacks of memory-based CF include 1) lack of scalability and 2) the cold-start problem: it is not possible to make predictions for new users or restaurants that have no previous ratings at all. Model-based collaborative filtering can handle higher sparsity levels and is more scalable, but it also suffers from the cold-start problem.

Model-based collaborative filtering handles high sparsity through dimensionality reduction and latent-variable decomposition. Although our dataset contains only users' ratings of businesses, underlying these data are the hidden preferences of users for businesses with certain hidden attributes. These latent variables are learned by the model-based CF. To do this, we used matrix factorization, which provides the latent vectors and allows filling in the sparse original matrix: it predicts unknown ratings by taking the dot product of the latent features of users and items. To implement matrix factorization, we used singular value decomposition (SVD), which is written as follows:

$$X = U S V^{T}$$

Where:

X represents the m x n ratings matrix

U represents the m x r orthogonal matrix

S represents the r x r diagonal matrix with non-negative real numbers on the diagonal; these diagonal elements are the singular values of X

$V^{T}$ represents the r x n orthogonal matrix

Matrix X can thus be factorized into U, S, and V. U contains the feature vectors corresponding to users in the hidden feature space, and V contains the feature vectors corresponding to items in the hidden feature space.

Finally, we make our prediction by taking the dot product of U, S, and $V^{T}$.
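A truncated SVD of this kind can be sketched with `scipy.sparse.linalg.svds`; the matrix and choice of k below are illustrative:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy user-item matrix (dense here, sparse in practice); 0 = unrated.
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [1.0, 0.0, 0.0, 4.0],
              [0.0, 1.0, 5.0, 4.0]])

k = 2  # number of latent features to keep (must satisfy k < min(m, n))
U, s, Vt = svds(R, k=k)

# Reconstruct the rank-k approximation: predicted ratings for all
# (user, item) pairs, including the originally missing entries.
R_pred = U @ np.diag(s) @ Vt
```

Keeping only the top k singular values forces the reconstruction through the low-dimensional latent space, which is what fills in the unrated entries.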

IV. Model evaluation¶

To maintain compatibility with results published by other implementations of recommendation systems, we adopt the standard set by the Netflix Prize: the quality of our results is measured using the root mean squared error (RMSE).

$$\mathit{RMSE} = \sqrt{\frac{1}{N}\sum_{i}(x_{i} - \hat{x}_{i})^{2}}$$

This measure puts more emphasis on large errors than the alternative, mean absolute error.
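The metric is a one-liner in NumPy; a sketch:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error over the rated entries."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Squaring weights one error of 2 as much as four errors of 1,
# which is why RMSE penalizes large misses more than MAE does.
```

In practice the mean is taken only over the held-out (user, item) pairs that actually have ratings, not over the full sparse matrix.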

V. Decisions from the modeling¶

We found that the model-based CF had the lowest RMSE on the test set, so we decided to pursue this approach for the recommendation system. We also observed that the validation RMSE for this model changes little across different values of k (at every value of k we tested, the RMSE was lower than that of the other models by a factor of 10 or more). We therefore chose a large k (k = 100) so that we can make a greater number of predictions.