Last post I described how I collected implicit feedback data from the website Sketchfab. I then claimed I would write about how to actually build a recommendation system with this data. Well, here we are! Let’s build.

I think the best place to start when looking into implicit feedback recommenders is with the model outlined in the classic paper “Collaborative Filtering for Implicit Feedback Datasets” by Koren et.al. (warning: pdf link). I have seen many names in the literature and machine learning libraries for this model. I’ll call it Weighted Regularized Matrix Factorization (WRMF) which tends to be a name used fairly often. WRMF is like the classic rock of implicit matrix factorization. It may not be the trendiest, but it will never go out of style. And, everytime I use it, I know that I’m guaranteed to like what I get out. Specifically, this model makes reasonable intuitive sense, it’s scalable, and, most importantly, I’ve found it easy to tune. There are much fewer hyperparameters than, say, stochastic gradient descent models.

If you recall from my post on Explicit Feedback Matrix Factorization, we had a loss function (without biases) that looked like:

$$L_{exp} = \sum\limits_{u,i \in S}(r_{ui} - \textbf{x}_{u}^{\intercal} \cdot{} \textbf{y}_{i})^{2} + \lambda_{x} \sum\limits_{u} \left\Vert \textbf{x}_{u} \right\Vert^{2} + \lambda_{y} \sum\limits_{u} \left\Vert \textbf{y}_{i} \right\Vert^{2}$$

where $r_{ui}$ is an element of the matrix of user-item ratings, $\textbf{x}_{u}$ ($\textbf{y}_{i}$) are user $u$'s (item $i$'s) latent factors, and $S$ was the set of all user-item ratings. WRMF is simply a modification of this loss function:

$$L_{WRMF} = \sum\limits_{u,i}c_{ui} \big( p_{ui} - \textbf{x}_{u}^{\intercal} \cdot{} \textbf{y}_{i} \big) ^{2} + \lambda_{x} \sum\limits_{u} \left\Vert \textbf{x}_{u} \right\Vert^{2} + \lambda_{y} \sum\limits_{u} \left\Vert \textbf{y}_{i} \right\Vert^{2}$$

Here, we are not summing of over elements of $S$ but instead over our entire matrix. Recall that with implicit feedback, we do not have ratings anymore; rather, we have users’ preferences for items. In the WRMF loss function, the ratings matrix $r_{ui}$ has been replaced with a preference matrix $p_{ui}$. We make the assumption that if a user has interacted at all with an item, then $p_{ui} = 1$. Otherwise, $p_{ui} = 0$.

The other new term in the loss function is $c_{ui}$. We call this the confidence matrix, and it roughly describes how confident we are that user $u$ does in fact have preference $p_{ui}$ for item $i$. In the paper, one of the confidence formulas that the authors consider is linear in the number of interactions. If we take $d_{ui}$ to be the number of times a user has clicked on an item on a website, then

$$c_{ui} = 1 + \alpha d_{ui}$$

where $\alpha$ is some hyperparameter determined by cross validation. In the case of the Sketchfab data, we only have binary “likes”, so $d_{ui} \in {0, 1}$

To take a step back, WRMF does not make the assumption that a user who has not interacted with an item does not like the item. WRMF does assume that that user has a negative preference towards that item, but we can choose how confident we are in that assumption through the confidence hyperparameter.

Now, I could go through the whole derivation in gory latex of how to optimize this algorithm á la my previous explicit MF post, but other people have already done this many times over. Here’s a great StackOverflow answer, or, if you like your derivations in Dirac notation, then checkout Sudeep Das’ post.

WRMF Libraries

There are a number of places to find open source code which implements WRMF. The most popular method of optimizing the loss function is through Alternating Least Squares. This tends be less tricky to tune than stochastic gradient descent, and the model is embarrassingly parallel.

The first code I saw for this algorithm was from Chris Johnson’s repo. This code is in python, nicely makes use of sparse matrices, and generally gets the job done. Thierry Bertin-Mahieux then took this code and parallelized it using the python multiprocessing library. This provides a decent speedup with no loss in accuracy.

The people at Quora came out with a library called qmf which is parellized and written in C++. I haven’t used it, but it is presumably faster than the multiprocessing python version. Lastly, Ben Frederickson went and wrote parallelized code in pure Cython here. This blows the other python versions out of the water in terms of performance and is somehow faster than qmf (which seems odd).

Anywho, I ended up using Ben’s library for this post because (1) I could stay in python, and (2) it’s super fast. I forked the library and wrote a small class to wrap the algorithm to make it easy to run grid searches and calculate learning curves. Feel free to check out my fork here, though I didn’t write any tests so use at your own risk :)

Massaging the data

Now that that’s out of the way, let’s train a WRMF model so we can finally recommend some Sketchfab models!