Update 7/8/2019: Upgraded to PyTorch version 1.0. Removed now-deprecated Variable framework

Update 8/4/2020: Added missing optimizer.zero_grad() call. Reformatted code with black

Hey, remember when I wrote those ungodly long posts about matrix factorization chock-full of gory math? Good news! You can forget it all. We have now entered the Era of Deep Learning, and automatic differentiation shall be our guiding light.

Less facetiously, I have finally spent some time checking out these new-fangled deep learning frameworks, and damn if I am not excited.

In this post, I will show you how to use PyTorch to bypass the mess of code from my old post on Explicit Matrix Factorization and instead implement a model that will converge faster in fewer lines of code.

But first, let’s take a trip down memory lane.

The Dark Ages

If you recall from the original matrix factorization post, the key to the derivation was calculus. We defined a loss function which was the mean squared error (MSE) loss between the matrix factorization “prediction” and the actual user-item ratings. In order to minimize the loss function, we of course took the derivative and set it equal to zero.

For the Stochastic Gradient Descent (SGD) derivation, we iterated through each sample in our dataset and took the derivative of the loss function with respect to each free “variable” in our model, which were the user and item latent feature vectors. We then used that derivative to smartly update the values for the latent feature vectors as we surfed down the loss function in search of a minimum.

I just described what we did in two paragraphs, but it took much longer to actually derive the correct gradient updates and code the whole thing up. The worst part was that, if we wanted to change anything about our prediction model, such as adding regularization or a user bias term, then we had to re-derive all of the gradient updates by hand. And if we wanted to play with new techniques, like dropout? Oof, that sounds like a pain to derive!
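To give a feel for it, here is a minimal numpy sketch of a hand-derived SGD step for plain, unregularized matrix factorization on a single rating. This is not the code from the original post; the array names, sizes, and learning rate are all made up for illustration.

```python
import numpy as np

# Hypothetical toy setup: tiny latent factor matrices with small random values.
rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 5, 3
user_vecs = rng.normal(scale=0.1, size=(n_users, n_factors))
item_vecs = rng.normal(scale=0.1, size=(n_items, n_factors))


def sgd_step(u, i, rating, learning_rate=0.05):
    """Apply one hand-derived SGD update for a single (user, item, rating)."""
    error = rating - user_vecs[u] @ item_vecs[i]
    # Gradients of the squared error (rating - prediction)^2, derived by hand.
    # Both are computed from the old vectors before either one is updated.
    user_grad = -2 * error * item_vecs[i]
    item_grad = -2 * error * user_vecs[u]
    user_vecs[u] -= learning_rate * user_grad
    item_vecs[i] -= learning_rate * item_grad
    return error**2


# Repeatedly fit one made-up rating and track the squared error.
losses = [sgd_step(u=0, i=1, rating=4.0) for _ in range(50)]
```

Note that the two gradient lines are tied to this exact prediction and loss. Adding a bias term or regularization would mean re-deriving and editing them by hand.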

The Enlightenment

The beauty of modern deep learning frameworks is that they utilize automatic differentiation. The phrase means exactly what it sounds like. You simply define your model for prediction, your loss function, and your optimization technique (vanilla SGD, Adagrad, Adam, etc.), and the computer will automatically calculate the gradients and optimize your model.
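As a small illustration of what automatic differentiation buys us, the sketch below lets PyTorch compute the gradient of a squared error for a single made-up rating and compares it against the hand-derived expression. The vector size, seed, and rating value are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)

# Hypothetical latent vectors for one user and one item.
n_factors = 4
user_vec = torch.randn(n_factors, requires_grad=True)
item_vec = torch.randn(n_factors, requires_grad=True)
rating = torch.tensor(3.0)

# Define the prediction and the loss; no derivatives written by hand.
prediction = torch.dot(user_vec, item_vec)
loss = (prediction - rating) ** 2

# Autograd fills in user_vec.grad and item_vec.grad.
loss.backward()

# The hand-derived gradients, for comparison only.
with torch.no_grad():
    error = prediction - rating
    manual_user_grad = 2 * error * item_vec
    manual_item_grad = 2 * error * user_vec
```

If the prediction changes (say, a bias term is added), only the `prediction` line changes; `loss.backward()` stays the same.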

Additionally, your models can often be easily parallelized across multiple CPUs or turbo-boosted on GPUs with little effort compared to, say, writing Cython code and parallelizing by hand.
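For a rough sense of how little code the GPU part takes, here is a sketch that moves an embedding layer and its inputs to a GPU when one is available and falls back to the CPU otherwise. The layer sizes and indices are placeholders.

```python
import torch

# Pick the GPU if PyTorch can see one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A hypothetical embedding layer: 10 rows, 4 latent factors each.
embedding = torch.nn.Embedding(10, 4).to(device)

# Inputs must live on the same device as the layer.
ids = torch.tensor([0, 3, 7], device=device)
vectors = embedding(ids)
```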

Minimum Viable Matrix Factorization

We’ll walk through the three steps to building a prototype: defining the model, defining the loss, and picking an optimization technique. The latter two steps are largely built into PyTorch, so we’ll start with the hardest first.

Model

All models in PyTorch subclass from torch.nn.Module, and we will be no different. For our purposes, we only need to define our class and a forward method. The forward method will simply be our matrix factorization prediction, which is the dot product between a user and item latent feature vector.

In the language of neural networks, our user and item latent feature vectors are called embedding layers, which are analogous to the typical two-dimensional matrices that make up the latent feature vectors. We’ll define the embeddings when we initialize the class, and the forward method (the prediction) will involve picking out the correct rows of each of the embedding layers and then taking the dot product.
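Before wrapping anything in a class, it may help to see the two building blocks on their own: an embedding lookup is just row selection from a weight matrix, and the prediction is a row-wise dot product. The sizes and index values below are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 5 users, 7 items, 3 latent factors.
n_users, n_items, n_factors = 5, 7, 3
user_factors = torch.nn.Embedding(n_users, n_factors)
item_factors = torch.nn.Embedding(n_items, n_factors)

# A batch of two (user, item) pairs: (0, 1) and (2, 6).
users = torch.tensor([0, 2])
items = torch.tensor([1, 6])

# Calling an embedding with indices picks out the matching rows.
user_rows = user_factors(users)  # shape: (2, n_factors)
item_rows = item_factors(items)  # shape: (2, n_factors)

# Row-wise dot product: multiply elementwise, then sum over the factor axis.
predictions = (user_rows * item_rows).sum(dim=1)  # shape: (2,)
```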

Thankfully, many of the methods that you have come to know and love in numpy are also present in the PyTorch tensor library. When in doubt, I treat things like numpy and usually get 90% there.
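Here is a tiny side-by-side of what I mean, using a made-up array. The same row-wise sum of squares works in both libraries; the main difference in this example is that PyTorch conventionally calls the axis argument `dim`.

```python
import numpy as np
import torch

# The same small array in both libraries.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)

# Row-wise sum of squares, numpy style and PyTorch style.
np_result = (a * a).sum(axis=1)
torch_result = (t * t).sum(dim=1)
```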

We’ll also make up some fake ratings data for playing with in this post.