There's hardly a developer who doesn’t use GitHub. With all those stars, pulls, pushes and merges, GitHub has a plethora of data available describing the developer universe.

As a Data Scientist at Stream, my job is to develop recommender systems for our clients so that they can provide a better user experience for their customers. With that said, I wanted to see if I could build a recommendation system for a product that I use daily (similar to the tool I built for Instagram), as well as try out some new deep learning architectures and “big data” processing tools that I’ve been wanting to play with.

I chose to use Dask (which provides advanced parallelism for analytics, enabling performance at scale for the tools you love) for my “Big Data” processing needs. I like to think of it as out-of-core, parallel, Numpy, and Pandas. What’s not to love? For building the deep learning architectures, I decided to use PyTorch. PyTorch provides “Tensors and Dynamic neural networks in Python with strong GPU acceleration”. Its easy to use interface and superior debugging capabilities make PyTorch amazingly pleasant to work with.

In my mind, there were 3 main parts of building this recommender system:1) Downloading and processing data, 2) Building a recommender system,and 3) putting that system into a production environment.

First things first. Check out the demo!

Downloading and Processing Data

I know that data munging isn’t always fun or sexy. However, since it is such a large part of many data science and machine learning workflows, I wanted to go through how I handled processing over 600 Million events.

Downloading Data from the GitHub Archive

“In 2012, the community led project, GitHub Archive was launched, providing a glimpse into the ways people build software on GitHub, This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains activity data for more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths and the contents of the latest revision for 163 million files” [1].

That’s a good chunk of data to play with.This dataset includes over 20 different event types, everything from commits, comments, and starts for public event data. However,it does not contain private repo information (thankfully),no “‘view” or “clone” data. While this is awesome for privacy, it does provide a little bit of a handicap in providing recommendations, as viewing and cloning repos would provide excellent indications of interest. That said, there should still be plenty of interaction data to provide some cool insights. I ended up using data only created after 1/1/2017 to get at least a year of interaction data, and because it sounded like a nice number.

So… why not use only stars? It does seem like a great indicator of what a user is interested in, however, within both academia and industry, use of explicit data has fallen out of favor and the gold standard is to now use implicit data (any engagement event, that could signal a user’s interest in an activity). To keep it simple, we’re going to use all of the data that GitHub gives us and treat it as implicit data. That means that all 20+ events (including stars) will be treated the same as an implicit event. This ends up being about half a billion analytic events per year. Not too shabby. My computer, unfortunately, can’t store all that data into memory, which is where Dask comes in.

GitHub Archive updates once per hour and allows the end user to download .gz files of JSON data for each hour. After a bit of consideration, I ended up storing the data as parquet files, as it seemed like a natural fit, plus it’s the suggested file format for storing Dask Dataframes. It’s also almost as fast as HDF5 for reading into memory (just as fast with multiple cores) and compresses WAY better on disk.

The functions I used to iteratively download data can be seen below. For ongoing updates, I simply wrapped the “update_data” function into a simple cronjob that runs once a day before my model is re-trained.

https://gist.github.com/BalazsHoranyi/9d1e0b24fec5164b05542335792835c9

Processing Data

Alright, now that we have lots of data to play with, we need to do some serious preprocessing before we can dump it into any sort of model.

The end goal is to have to have an array of data with each interaction in a normalized integer form.

The workstation I was using has 32 cores and 64 GB of memory, however, the below should all be doable on a standard laptop. I did run it on my MacBook Pro with 16GB of memory and 8 cores. Some steps just took a lot longer due to the parallelization that Dask takes advantage of across multiple cores. One extremely nice thing about Dask is the monitoring that it provides on your tasks. An example visualization seen below is what helped me debug the following steps.

The steps were as follows:

Set up a local compute cluster for Dask, and define a computation graph to strip out user and repo names from JSON strings within a Dask Dataframe.

https://gist.github.com/BalazsHoranyi/b9b6a65f1bd15ba21806e544006a9b22

Turn Dask DataFrame into Dask array to take advantage of slicing capabilities and store to disk as Numpy stack to force freezing of current state of the computation.

https://gist.github.com/BalazsHoranyi/57156d25668a410e0ea0fb90750b6afe

Iterate through those stacks to find all unique repos and users to create user and item to id dictionaries. I was having some memory troubles on the final aggregation step using Dask to do the unique count entirely out of the core, so I ended up just iterating through it in chunks, then, mapping users and items and storing to disk one more time.

https://gist.github.com/BalazsHoranyi/89a72c051c4d25c66830b591f8c5e6e9

To reduce noise, repos with low engagement were removed from the training set (any repo with less than 50 associated interactions). I then mapped each user and item to a normalized index to get the format that we were striving for above.

https://gist.github.com/BalazsHoranyi/790a03440abffc2b8fa9a65ee7c43960

Whew, that was a lot of data munging. I know it’s not glamorous, but I end up spending a lot of my time doing this sort of work, so it seemed worth covering the plumbing instead of the just the cool shiny stuff. Now, on to the fun stuff!

Building a Recommender System Model

Collaborative filtering and matrix factorization approaches have been king in the recommender system space ever since the Netflix challenge. Today, sequence-based models have started to become increasingly prevalent. Fortunately, deep learning techniques can be applied to both.

Collaborative Filtering using Neural Matrix Factorization.

Neural Matrix Factorization is an approach to collaborative filtering introduced last year that tries to take advantage of some of the non-linearities the neural networks provides while keeping the generalization that matrix factorization provides. This is done by concatenating the two feature vectors extracted from a multilayer perceptron with an element-wise multiplication from item and user feature vectors.

A simple illustration can be seen below:

PyTorch was used as the building blocks for this network, and many ideas were taken from here:

https://github.com/maciejkula/spotlight and here: https://github.com/LaceyChen17/neural-collaborative-filtering

Making our network look like so:

https://gist.github.com/BalazsHoranyi/0b99c49dc275e3db305154b21a71a65b

Sequence-Based Models

Sequence-based models have been extremely popular in recommender systems lately. The main idea behind them is that instead of modeling a user as a unique identifier, users are modeled as their past x interactions. This provides a couple of very nice properties. New interactions on a user don’t need to trigger a new model rebuild to generate up-to-date recommendations, as it is all based on the past x item interactions. Additionally, they can generalize immediately to new users once they start clicking around. In situations like personalizing recommendations for e-commerce, where all the data you have is based on a single session, this is essential. Many of these ideas have been adapted from natural language processing where language models are used to predict the next character or word in a sentence.

A Mixture-of-tastes model tries to represent the diverse interests that users may have by trying to rank a user's interest in an item using separate taste vectors. It does this by representing these taste vectors as different feature maps using a CNN layer with a stride of one and a depth output of the number of taste vectors. For a nice breakdown of the CNN architectures, I highly recommend reading Convolutional Neural Networks for Visual Recognition

This idea seemed like it would work well here, as developers tend to have multiple interests. For instance, I’m mostly interested in machine learning related repos, however, I’m also interested in big data processing tools and backend web development, and it would be really nice if the model could somehow treat these as different subpopulations.

I highly recommend taking a look at the paper Maciej Kula wrote, as well as his implementation of the model. The only tweaks I made for my own model were to add some dropout layers and some ReLU activation functions between layers. This is the model that ended up being put into production for this project.