I spend so much free time on reddit. So much, that I decided what I really needed was a way to spend even more time browsing it.

So, naturally, I decided to download some reddit data and try to find some new subs. In doing so, I encountered a common problem with these sorts of recommendation situations. Often, recommendations are for what everybody does, rather than being personalized for me and my situation. Relevancy scoring is a simple way to avoid this problem.

Code

You can find the code on github, of course.

Demo: type a subreddit

This takes a few seconds to load, it needs to pull in the subreddit data.

Relevancy scoring

Let’s start with a really simple approach to the problem of finding new subreddits:

What other subreddits do people who comment on /r/programming most often comment on?

The answer is disappointing: they comment on the subs that everyone comments on. The top 5 subreddits most commented on are:

AskReddit

funny

pics

worldnews

todayilearned

Lame.

Scoring with probability

How about something a bit more refined?

For those who comment on /r/programming, which subreddits do they comment on unusually often?

This is called the relevancy score, and is simply P(comments on sub|comments on programming) / P(comments on sub) .

This doesn’t quite work either. The top 5 are:

learnbasketball

Hahahahaha

MozillaTech

VeganFood

Clear_Void

Now, I’ve never heard of any of these subs. The problem here is that we’re looking at cases with one or two comments, which means their scores are massive. The commonplace solution would be to add some “smoothing” to the denominator, making less-likely subs appear less often, but it’s not quite right.

Thinking about probability

I find that in my work, many, many questions in machine learning or data analysis can be solved by thinking of the problem probabalistically. In this case, we asked for some measure of how more relevant a subreddit is to a set of users, compared to the general baseline.

The mistake here is that we don’t have accurate probabilities. For these smaller subreddits, we estimate the probability of a visit to them by looking at the number of visits to them, but even small amounts of noise in visitation will change the results massively.

Instead, we need to ask:

What is the probaility of observing this level of commenting, by /r/programmer commenters, compared to baseline.

By re-phrasing as a probability, and including the possiblity of randomness in our measurements (comments), we’re able to solve the problem nicely!

I calculate how many standard deviations from the expected commenting level is the observed commenting level. In code:

def score_sub ( sub_id , given_sums , means ): """ sub_id: the ID of the subreddit of interest given_sums: the number of visits to different subs by people who visit /r/programming means: the global expected visit rate for different subs """ # prepare a binomial distribution, assuming the "global" case dist = stats . binom ( given_sums . sum (), means [ sub_id ]) # Find its average and std mu = dist . mean () sd = dist . std () # and calculate how many std-devs away our observed level of visitation is. dist = ( given_sums [ sub_id ] - mu ) / sd return dist

This works perfectly. The top 5 results are:

linux

ProgrammerHumor

compsci

Python

cpp

Other examples:

politics: technology, worldnews, news, science, atheism

funny: pics, AdviceAnimals, gifs, todayilearned, WTF

books: television, explainlikeimfive, booksuggestions, news, literature

Conclusion

This sort of relevancy scoring can be great for when you want simple recommendations, based on single items. They help work around the problem of popularity, where a recommendation tends to simply be what’s popular with everyone, and zoom in on the things that are particular to a group. At the same time, they aren’t overwhelmed by noise.