Problem: You are a web programmer. You have users. Your users read stuff on your site, but don’t rate it. You want to make a list of popular content that users might want to click. You need some sort of “score” to sort by.

WRONG SOLUTION #1: Score = Number of pageviews

Why it is wrong: If your site is old enough, you probably have content that was popular in the past, but that’s not relevant anymore.

WRONG SOLUTION #2: Score = Number of pageviews in the last x hours.

Why it is wrong: “Trending” is a better solution than absolute popularity, but it lacks serendipity. There’s no need to show content that users would probably find anyway. If something is trending, it’s probably already on your home page.

Intermission I: Implicit feedback

If users are rating your content, you should go read Evan Miller’s suggestion from 8 years ago, and not sort by average rating. That’s usually not the case though. More often than not, we have implicit feedback, instead of explicit ratings. What’s that?

Implicit feedback is user activity that can be used to indirectly infer user preferences, e.g. clicks, page views, purchase actions. Sometimes only positive feedback is known, e.g. the products customers have bought, but not the ones they have decided against.

It may sound hard to infer user preferences by implicit feedback, but it works very well in practice. The main reason is that it’s much easier to collect clicks and page views than explicit ratings. And if you have enough data, the noise in the input won’t matter much.

If you have a lot of registered users, you can (and should) build a recommender system with this data. But even if you build one, how do you recommend items (blog posts, articles, music, video, products, etc.) to new users if you know nothing about them?

This is called the cold start problem. When you have some information about the user, you can recommend items that are similar to what she has consumed before, but you can’t do that for new users. The simplest approach is to show “popular” content, but we can do much better than the naive solution.

The main trick is to track not only what people view, but also which options were presented to them (impressions) and which of those they decided to click on.
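As a sketch of what that tracking might look like (in-memory counters for illustration only; a real system would persist these in a database or an event stream):

```python
from collections import Counter

impressions = Counter()  # how many times each article was shown
clicks = Counter()       # how many times each article was clicked

def log_impression(article_id):
    impressions[article_id] += 1

def log_click(article_id):
    clicks[article_id] += 1
```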

WRONG SOLUTION #3: Score = CTR = clicks / impressions

If we divide the number of clicks by the number of impressions an item had, we have its CTR (click-through rate): the average number of clicks the item gets each time it’s recommended. Good items will have a high CTR.

Why it is wrong: The issue here is that it’s hard to estimate the CTR of an item if it has few impressions. Let’s say that one article was recommended one time and not clicked. Would you believe that the CTR is zero and that you should never recommend it again? Or if the item was shown once and clicked, would you recommend it to every user?
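A minimal sketch of the naive score, which makes the small-sample problem easy to see:

```python
def ctr(clicks, impressions):
    """Naive click-through rate: clicks / impressions."""
    if impressions == 0:
        return 0.0  # arbitrary choice for items never shown
    return clicks / impressions

# One impression and one click gives a perfect score of 1.0,
# while one impression and no click gives 0.0 forever:
ctr(1, 1)      # 1.0
ctr(0, 1)      # 0.0
ctr(90, 1000)  # 0.09 -- almost certainly a better article than the one above
```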

Disclaimer: CTR is not the only and probably not the best metric to evaluate recommendations quality. We don’t want to show click-bait content that users don’t actually like, but this is out of the scope of this article.

Let’s get back to our problem. How can we get a reliable estimate of an item’s CTR?

WRONG SOLUTION #4: Score = LCB(CTR)

LCB here stands for Lower Confidence Bound. It means the minimum CTR that the item is expected to have with a degree of confidence that you can specify. To calculate that, we can use the Wilson Score, from Evan Miller’s suggestion again. We just need to consider every click as a “thumbs up” and every impression that was not clicked as a “thumbs down”. It takes into account the uncertainty of the CTR when we have few examples.
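A sketch of the Wilson score lower bound in Python (z = 1.96 corresponds to 95% confidence; clicks play the role of positive votes, non-clicked impressions of negative ones):

```python
import math

def wilson_lower_bound(clicks, impressions, z=1.96):
    """Lower bound of the Wilson score interval for the CTR.

    z=1.96 gives a 95% confidence level. Each click counts as a
    thumbs up, each non-clicked impression as a thumbs down.
    """
    if impressions == 0:
        return 0.0
    n = impressions
    p = clicks / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)
```

Note that `wilson_lower_bound(1, 1)` is about 0.21, not 1.0: a single click is not enough evidence for a high score.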

Why it is wrong: This solution works fine in practice, but it’s still wrong. It’s good at exploiting articles that have a good CTR, but it does not explore new articles. Maybe a new article will have a better CTR, but we can get stuck showing a suboptimal one because we don’t have a good estimate for the others.

If you understood that, congratulations! This is the essential concept behind Multi-Armed Bandits theory. You already know enough terminology to discuss MAB at a cocktail party in Las Vegas. But what is MAB?

Intermission II: Multi-Armed Bandits

[Image: a row of slot machines. By Yamaguchi先生 at the English language Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=57295504]

The Wikipedia definition explains where this odd name comes from:

The multi-armed bandit problem is a problem in which a gambler at a row of slot machines (sometimes known as “one-armed bandits”) has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.

MABs are best known in the industry as an alternative to A/B tests, as a way of picking the best variant of your website. In our case, we can see each article as a different arm, and each click as a reward. Some articles will give a better reward because they have a better average CTR.

GOOD SOLUTION #1: Score = Random for some users, LCB for the rest

If we choose a small fraction (epsilon, ε) of the users and show random articles to them, we will be able to explore all the items, while still showing the best for most of them. This is called ε-greedy and works very well, but it’s hard to pick a good value for ε. If it’s too small, we won’t be able to explore all items. If it’s too large, too many users will receive poor recommendations.
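An ε-greedy policy fits in a few lines (a sketch; the `score` function here could be the LCB from before, and the names are illustrative):

```python
import random

def pick_article(articles, score, epsilon=0.1):
    """epsilon-greedy: with probability epsilon, explore a random
    article; otherwise exploit the article with the highest score."""
    if random.random() < epsilon:
        return random.choice(articles)
    return max(articles, key=score)
```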

GOOD SOLUTION #2: Score = UCB(CTR)

If we use the UCB (Upper Confidence Bound) instead of the LCB, we’ll have a very good MAB policy. While the LCB gives a minimum CTR estimate for the item, UCB returns the maximum CTR that the item can have, given the desired confidence.

It is able to try new arms until it has a reasonable estimate of the CTR, and will recommend the best articles more often. We say that it has a low regret. That’s essentially how many clicks it loses compared to the best possible policy, which always recommends the best arms. One problem is that it might not work very well if you are not updating the clicks and impressions in real time. We can do even better than that.
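The UCB score is the same Wilson interval as before, taking the upper end instead of the lower. A sketch, with the optimistic convention that an item with no impressions gets the maximum possible CTR so it is guaranteed to be tried:

```python
import math

def wilson_upper_bound(clicks, impressions, z=1.96):
    """Upper bound of the Wilson score interval (z=1.96 ~ 95% confidence)."""
    if impressions == 0:
        return 1.0  # optimism in the face of uncertainty
    n = impressions
    p = clicks / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre + margin) / (1 + z * z / n)
```

An article shown once without a click still gets an upper bound around 0.79, so it keeps being shown until the estimate narrows.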

BEST SOLUTION: score = beta(clicks + 1, impressions - clicks + 1)

Also known as Thompson Sampling.

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods.

The quote above is from the article Thompson Sampling for Contextual Bandits with Linear Payoffs. Xavier Amatriain, a former machine learning researcher who went on to lead the recommender systems team at Netflix, and later Quora, said the same thing in this episode of the podcast TWiML:

There’s a lot of literature about Multi-Armed Bandits, but there’s only one that works in practice. I don’t know if I want to give that away (…) It’s pretty well known in the industry that Thompson Sampling is the easiest and most practical approach to MAB.

Sounds good, right? If you look at the Wikipedia article about Thompson Sampling, though, the math can be very intimidating. Fortunately, it’s actually very simple to implement and understand.

So, what’s Thompson Sampling? It’s named after William R. Thompson, who had this idea in 1933. It’s called sampling because for each arm, we pick (sample) a random CTR from our current belief of its probability distribution, and then we sort the articles based on the sampled CTR.

But what’s the probability distribution for the CTR? When we used LCB and UCB we were acknowledging that we don’t know what the exact CTR value is, so we make an educated guess of its value. The probability distribution is just a function that returns the odds of the CTR being any particular value.

And what does it look like? Well, for a new article, we have no idea what the CTR is. It could be any number from 0 to 1. That’s called a uniform distribution. But every time we show an article, it’s like flipping an unfair coin. Let’s say that if it comes up heads, we get a click. Every time we toss it and see the result, we can update our belief about the probability of heads, i.e. the article’s CTR.

Here is an example of how this distribution (named Beta) looks after we do some tosses, depending on how many heads we get:

This idea of acknowledging that we don’t know everything and revising our beliefs in light of new information is the main idea behind what’s called Bayesian statistics. If you want to learn more about this, I recommend the book Probabilistic Programming & Bayesian Methods for Hackers. It’s not as trendy as deep learning, but it’s still interesting and useful. And it’s another great topic you can discuss at your cocktail party!

So, what does it mean to use score = beta(clicks + 1, impressions - clicks + 1)? We are just getting a random CTR, sampled from the beta distribution. We then sort the articles according to this randomized score. Articles that have a good CTR will be shown (exploited) more often, but we will also explore new items as they come along. It’s amazing how simple it is and how good the results are.
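The whole policy fits in a couple of lines of Python (the article statistics here are illustrative):

```python
import random

def thompson_score(clicks, impressions):
    """Sample a plausible CTR from the Beta posterior.

    Beta(1, 1) is the uniform prior; each click adds 1 to the first
    parameter, each non-clicked impression adds 1 to the second.
    """
    return random.betavariate(clicks + 1, impressions - clicks + 1)

# (clicks, impressions) per article -- made-up numbers:
stats = {"a": (90, 1000), "b": (3, 20), "c": (0, 0)}

# Re-rank on every request with fresh samples:
ranked = sorted(stats, key=lambda a: thompson_score(*stats[a]), reverse=True)
```

Article “c” has no data, so its sampled CTR is uniform on (0, 1) and it will occasionally outrank the others: that’s the exploration.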

A final test to make sure you understood all the concepts: You should find this cartoon funny. :)

Thanks to Hugo Lopes Tavares, Leandro Moreira, Daniel Martins, Flávio Ribeiro, Rafael Carício, and Igor Sobreira for reading early drafts of this article.