Deriving the Reddit Formula

By Evan Miller

July 13, 2015

A few things about Reddit’s hot formula have always bothered me. The formula has obviously been a success when it comes to setting the Internet on fire, but I have to wonder:

Where do the seemingly arbitrary constants come from, and how do they affect the rankings? Why doesn’t the current time appear in the calculation? Why is there a logarithm? What’s with taking the absolute value of s?

I decided I would try to derive a “hot” formula from expected-utility theory to see if I could shed any light on the Reddit formula. Much to my surprise, the formula I came up with is remarkably similar to Reddit’s. It explains the origin of the strange constants, the presence of the logarithm, and the absence of the current time. One interesting feature of the Reddit formula is that, according to my analysis, the formula is optimized for users who visit the site every 5.43 hours. (Who knew?)

The derivation also points to an expression I believe is missing from the Reddit formula, one that would help bring highly-rated items to the top. At the end of this post I’ll propose an amendment to the existing formula — which, before you get too excited, comes with important caveats — but in the meantime, let’s dig into some math.

Reddit Utility Theory

Suppose a Reddit visitor loads the Front Page of the Internet, on average, every \(S\) seconds. Let’s assume that a given Reddit visitor reloads the page basically at random, but maintains an average reload rate throughout the day of \(\lambda = 1/S\). (If \(S=120\), the reload rate is \(\lambda = 1/120\) reloads per second.)

When a Reddit visitor sees a story, she might have four reactions:

1. It’s a new story that she likes
2. It’s a new story that she dislikes
3. It’s an old (previously seen) story that she likes
4. It’s an old story that she dislikes

Let’s associate a utility payoff with each of these scenarios:

\(a = \) It’s a new story that she likes
\(b = \) It’s a new story that she dislikes
\(c = \) It’s an old (previously seen) story that she likes
\(d = \) It’s an old story that she dislikes

To determine the relative probability of these four events, we need to know two things:

1. What’s the probability that a story will be liked?
2. What’s the probability that a story is new (to a random visitor)?

For now let’s call these two probabilities \(p\) (probability a story is liked) and \(q\) (probability a story is new). To make things easier we can make a table of the four possible events with their utility payoffs and probabilities:

| Story is liked? | Story is new? | Probability | Payoff |
|---|---|---|---|
| Liked | New | \(pq\) | \(a\) |
| Disliked | New | \((1-p)q\) | \(b\) |
| Liked | Old | \(p(1-q)\) | \(c\) |
| Disliked | Old | \((1-p)(1-q)\) | \(d\) |

The expected utility of a story in terms of the above probabilities and payoffs is then:

\[ u(p, q) = pq \times a + (1-p)q \times b + p(1-q) \times c + (1-p)(1-q) \times d \]
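The expected-utility expression above translates almost directly into code. Here is a minimal sketch; the function name and the example payoffs are illustrative assumptions, not values taken from Reddit.

```python
def expected_utility(p, q, a, b, c, d):
    """Expected utility of showing one story.

    p: probability the story is liked
    q: probability the story is new to the visitor
    a, b, c, d: payoffs for (liked, new), (disliked, new),
                (liked, old), (disliked, old)
    """
    return (p * q * a
            + (1 - p) * q * b
            + p * (1 - q) * c
            + (1 - p) * (1 - q) * d)


# Illustrative payoffs: +1 for a new liked story, -1 for a new disliked one,
# nothing for stories the visitor has already seen.
print(expected_utility(p=0.8, q=0.5, a=1, b=-1, c=0, d=0))  # about 0.3
```

The four products inside the sum correspond, in order, to the four rows of the probability table.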

Now we just need to come up with expressions for \(p\) and \(q\). We can use the existing votes on an item to form a belief about what \(p\) might be. I’m simplifying a bit, but if we use the number of upvotes \(U\) and downvotes \(D\), we can come up with an expression for the expectation of \(p\):

\[ E[p] = \frac{U+1}{U+D+2} \]

(The +1 and +2 come in when you formulate \(p\) as a random Bayesian parameter from a beta distribution; see Bayesian Average Ratings for more on this. Anyway, it’s not that important, it’s just the percent of votes that are upvotes after you stuff an upvote and a downvote into the ballot box.)
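To see what the ballot-stuffing does in practice, here is a small sketch of the estimate. The function name is made up for illustration.

```python
def expected_like_probability(ups, downs):
    """Posterior mean of p: the upvote share after adding one
    imaginary upvote and one imaginary downvote."""
    return (ups + 1) / (ups + downs + 2)


print(expected_like_probability(0, 0))    # no votes yet: 0.5
print(expected_like_probability(1, 0))    # one upvote: 2/3, not 100%
print(expected_like_probability(100, 0))  # many upvotes: close to 1
```

With few votes the estimate stays near one half; it only approaches the raw upvote share as votes accumulate.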

The value of \(q\) is more interesting. Remember I said Reddit visitors reload the page at random? This is called a Poisson process with an arrival rate (reload rate) of \(\lambda\). One nice property of Poisson processes is that there’s a simple formula for describing the time between events (reloads). The probability that it has been at least \(s\) seconds since the last reload is:

\[ q = e^{-\lambda s} \]

If a story is \(t\) seconds old, then \(e^{-\lambda t}\) is the probability a random visitor hasn’t seen it before. Equipped with this amazing fact, we can write out an expectation for \(u\) in terms of upvotes \(U\), downvotes \(D\), and story age \(t\):
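If you would rather not take the Poisson property on faith, a quick simulation can check it. The sketch below draws random gaps between reloads and counts how often a gap is at least as long as the story's age; the function names, trial count, and seed are arbitrary choices for illustration.

```python
import math
import random


def prob_unseen(age_seconds, reload_interval_seconds):
    """Closed form: exp(-lambda * t), with lambda = 1 / S."""
    lam = 1.0 / reload_interval_seconds
    return math.exp(-lam * age_seconds)


def simulated_prob_unseen(age_seconds, reload_interval_seconds,
                          trials=100_000, seed=0):
    """Fraction of simulated reload gaps lasting at least age_seconds."""
    rng = random.Random(seed)
    lam = 1.0 / reload_interval_seconds
    long_gaps = sum(1 for _ in range(trials)
                    if rng.expovariate(lam) >= age_seconds)
    return long_gaps / trials


# A story as old as the average reload interval is new to about 37% of visitors.
print(prob_unseen(120, 120))
print(simulated_prob_unseen(120, 120))
```

The two printed numbers should agree to about two decimal places.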

\[ E[u(U, D, t)] = \frac{U+1}{U+D+2} e^{-\lambda t} a + \frac{D+1}{U+D+2}e^{-\lambda t} b + \frac{U+1}{U+D+2}(1-e^{-\lambda t})c + \frac{D+1}{U+D+2}(1-e^{-\lambda t}) d \]

Now we get to the fun part. Let’s assume that we get no utility for old stories (\(c = d = 0\)), utility of \(+1\) for a new story we like (\(a = 1\)), and utility of \(-1\) for a new story we dislike (\(b = -1\)). Plugging these values in, we get:

\[ E[u(U, D, t)] = \frac{U+1}{U+D+2}e^{-\lambda t} - \frac{D+1}{U+D+2}e^{-\lambda t} \]

Or more simply:

\[ E[u] = \frac{U-D}{U+D+2}e^{-\lambda t} \]
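As a sanity check on the algebra, the following sketch evaluates the full four-term expectation with \(a = 1\), \(b = -1\), \(c = d = 0\) and compares it with the simplified form. Function names and sample inputs are assumptions made for the example.

```python
import math


def expected_utility_full(ups, downs, age, lam, a=1.0, b=-1.0, c=0.0, d=0.0):
    """Four-term expectation using the vote-based estimate of p."""
    total = ups + downs + 2
    p_like = (ups + 1) / total
    p_dislike = (downs + 1) / total
    q_new = math.exp(-lam * age)
    return (p_like * q_new * a
            + p_dislike * q_new * b
            + p_like * (1 - q_new) * c
            + p_dislike * (1 - q_new) * d)


def expected_utility_simple(ups, downs, age, lam):
    """Simplified form: (U - D) / (U + D + 2) * exp(-lambda * t)."""
    return (ups - downs) / (ups + downs + 2) * math.exp(-lam * age)


for ups, downs, age in [(10, 2, 0), (50, 49, 3600), (3, 8, 600)]:
    full = expected_utility_full(ups, downs, age, lam=1 / 120)
    simple = expected_utility_simple(ups, downs, age, lam=1 / 120)
    print(ups, downs, age, math.isclose(full, simple, abs_tol=1e-12))
```

The simplification works because the two numerators, \(U+1\) and \(D+1\), differ by exactly \(U-D\).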

Look familiar? No? How about if we take a logarithm, which won’t affect relative ranks:

\[ \begin{equation} \label{ln_utility} \ln E[u] = \ln{(U-D)} - \ln{(U+D+2)} - \lambda t \end{equation} \]

Now compare that to the Reddit formula:

\[ f(t_s, y, z) = y\log_{10}{z} + \frac{t_s}{45000} \]

They’re similar, and yet… different.

Anatomy of the Reddit Formula

Here’s the original statement of the Reddit formula, modified to reflect a recent change:

Given the time the entry was posted \(A\) and the time of 7:46:43 a.m. December 8, 2005 \(B\), we have \(t_s\) as their difference in seconds

\[ t_s = A - B \]

and \(x\) as the difference between the number of up votes \(U\) and the number of down votes \(D\)

\[ x = U - D \]

where \(y \in \{-1, 0, 1\}\)

\[ y = \left\{ \begin{array}{lr} 1 & : x \gt 0 \\ 0 & : x = 0 \\ -1 & : x \lt 0 \end{array} \right. \]

and \(z\) as the maximal value of the absolute value of \(x\) and \(1\)

\[ z = \left\{ \begin{array}{lr} |x| & : |x| \ge 1 \\ 1 & : |x| \lt 1 \end{array} \right. \]

we have the rating as a function \(f(t_s, y, z)\)

\[ f(t_s, y, z) = y\log_{10}z + \frac{t_s}{45000} \]
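For readers who prefer code to piecewise notation, here is a plain-Python sketch of the statement above. It is a transcription of the formula as quoted, not Reddit's actual source; the names `reddit_hot` and `REDDIT_EPOCH` are mine, and I am assuming the reference time is in UTC, which agrees with the Unix timestamp 1134028003.

```python
import math
from datetime import datetime, timezone

# Reference time B: 7:46:43 a.m. December 8, 2005 (assumed to be UTC)
REDDIT_EPOCH = datetime(2005, 12, 8, 7, 46, 43, tzinfo=timezone.utc)


def reddit_hot(ups, downs, posted_at):
    """Hot score for a story; posted_at must be a timezone-aware datetime."""
    t_s = (posted_at - REDDIT_EPOCH).total_seconds()
    x = ups - downs
    y = (x > 0) - (x < 0)   # sign of x: 1, 0, or -1
    z = max(abs(x), 1)      # keeps the logarithm defined when x == 0
    return y * math.log10(z) + t_s / 45000


posted = datetime(2015, 7, 13, 12, 0, 0, tzinfo=timezone.utc)
print(reddit_hot(120, 20, posted))
print(reddit_hot(20, 120, posted))  # same votes reversed: lower score
```

Note that every tenfold increase in the net vote count buys the same boost as being posted 45,000 seconds (12.5 hours) later.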

There are a lot of moving parts there, so to simplify things, let’s assume \(U \gt D\). The Reddit formula then reduces down to:

\[ f(A, U, D) = \log_{10}{(U-D)} + \frac{A-B}{45000} \]

Because the rank values are relative, we can add or subtract constants, and multiply or divide \(f\) by positive numbers, without changing the order of ranked items. So we’ll do a few rank-preserving transformations on \(f\) to see if we can beat it into formula \(\eqref{ln_utility}\). First divide by \(\log_{10}{(e)}\):

\[ f = \frac{\log_{10}{(U-D)}}{\log_{10}(e)} + \frac{A-B}{45000\log_{10}(e)} \]

If you remember your Algebra II classes (and have a TI-83 handy) you know that the above reduces to:

\[ f = \ln{(U-D)} + \frac{A-B}{19543} \]

Now we’re going to subtract an innocuous constant, the time between now (time \(NOW\)) and Reddit’s birthday \(B\), divided by the magic number 19543:

\[ f = \ln{(U-D)} + \frac{A-B}{19543} - \frac{NOW-B}{19543} \]

And reduce:

\[ f = \ln{(U-D)} + \frac{A-NOW}{19543} \]

Of course, the age of the story \(t\) is equal to \(NOW - A\), so we can rewrite this as:

\[ f = \ln{(U-D)} - \frac{t}{19543} \]

(This resolves one of the original mysteries — why the current time doesn’t appear in Reddit’s formula. It turns out you can formulate the rankings using Reddit’s birthday instead of the current time, without affecting the relative positions of stories. Rank values don’t shrink over time — instead, new stories enter the world with higher rank values than their predecessors! Born on third base, the little brats.)
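As a rough numerical check of these transformations, the sketch below ranks a few made-up stories two ways: once with the birthday-based score and once with the age-based score. The story data and the value of `NOW` are invented for the example, and I use the unrounded constant \(45000 \log_{10}(e)\) rather than 19543 so the two orderings match exactly.

```python
import math

B = 1134028003                        # Reddit's reference time, in Unix seconds
SCALE = 45000 * math.log10(math.e)    # roughly 19543
NOW = B + 300_000_000                 # hypothetical current time


def birthday_score(ups, downs, posted):
    """Reddit-style score for ups > downs: uses B, never the current time."""
    return math.log10(ups - downs) + (posted - B) / 45000


def age_score(ups, downs, posted, now):
    """Transformed score: natural log of net votes minus scaled story age."""
    return math.log(ups - downs) - (now - posted) / SCALE


# name -> (ups, downs, posted); all values are invented
stories = {
    "a": (120, 20, NOW - 3600),
    "b": (15, 2, NOW - 600),
    "c": (5000, 1000, NOW - 86400),
    "d": (40, 39, NOW - 60),
}

by_birthday = sorted(stories, key=lambda k: birthday_score(*stories[k]), reverse=True)
by_age = sorted(stories, key=lambda k: age_score(*stories[k], NOW), reverse=True)
print(by_birthday)
print(by_age)
```

Both lists come out in the same order, because the second score is just the first one divided by a positive constant and shifted by a term that is the same for every story.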

Anyway, setting \(\lambda = 1/19543\), we have:

\[ f = \ln{(U-D)} - \lambda t \]

Which is remarkably close to the formula \(\eqref{ln_utility}\) we derived above. That is, whether the site designers knew it or not, the Reddit “hot” formula is close to being an optimal embodiment of utility theory under a fixed-payoff, random-reload model. Incidentally, the value of \(\lambda\) here implies that the Reddit formula is optimized for visitors who load the site every 19,543 seconds, or once every 5.43 hours. (If you load the Reddit home page more often than this, you’re probably wasting your time!)
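If you don’t have a TI-83 handy, the two magic numbers are easy to reproduce. This is just arithmetic on the constant 45000 from the Reddit formula; the variable names are mine.

```python
import math

# Reddit's time divisor, converted from a base-10 to a natural-log scale
inverse_lambda_seconds = 45000 * math.log10(math.e)
reload_interval_hours = inverse_lambda_seconds / 3600

print(round(inverse_lambda_seconds))    # 19543
print(round(reload_interval_hours, 2))  # 5.43
```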

But there are a couple of complications here: one is that the log of \((U-D)\) doesn’t exist when \(U\le D\), which is why the Reddit formula has all the weird stuff with \(y\) and \(z\). The other is that the Reddit formula, as implemented, is missing a term.

The Missing Term

Even after applying all kinds of order-preserving transformations, the Reddit formula is still missing a term compared to the formula we derived using utility theory:

\[ \ln E[u] = \ln{(U-D)} - \ln{(U+D+2)} - \lambda t \]

The missing term of course is:

\[ -\ln{(U+D+2)} \]

The term makes more sense in terms of the original (non-logarithmed) utility formula:

\[ E[u] = \frac{U-D}{U+D+2}e^{-\lambda t} \]

The above formula divides the difference in upvotes and downvotes by the total number of votes. That is, it treats upvotes and downvotes in terms of proportions — and brings highly-rated items to the top, similar to how Reddit’s comment-sorting algorithm works. Without that divisor, the Reddit formula is highly biased in favor of items that receive a lot of votes, even if a large proportion of those votes are negative.
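A small made-up example shows the bias. Below, a contested story with many votes is compared against a smaller story that everyone liked, both the same age. The vote counts are invented, and both score functions are sketches that assume more upvotes than downvotes.

```python
import math

LAM = 1 / 19543


def difference_score(ups, downs, age):
    """Reddit-style log score (after the rank-preserving transformations)."""
    return math.log(ups - downs) - LAM * age


def proportion_score(ups, downs, age):
    """Log of the derived expected utility, including the extra term."""
    return math.log(ups - downs) - math.log(ups + downs + 2) - LAM * age


contested = (1000, 900)   # net +100, but barely over half positive
well_liked = (60, 0)      # net +60, all positive
age = 3600                # both stories are one hour old

print(difference_score(*contested, age) > difference_score(*well_liked, age))
print(proportion_score(*contested, age) > proportion_score(*well_liked, age))
```

The first comparison prints `True` and the second `False`: the difference-only score puts the contested story on top, while the proportional score prefers the well-liked one.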

I noted before that the Reddit formula’s \(y\) and \(z\) terms were a bit weird. They’re a workaround for the fact that \(\ln(U-D)\) doesn’t exist for \(U\le D\). The workaround has been the subject of controversy before, which I won’t rehash, but I will point out that there’s a way to render the \(y\) and \(z\) terms unnecessary. If we return to Reddit Utility Theory and set the utility of a new, disliked story to zero (that is, \(b=0\)), the expected utility becomes:

\[ E[u] = \frac{U+1}{U+D+2}e^{-\lambda t} \]

Which has a nice logarithmic form that — happily — exists for all values of \(U\) and \(D\):

\[ \ln E[u] = \ln{(U+1)} - \ln{(U+D+2)} -\lambda t \]

(The logarithmic form makes the servers happy, since the exponential form tends to drive the result to zero or to infinity quickly.)

If I may make so bold as to adapt this formula to the Reddit source code, I would write:

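    from libc.math cimport log  # natural logarithm from the C math library

    cpdef double _hot(long ups, long downs, double date):
        """The hot formula. Should match the equivalent function in postgres."""
        # Seconds elapsed since Reddit's reference time (7:46:43 a.m. December 8, 2005)
        seconds = date - 1134028003
        # Log of the expected like-probability plus an age term (19543 is about 45000 * log10(e))
        return round(log(ups + 1) - log(ups + downs + 2) + seconds / 19543, 7)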

If you wanted to be risk-averse about promoting items with a small number of votes, you might increase the +1 and +2 terms to be larger numbers (which, in the Bayesian formulation, correspond to stronger prior beliefs about items).
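Here is one way that adjustment might look in plain Python. The parameters `prior_ups` and `prior_downs` are hypothetical names for the imaginary votes stuffed into the ballot box; with both set to 1 the function reduces to the formula proposed above.

```python
import math

REDDIT_EPOCH = 1134028003  # reference time in Unix seconds


def hot_with_prior(ups, downs, date, prior_ups=1, prior_downs=1):
    """Proposed hot score with an adjustable prior (a sketch, not Reddit code)."""
    seconds = date - REDDIT_EPOCH
    liked = math.log(ups + prior_ups)
    total = math.log(ups + downs + prior_ups + prior_downs)
    return round(liked - total + seconds / 19543, 7)


date = 1436745600          # both stories posted at the same (arbitrary) time
small = (10, 0)            # 10 votes, all positive
large = (300, 30)          # 330 votes, about 91% positive

# Weak prior: the tiny unanimous story edges out the large one.
print(hot_with_prior(*small, date) > hot_with_prior(*large, date))
# Stronger prior (10 imaginary upvotes, 10 imaginary downvotes): the order flips.
print(hot_with_prior(*small, date, 10, 10) > hot_with_prior(*large, date, 10, 10))
```

The first line prints `True` and the second `False`, which is the risk-averse behavior described above: a handful of votes is no longer enough to outrank a story with a long track record.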

Choosing Lambda

Should we keep using the same value of \(\lambda\) that Reddit currently has? I’m not sure. In addition to representing a reload rate, we can think of \(\lambda\) as a tradeoff between story age and the percent of votes that are positive. Under the new formulation, there’s a straightforward relationship between the two. Writing the fraction of positive votes as \(p\), we can calculate the substitution rate between \(p\) and \(t\) required to maintain a constant score:

\[ \frac{dE[u]/dt}{dE[u]/dp} = \frac{-\lambda p e^{-\lambda t}}{e^{-\lambda t}} = -\lambda p = \frac{-\Delta p}{\Delta t} \] \[ \implies 1/\lambda = \frac{\Delta t}{\Delta p / p} \]

That is, if we decide an hour of age is equal to a five percent drop in \(p\), we would choose:

\[ 1/\lambda = \frac{3600}{0.05} = 72{,}000 \]

Or a reload rate of once every 20 hours. Likewise, a reload rate of once every 5.43 hours implies that an hour of age is equivalent to:

\[ 19543 = \frac{3600}{\Delta p / p} \] \[ \frac{\Delta p}{p} \approx 0.18 \]

Or an 18 percent drop in the percent-positive rating. That figure may seem a bit harsh, but such is the cost of keeping fresh content up front at all times.
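The two calculations above are the same relationship read in opposite directions, which a short sketch makes explicit. The function names are invented for illustration.

```python
def inverse_lambda(seconds_of_age, relative_drop):
    """1/lambda implied by equating some age with a relative drop in p."""
    return seconds_of_age / relative_drop


def relative_drop(seconds_of_age, inverse_lam):
    """Relative drop in p that a given age is worth, for a given 1/lambda."""
    return seconds_of_age / inverse_lam


# One hour of age equals a five percent drop in p:
seconds = inverse_lambda(3600, 0.05)
print(round(seconds), round(seconds / 3600, 2))   # 72000 seconds, 20.0 hours

# Reddit's implied 1/lambda of 19543 seconds:
print(round(relative_drop(3600, 19543), 2))       # 0.18
```

Swapping in your own tolerance for stale content gives a candidate \(1/\lambda\) directly.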

This final calculation highlights one of the major shortcomings of the present model: it treats stories as having a fixed payoff (\(a =\) “I like it”, \(b = \) “I don’t like it”), whereas reality is a bit messier. That is, extremely high quality links with a higher-than-normal utility payoff will tend to be swept away too quickly by this algorithm. So maybe we shouldn’t blow away the existing Reddit formula just yet.

Conclusion

I realize that proposing any change to how Reddit works is one of the Internet’s most dangerous games, so I hesitate to beat the drum in favor of MillerSort™. But I believe that expected-utility theory and a simple random-reload model can help explain why the Reddit formula has been so effective in the past, and shine a light on aspects that might be improved. In particular, the Reddit formula should probably take into account the percent of votes that are positive, rather than just taking the difference between positive and negative votes.

An even better formulation would figure out how to deal with stories of extremely high quality, and a more sophisticated model would better take into account the cost of mistakenly overranking a bad item, similar to how comment-sorting works. But the formulation above seems like a good place to start.

For the next 5.43 hours, at any rate.
