Inferring Tweet Quality From Retweets

By Evan Miller

July 17, 2015

Twitter reports the number of times that each tweet has been retweeted. This is a pretty good measure of the social approbation that a particular tweet has garnered, but I often find myself clicking on the author profile to help me decide whether a tweet receiving hundreds of retweets is truly excellent, or whether the author is simply riding the unquestioning adoration of hundreds of thousands of pre-existing followers. What I really want to know is: what percentage of people who see a given tweet will retweet it to their followers?

If you divide a given tweet’s retweet count by the number of followers that the author has, you might get a rough estimate of this figure. But that leaves out a couple of factors. Old tweets have more time to accumulate retweets than new tweets, so this naive method will tend to underestimate the quality of newer tweets. In addition, if a tweet “goes viral”, and is retweeted outside the author’s immediate follower list, the calculated quality will tend to be exaggerated.

Here I want to develop a model to estimate a tweet’s quality (percent of readers who will retweet it) that takes into account the passage of time, as well as retweeting behavior outside the author’s immediate network. Can we come up with a formula for tweet quality using only the information that Twitter provides publicly?

Modeling Tweets, Reads, and Retweets

Similar to Deriving the Reddit Formula, assume that Twitter users reload Twitter at random throughout the day, with a reload rate of \(\lambda\). On each reload, the user reads through all new tweets, and decides whether to retweet any of them. Assume that a given tweet will be retweeted by a fraction of users who read it equal to \(p\) (the quality metric).

Let \(n_1\) be the number of followers that a Twitter user has. Let \(n_2\) be the number of second-order followers, that is, the number of additional users that could be reached if all of the first-order followers retweeted. In general, let \(n_d\) represent the number of users at distance \(d\) in the follow network. Let \(q_d(t)\) represent the fraction of users at distance \(d\) who saw the original tweet within \(t\) seconds of its publication, and let \(R_d(t) = n_d q_d(t)p\) be the number of users at distance \(d\) who retweeted it before time \(t\).

How many first-order followers will have read a tweet that is \(t\) seconds old? If users are reloading Twitter at random — that is, if their arrival is a Poisson process with parameter \(\lambda\) — the time-between-arrivals is described by an exponential distribution. So the fraction of users who will have reloaded Twitter in the last \(t\) seconds is given by:

\[ \int_0^t f(\tau) d\tau = \int_{0}^{t} \lambda e^{-\lambda \tau} d\tau = 1 - e^{-\lambda t} \]

(For clarity assume the original tweet occurred at time \(0\), and that the current time is \(t\).) Since all first-order followers are reading all new tweets, that expression is also the fraction of \(n_1\) that will have read a tweet of age \(t\), and so:

\[ q_1(t) = 1-e^{-\lambda t} \]
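As a sanity check on the exponential-arrival reasoning, here is a minimal Python sketch that compares the closed form for \(q_1(t)\) with a simulation of first reload times. The function names `q1` and `simulate_q1`, the user count, and the sample values of \(\lambda\) and \(t\) are illustrative assumptions, not part of the model itself.

```python
import math
import random

def q1(t, lam):
    """Closed form: fraction of first-order followers who reloaded by age t."""
    return 1.0 - math.exp(-lam * t)

def simulate_q1(t, lam, users=100_000, seed=0):
    """Monte Carlo estimate: share of users whose first reload falls before t."""
    rng = random.Random(seed)
    # Each user's waiting time until the first reload is exponential with rate lam.
    seen = sum(1 for _ in range(users) if rng.expovariate(lam) <= t)
    return seen / users

# Hypothetical values: one reload per hour, tweet is half an hour old.
lam, t = 1.0, 0.5
print(q1(t, lam), simulate_q1(t, lam))
```

The two printed numbers should agree to within ordinary sampling noise.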

Now, how many second-order followers will have read a tweet that is \(t\) seconds old? Here things get a bit more interesting. We first need to figure out the number of first-order followers who 1) read the original tweet and 2) decided to retweet it. Then we need to figure out how many second-order followers saw the tweet after it was retweeted. Even if there are many more second-order followers than first-order followers, they’ll be underrepresented at small \(t\) in part because they won’t have had very much time to see the retweets.

Consider a second-order follower. Assuming the first-order follower saw the tweet at time \(\tau \le t\), the probability that the second-order follower reloads Twitter between \(\tau\) and \(t\) is given by:

\[ q_1(t-\tau) = 1-e^{-\lambda(t-\tau)} \]

Multiplying by the (first-order) retweet probability \(p\), and integrating the above expression over \(\tau\), we have:

\[ q_2(t) = \int_0^t \lambda e^{-\lambda \tau} pq_1(t-\tau) d\tau = \int_0^t \lambda e^{-\lambda \tau}p(1-e^{-\lambda(t-\tau)}) d\tau = p(1 - e^{-\lambda t} - \lambda t e^{-\lambda t}) \]

That is, the fraction of second-order followers who see the tweet is the probability of the first-order follower seeing it (\(1-e^{-\lambda t}\)) and retweeting it (\(p\)), reduced by an additional term, \(p \lambda t e^{-\lambda t}\), to account for the probability that the second-order follower will actually see the retweet.
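The integral for \(q_2(t)\) can be checked numerically. The sketch below approximates it with a midpoint rule and compares the result with the closed form; the function names, the step count, and the sample values of \(p\), \(\lambda\), and \(t\) are assumptions chosen for illustration.

```python
import math

def q2_closed(t, lam, p):
    """Closed form: p * (1 - e^{-x} - x e^{-x}) with x = lam * t."""
    x = lam * t
    return p * (1.0 - math.exp(-x) - x * math.exp(-x))

def q2_numeric(t, lam, p, steps=10_000):
    """Midpoint-rule approximation of the integral over tau from 0 to t."""
    h = t / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        first_sees = lam * math.exp(-lam * tau)         # density of the first-order reload time
        second_sees = 1.0 - math.exp(-lam * (t - tau))  # q_1(t - tau)
        total += first_sees * p * second_sees * h
    return total

# Hypothetical values: p = 2%, one reload per hour, tweet is one hour old.
print(q2_closed(1.0, 1.0, 0.02), q2_numeric(1.0, 1.0, 0.02))
```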

Consider now a third-order follower. Assuming again the first-order follower saw the tweet at time \(\tau \le t\), we can think of the second- and third-order followers as being identical to the first- and second-order followers in the previous \(q_2\) calculation, except operating in a compressed time frame \(t-\tau\). So the conditional probability of the third-order follower seeing the tweet is given by:

\[ q_2(t-\tau) = p(1-e^{-\lambda(t-\tau)} - \lambda (t-\tau) e^{-\lambda(t-\tau)}) \]

Again we can multiply by the retweet probability \(p\) and integrate over \(\tau\):

\[ q_3(t) = \int_0^t \lambda e^{-\lambda \tau} pq_2(t-\tau) d\tau = \int_0^t \lambda e^{-\lambda \tau}p^2(1-e^{-\lambda(t-\tau)} - \lambda (t-\tau)e^{-\lambda(t-\tau)}) d\tau \\ q_3(t) = p^2 (1 - e^{-\lambda t} - \lambda t e^{-\lambda t} - \frac{1}{2} \lambda^2 t^2 e^{-\lambda t}) \]

Repeating this calculation, we find the general form:

\[ q_d(t) = p^{d-1}\left(1-\sum_{n=0}^{d-1} \frac{(\lambda t)^n e^{-\lambda t}}{n!}\right) \]

The term inside the parentheses is equivalent to the upper tail of a Poisson distribution parameterized by \(\lambda t\). This is interesting for a couple of reasons. We’ve found an unexpected equality: the probability of there occurring an ordered series of \(m\) events from \(m\) ordered, independent Poisson processes (that is, the probability that user 1 loads the page, and user 2 loads the page sometime after user 1, and user 3 loads the page sometime after user 2, and so on) is equal to the probability that one Poisson process will produce \(m\) or more events in the same amount of time (that is, user 1 loads the page at least \(m\) times). This makes sense if you think about it — after user 1 has loaded the page once, the probability of user 1 loading the page again, and the probability of user 2 loading the page, are identical, thanks to the memorylessness of Poisson processes. (Perhaps we should have used this insight and spared ourselves some integrals!)
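The equality between the ordered chain of reloads and the Poisson upper tail can be illustrated with a small simulation. In this sketch each user gets an independent Poisson process on \([0, t]\), and the chain succeeds if every user reloads at some point after the previous user did. The helper names, trial count, and the sample values \(m = 3\), \(\lambda t = 2\) are assumptions for illustration only.

```python
import math
import random

def poisson_upper_tail(m, x):
    """P(N >= m) for N ~ Poisson(x)."""
    return 1.0 - sum(x**n * math.exp(-x) / math.factorial(n) for n in range(m))

def reload_times(rng, lam, t):
    """All reload times of one user on [0, t], drawn from a Poisson process."""
    times, clock = [], rng.expovariate(lam)
    while clock <= t:
        times.append(clock)
        clock += rng.expovariate(lam)
    return times

def chain_probability(m, lam, t, trials=100_000, seed=1):
    """Share of trials in which m users reload one after another before t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        last, ok = 0.0, True
        for _ in range(m):
            # Earliest reload of this user that comes after the previous user's.
            later = [s for s in reload_times(rng, lam, t) if s > last]
            if not later:
                ok = False
                break
            last = later[0]
        hits += ok
    return hits / trials

print(chain_probability(3, 1.0, 2.0), poisson_upper_tail(3, 2.0))
```

Note that the simulation does not rely on memorylessness directly; it draws a full, separate reload history for every user and only then looks for an ordered chain.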

The other interesting aspect is that we can rewrite this upper tail in terms of the regularized lower incomplete gamma function:

\[ q_d(t) = p^{d-1}P(d, \lambda t) \]

So then the number of retweets by users at distance \(d\) is given by:

\[ R_d(t) = n_d q_d(t) p = n_d p^d P(d, \lambda t) \]
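For integer \(d\), \(P(d, \lambda t)\) can be computed directly from the Poisson sum, so no special-function library is strictly needed. The following is a minimal sketch; the function names are my own, and for non-integer arguments something like `scipy.special.gammainc` (which computes the regularized lower incomplete gamma function) would be the more general tool.

```python
import math

def reg_lower_gamma(d, x):
    """P(d, x) for a positive integer d, via 1 minus the first d Poisson terms."""
    return 1.0 - sum(x**n * math.exp(-x) / math.factorial(n) for n in range(d))

def retweets_at_distance(d, n_d, p, lam, t):
    """Expected retweets at distance d: R_d(t) = n_d * p**d * P(d, lam * t)."""
    return n_d * p**d * reg_lower_gamma(d, lam * t)

# Hypothetical values: 1,000 direct followers, p = 2%, lam * t = 1.
print(retweets_at_distance(1, 1000, 0.02, 1.0, 1.0))
```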

Finally we have a simple equation that includes the tweet quality \(p\) in terms of the reload rate \(\lambda\), the tweet age \(t\), the follower counts \(n_d\), and the total number of retweets \(R(t)\):

\[ \begin{equation} \label{retweet_sum} R(t) = \sum_d R_d(t) = \sum_{d=1}^\infty n_d p^d P(d, \lambda t) \end{equation} \]

In practice, only the first few terms of the sum will matter. (A recent paper by Twitter states that “all retweet trees but for a handful have a height smaller than 6”.) To solve for \(p\), the main practical problem is coming up with numbers to use for \(n_d\), since Twitter only publishes \(n_1\) on each profile (that is, the number of direct followers).
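If values for \(n_d\) were available, the truncated sum could be solved for \(p\) numerically, since the right-hand side is increasing in \(p\). Below is a rough sketch using bisection; the function names and the follower counts in the usage line are purely hypothetical placeholders, not estimates of any real network.

```python
import math

def reg_lower_gamma(d, x):
    """P(d, x) for a positive integer d, via the Poisson-tail identity."""
    return 1.0 - sum(x**n * math.exp(-x) / math.factorial(n) for n in range(d))

def expected_retweets(p, n, lam, t):
    """Truncated sum over distances; n[0] is n_1, n[1] is n_2, and so on."""
    return sum(n_d * p**d * reg_lower_gamma(d, lam * t)
               for d, n_d in enumerate(n, start=1))

def solve_quality(retweets, n, lam, t, tol=1e-12):
    """Bisection on p in [0, 1]; relies on the sum being increasing in p."""
    lo, hi = 0.0, 1.0
    if expected_retweets(hi, n, lam, t) < retweets:
        raise ValueError("retweet count exceeds what the model allows at p = 1")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_retweets(mid, n, lam, t) < retweets:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical follower counts at distances 1, 2, and 3.
n = [1000, 50_000, 1_000_000]
print(solve_quality(25, n, lam=1.0, t=1.0))
```

Bisection is used here only because it is simple and cannot diverge; any one-dimensional root finder would do.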

Estimates and Example

One way to estimate \(n_2\) is to multiply \(n_1\) by the expected number of followers per user across all of Twitter. A different paper by Twitter fit the follow-count to a log-normal with estimated parameters \(\mu = 2.83\) and \(\sigma^2 = 3.36\), implying a mean follower count of:

\[ \bar{n} = e^{\mu+\sigma^2/2} = 90.9 \]
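The arithmetic is easy to reproduce. The snippet below evaluates the log-normal mean from the quoted parameters and, for comparison, the median \(e^{\mu}\), which is a standard property of the log-normal and not a figure taken from the paper. The variable names are my own.

```python
import math

mu, sigma_sq = 2.83, 3.36  # fitted parameters quoted above

mean_followers = math.exp(mu + sigma_sq / 2)  # log-normal mean
median_followers = math.exp(mu)               # log-normal median, for comparison

print(round(mean_followers, 1), round(median_followers, 1))
```

The gap between the two reflects how heavily the mean is pulled up by a small number of accounts with very large follower counts.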

A naive estimate of \(n_d\) would be \(n_1 \bar{n}^{(d-1)}\) — that is, simply multiply follower counts by follower counts — but this is certain to break down as \(d\) gets larger and the connectedness, clustering, and finiteness of the network come into play.

Nonetheless, solving the equation for \(d \le 2\) — that is, considering only direct retweets and retweets of retweets — may be instructive. Using the estimate \(n_2 \approx n_1 \bar{n}\), we can write:

\[ R(t) = n_1 p P(1, \lambda t) + n_1 \bar{n} p^2 P(2, \lambda t) \]

Using the quadratic formula:

\[ p = \frac{-n_1 P(1, \lambda t) + \sqrt{(n_1 P(1, \lambda t))^2 + 4R(t) n_1 \bar{n} P(2, \lambda t)}}{2n_1 \bar{n} P(2, \lambda t)} \]

Or:

\[ p = \frac{1}{2\bar{n}} \left( \sqrt{\left(\frac{P(1, \lambda t)}{P(2, \lambda t)}\right)^2 + 4\bar{n} \frac{R(t)/n_1}{P(2, \lambda t)}} - \frac{P(1, \lambda t)}{P(2, \lambda t)}\right) \]

Where:

\[ \begin{array}{rcl} P(1, \lambda t) &=& 1 - e^{-\lambda t}\\ P(2, \lambda t) &=& 1 - e^{-\lambda t} - \lambda t e^{-\lambda t} \end{array} \]

One interesting aspect of the quadratic form of the equation is that the number of retweets and the number of immediate followers only enter the equation as a ratio \(R(t)/n_1\). That is, when two tweets are the same age, it’s safe to compare them simply by dividing the number of retweets by the respective author’s follower count. (But that’s only because the quadratic form deliberately ignores “viral” tweets that are outside the two-retweet network.)

To compare tweets of different vintages, it’s necessary to run the numbers. As a quick example, consider two hypothetical tweets. Tweet A was published an hour ago by someone with 1,000 followers, and has 25 retweets. Tweet B was published 30 minutes ago by someone with 2,000 followers, and also has 25 retweets. Which one is better?

Assuming \(\bar{n} = 91\) and a reload rate of once per hour (\(\lambda = 1\), with \(t\) measured in hours):

\[ p_A = \frac{1}{2\times 91} \left( \sqrt{\left(\frac{P(1, 1)}{P(2, 1)}\right)^2 + 4\times 91 \frac{25/1000}{P(2, 1)}} - \frac{P(1, 1)}{P(2, 1)}\right) = 0.02167 \\ p_B = \frac{1}{2\times 91} \left( \sqrt{\left(\frac{P(1, 0.5)}{P(2, 0.5)}\right)^2 + 4\times 91 \frac{25/2000}{P(2, 0.5)}} - \frac{P(1, 0.5)}{P(2, 0.5)}\right) = 0.02183 \]

Tweet B just barely ekes out a victory.
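The comparison is easy to script. Here is a short sketch of the two-level closed form applied to the two hypothetical tweets; the function name `quality_two_level` and its parameter names are my own, and the defaults simply mirror the assumptions stated above (\(\bar{n} = 91\), \(\lambda = 1\) per hour, age in hours).

```python
import math

def quality_two_level(retweets, followers, age, lam=1.0, mean_followers=91.0):
    """Solve the d <= 2 quadratic for p, with n_2 approximated as n_1 * mean_followers."""
    x = lam * age
    p1 = 1.0 - math.exp(-x)        # P(1, lam * t)
    p2 = p1 - x * math.exp(-x)     # P(2, lam * t)
    ratio = p1 / p2
    root = math.sqrt(ratio**2 + 4 * mean_followers * (retweets / followers) / p2)
    return (root - ratio) / (2 * mean_followers)

p_a = quality_two_level(25, 1000, age=1.0)  # Tweet A: one hour old
p_b = quality_two_level(25, 2000, age=0.5)  # Tweet B: thirty minutes old
print(p_a, p_b, "B wins" if p_b > p_a else "A wins")
```

Changing `lam` or `mean_followers` shows how sensitive the ranking is to those two assumed constants.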

Conclusion

A random-reload model gets us to an equation for tweet quality mostly in terms of things that are observable. The problematic term in equation \(\eqref{retweet_sum}\) (besides the unknown reload rate \(\lambda\)) is the value of \(n_d\) for \(d \ge 2\). The approximation \(n_2 \approx n_1 \bar{n}\) probably isn’t too far off the mark, but to estimate higher \(d\), more will need to be known about the structure of the Twitter follower network. That is, given that a user has \(n_1\) direct followers, how many second-, third-, and fourth-order followers can that user be expected to have? The current descriptions of the follower network characterize the distribution of \(n_1\), and describe the average path length between any two users, but for the present calculations, these are less useful than knowing the distribution of \(n_d\) conditional on \(n_1\).

Of course, since Twitter already tracks impression counts for each tweet, they could also just publish the retweet percentage next to each tweet, and render the foregoing analysis totally unnecessary. But what’s the fun in that?
