'Most popular' metrics in App Engine

Posted by Nick Johnson | Filed under python, coding, app-engine, tech, datastore

One common and useful statistic to provide to users is a "7 day downloads" or "7 day popularity" metric. It appears across many sites and types of webapp, yet the best way to calculate it is far from obvious.

The naive approach is to record each download individually and, at read time, run something akin to "SELECT count(*) FROM downloads WHERE item_id = 123 AND download_date > seven_days_ago". But this counts each download individually - O(n) work in the number of downloads! Caching the count is an option, but still leads to excessive amounts of work at read time.
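For illustration, here is roughly what the naive approach looks like with the datastore API; the Download model and its properties are hypothetical:

seven_days_ago = datetime.datetime.now() - datetime.timedelta(days=7)
# Counting still touches an index entry per matching download.
num_downloads = (Download.all()
                 .filter('item_id =', 123)
                 .filter('download_date >', seven_days_ago)
                 .count())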

Another option is to maintain an array of daily download counts, keeping the last 7. This is an improvement from a workload point of view, but leads to either discontinuities at the start of a new day, or to all counts being updated only once per day.
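A minimal sketch of that approach might look like the following; the DailyDownloadCounts model and the cron-driven rotation are assumptions for illustration:

class DailyDownloadCounts(db.Model):
  # One count per day, oldest first; rotated once per day by a cron job.
  counts = db.ListProperty(int, default=[0] * 7)

  def rotate(self):
    # Drop the oldest day's count and start a new day at zero.
    self.counts = self.counts[1:] + [0]

  def seven_day_total(self):
    return sum(self.counts)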

There is a third option, however, which has the performance of the second option with the responsiveness of the first. To use it, though, we have to reconsider slightly what we mean by '7 day popularity'. The solution is to use an exponential decay process: each time an event happens, we increase the item's popularity figure, and that figure decays continuously over time, so it always reflects recent popularity.

To figure out the rate at which our popularity figure should decay, we can look to the 'mean lifetime' formula. If we think of the current popularity figure as a set of events, the mean lifetime lets us choose a decay rate such that an element of the set (for example, an individual download) remains in it for a specified average amount of time - for example, 7 days.

The relevant formula is N(t) = N0 * e^(-t/τ), where N0 is the value at the start, t is the amount of time elapsed, and τ is the mean lifetime (the decay rate is simply its inverse). Here's how we calculate it in practice:

import datetime
import math

def timedelta_to_seconds(delta):
  return delta.days * 86400 + delta.seconds + delta.microseconds / 1000000.0

MEAN_DOWNLOAD_LIFETIME = timedelta_to_seconds(datetime.timedelta(days=7))

def decay(value, elapsed):
  # What fraction of the mean lifetime has elapsed?
  decay_fraction = timedelta_to_seconds(elapsed) / MEAN_DOWNLOAD_LIFETIME
  return value * (math.e ** -decay_fraction)

In the above code, we define MEAN_DOWNLOAD_LIFETIME as the number of seconds in 7 days, and the function decay 'decays' a value, given the amount of time elapsed since the value was last evaluated.
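As a quick sanity check (the value here is purely illustrative), after exactly one mean lifetime a figure should decay to 1/e - roughly 37% - of its original size:

# After one full mean lifetime, a value decays to 1/e of its original size.
print decay(100.0, datetime.timedelta(days=7))  # prints roughly 36.79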

Of course, in practice we can't have a value that continuously decays, much less store one in the datastore. The solution is to use an approximation: we will store the current value of the popularity metric, along with when we last applied the decay function. Each time we increment the popularity metric, we first apply the decay function to bring it up to date. Here's an example model:

class File(db.Model):
  popularity = db.FloatProperty(required=True, default=0.0)
  last_decay = db.DateTimeProperty(required=True, auto_now_add=True)

  def update_popularity(self, delta):
    # Decay the stored figure for the time elapsed, then add the new events.
    now = datetime.datetime.now()
    self.popularity = decay(self.popularity, now - self.last_decay) + delta
    self.last_decay = now

The standard caveats for frequently updated values apply here: update_popularity should be called inside a transaction that fetches and puts the File object. Depending on the expected update frequency, you may also need a technique to alleviate contention when updating the record - my unsharded counter recipe would be a good choice here.
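For example, a download handler might apply the update transactionally like this (add_download is an illustrative name, not part of the model above):

def add_download(file_key):
  def txn():
    file = File.get(file_key)
    file.update_popularity(1)
    file.put()
  db.run_in_transaction(txn)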

The solution we have now, then, allows us to search and rank by popularity efficiently, at the cost of using a slightly different metric (an exponentially decayed popularity figure, rather than absolute download counts). You've probably realized that the ranking may not be entirely fair, however: a slightly less popular entity that was updated less recently may be ranked higher than a more popular entity that was updated more recently, because the stale entity's stored figure hasn't yet been decayed. For the most part, this will be taken care of by regular updates to the entities, but we still need a way to clean up after the stragglers. This can be achieved with a scheduled task that applies the decay function to any particularly out-of-date entities:
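Here is a sketch of what such a task might look like; the one-day staleness threshold and the batch size of 100 are arbitrary choices for illustration:

STALENESS_THRESHOLD = datetime.timedelta(days=1)

def decay_stale_files():
  # Find files whose popularity hasn't been decayed recently.
  cutoff = datetime.datetime.now() - STALENESS_THRESHOLD
  stale_files = File.all().filter('last_decay <', cutoff).fetch(100)
  for file in stale_files:
    db.run_in_transaction(apply_decay, file.key())

def apply_decay(key):
  # Updating with a delta of 0 simply applies the pending decay.
  file = File.get(key)
  file.update_popularity(0)
  file.put()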