Blog

SQL is great at grouping and counting the data you already have, and with a little help from regressions, SQL can help you project that data into the future.

Let’s estimate total users over time for a rapidly growing fictitious mobile app. The early days of this app were pretty messy, so we’ll chop them off and add them in as the starting sum, only plotting dates after October 2013. Instead of a range join to get the rolling sum, we’ll use a window function. To keep things organized, we’ll put each step into a with query:

with daily_new_users as ( select created_at::date dt, count(1) daily_ct from users where created_at > '2013-10-01' group by 1), daily_user_volume as ( select dt, dt - '2013-10-01' as dt_id, -- integer version of date 84066 -- users before October 2013 + sum(daily_ct) over ( order by dt rows between unbounded preceding and current row ) as user_ct from daily_new_users ) select dt, daily_ct from daily_new_users; select dt, user_ct from daily_user_volume

Here are the two curves:

The daily_user_volume data has the look of an exponential growth curve, making it an ideal candidate for an exponential projection.

Linearizing Exponential Data

The easiest kind of regression is a linear regression. Of course, fitting a line to exponential data would yield a terrible fit. Instead, we can linearize the exponential data by taking the log of the data fit to a linear regression, and then inverse process on the future data points to get the predicted future growth.

The log’ed version of daily_user_volume is quite linear, so this will be a great fit:

To make a linear regression, we need to find the best estimates for A and B (intercept and slope) that minimize the error in this formula:

We’ll use the Ordinary Least Squares method for minimizing error of our estimates, which lets us solve for A and B like this:

Solving for B first, we’ll define estimate_b as:

with ... estimate_b as ( select sum(covar.s) / sum(var.s) b from ( select ( dt_id - avg(dt_id::float8) over ()) * ( log(user_ct) - avg(log(user_ct)) over () ) as s from daily_user_volume) covar join ( select pow(dt_id - avg(dt_id::float8) over (), 2) as s from daily_user_volume ) var on true ),

Critically, we’re taking the log of user_ct to linearize those exponential data points!

Our window functions use over () so that the window is applied to the whole result set. It’s very convenient in situations like this, where you want to compare each row to an aggregation over every row.

Getting estimate_a is more straightforward:

with ... estimate_a as ( select avg(log(user_ct)) - avg(dt_id::float) * (select b from estimate_b) a from daily_user_volume ),

Now that we have our A and B for the regression, it’s time to project forward.

Projecting and De-linearizing

With the estimates computed, we can simply generate the y-values for the current and future dates and then invert the logarithm! We’ll generate a series of dates that will start with and then exceed the dt in daily_user_volume, use them as x-values to predict the log(y), and invert the logarithm with pow.

with ... predictions as ( select '2013-10-01'::date + i, coalesce(user_ct, 7111884) as user_ct, -- last real count pow(10, (select a from estimate_a) + ( select b from estimate_b ) * i) estimate from -- make more dt_ids for the projection generate_series(1, 275, 1) i left join daily_user_volume on i = daily_user_volume.dt_id ), select * from predictions

Look at that beautiful fit! We’ve fit an exponential curve to our cumulative user counts so that we can project the counts into the future.

Here’s the full SQL for all the steps together: