The Big Bang of the Reddiverse: growth in posts per day, broken down by subreddit

June 20th, 2016

Ever since stumbling across this awesome dataset of all Reddit submissions from about 2006 to August 2015, I've been trying to find neat ways of visualizing such a vast amount of data.

I wanted to look at Reddit's growth over time. I already made a simpler plot of posts per day across all of Reddit, and I thought it would be cool to break this down by subreddit. Not just a few of my favorite subreddits, no: all 430 thousand of them.

The dataset stretches from about 2008 (data before that is incomplete) to August 2015. That's 430,434 subreddits over 3,507 days, otherwise known as a shit-ton (metric, mind you) of data. Calculating the daily number of submissions per subreddit is easy enough, but that's still 430,434 * 3,507 ≈ 1.5 billion datapoints. That's not going to fit into a single plot without downsampling away 99.9% of the detail. That's not a figure of speech: if I gave each datapoint only 1 pixel in a 1920x1080 image, only 1920 * 1080 / 1.5 billion ≈ 0.1% of them would fit (using a 4K monitor wouldn't change much: 3840 * 2160 / 1.5 billion ≈ 0.5%). My solution? Use lots of images in a sequence, more colloquially known as a 'video'.
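For anyone who wants to check the arithmetic above, here is a minimal sketch in plain Python. The subreddit and day counts come straight from the post; the function name fraction_that_fits is just something I made up for illustration.

```python
# Back-of-the-envelope check: how much of the data fits at one pixel per datapoint?
N_SUBREDDITS = 430_434  # number of subreddits in the dataset (from the post)
N_DAYS = 3_507          # number of days covered (from the post)

# One value per subreddit per day.
datapoints = N_SUBREDDITS * N_DAYS


def fraction_that_fits(width, height, n_points=datapoints):
    """Fraction of datapoints that get their own pixel in a width x height image."""
    return min(1.0, width * height / n_points)


print(f"datapoints: {datapoints:,}")
print(f"1920x1080 fits {fraction_that_fits(1920, 1080):.2%} of them")
print(f"3840x2160 fits {fraction_that_fits(3840, 2160):.2%} of them")
```

Either way, well over 99% of the datapoints are left without a pixel of their own, which is the whole reason for spreading them out over many frames.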

In the video below, each pixel represents a subreddit, and brightness represents the daily number of posts. I've sorted the subreddits by age, with the oldest ones sitting in the center and the youngest ones at the edges.
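One way to get this kind of layout is to rank every pixel by its distance from the image center and hand the pixels out in order of subreddit age. The sketch below does that in plain Python. It is an assumption about how such a layout could be built, not necessarily the exact ordering used for the video, and the names center_out_pixels, layout_by_age and the created dictionary are hypothetical.

```python
def center_out_pixels(width, height):
    """Return all (x, y) pixels of a width x height image, nearest the center first."""
    cx, cy = (width - 1) / 2, (height - 1) / 2
    pixels = [(x, y) for y in range(height) for x in range(width)]
    # Sort by squared Euclidean distance to the center (no need for the square root).
    pixels.sort(key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    return pixels


def layout_by_age(created, width, height):
    """Assign each subreddit a pixel, the oldest getting the most central one.

    created: dict mapping subreddit name -> creation timestamp (hypothetical input).
    """
    if len(created) > width * height:
        raise ValueError("image is too small to give every subreddit a pixel")
    oldest_first = sorted(created, key=created.get)
    return dict(zip(oldest_first, center_out_pixels(width, height)))


# Tiny example with made-up timestamps on a 3x3 image.
layout = layout_by_age({"old": 1, "middle": 2, "new": 3}, 3, 3)
print(layout["old"])  # the oldest subreddit lands on the center pixel
```

Since the mapping from subreddit to pixel never changes, each frame of a video only needs to fill in that day's brightness values.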

[Video: one pixel per subreddit, with brightness showing the daily number of posts]

Note that I set brightness to max out at 10 submissions per day so that ludicrously popular subreddits like /r/pics don't drown out the rest. It does turn it into kind of an all-or-nothing affair for the innermost subs, but it shows a lot more interesting detail than setting the visual maximum to the real maximum of the data (I've tried).
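The capping itself boils down to clipping each value before turning it into a pixel intensity. Here is a minimal sketch: the cap of 10 posts per day is from the post, while the linear scaling, the 8-bit grayscale range, and the name brightness are assumptions on my part.

```python
MAX_POSTS = 10  # visual maximum: anything at or above this is drawn at full brightness


def brightness(posts_per_day, max_posts=MAX_POSTS):
    """Map a daily post count to a 0-255 grayscale value, saturating at max_posts."""
    clipped = min(max(posts_per_day, 0), max_posts)
    return round(255 * clipped / max_posts)


for posts in (0, 1, 10, 5000):
    print(posts, "->", brightness(posts))
```

A subreddit with thousands of posts a day ends up exactly as bright as one with 10, which is the all-or-nothing effect on the innermost subs.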

Looking at the video, most subreddits are barren; part of the vast black void separating the few and far between that shine. They spawn with a flash of activity (as seen from the bright ring of newly born subreddits snaking its way around), but they don't last long. Some die after holding on for a few days (like most timeline subreddits), some flicker on and off like a broken light bulb, and almost none are ever revived once dead. It is only every so often that a subreddit is born and establishes a community, becoming a permanently shining star in the Reddit universe.

It's clear older subreddits tend to be more active: there's a bright core at the center of the Reddiverse. There are also some weird flashes every now and then, where groups of subreddits become very active for a short period, most notably near the end of the video. I've dug into the data and identified some of the things that caught my eye:

Interestingly, when you do plot this data in a single image (heavily downsampled), it naturally shows how the number of subreddits grows over time, with daily posts as a sort of bonus in the colormap.

Leaving out the reference scale for the colormap is intentional: since the data is heavily downsampled across the y-axis (subreddits), each value is the average of a few hundred neighboring (in terms of age) subreddits, so the absolute values have no real meaning anymore. The purpose of the color in this plot is mainly to show large trends between older and newer subreddits.
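To make the averaging concrete, here is a rough sketch of block-averaging along the subreddit axis in plain Python. The function name downsample_rows, the block size, and the toy numbers are made up for illustration; the real plot averages a few hundred subreddits per row.

```python
def downsample_rows(rows, block):
    """Average each run of `block` consecutive rows, column by column.

    rows: list of equal-length lists, one per subreddit (sorted by age),
          holding that subreddit's posts per day.
    """
    averaged = []
    for start in range(0, len(rows), block):
        chunk = rows[start:start + block]  # the last chunk may be shorter
        averaged.append([sum(column) / len(chunk) for column in zip(*chunk)])
    return averaged


# Toy example: 4 subreddits over 2 days, averaged in blocks of 2.
posts = [
    [0, 10],
    [2, 0],
    [4, 2],
    [6, 0],
]
print(downsample_rows(posts, 2))  # [[1.0, 5.0], [5.0, 1.0]]
```

Note how a single busy subreddit (10 posts on day two) lifts its whole block to 5.0. That is why the absolute values stop meaning much once hundreds of neighbors are blended together.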

I'm not sure if this is the best way to show this data. Probably not, depending on what question you're trying to answer. But at the very least, it's a good reminder that line plots, bar charts, and networked node graphs aren't the only visualization options out there. Get creative!