Real World Data

While usually less docile than our simulated distributions, real-world datasets often resemble at least one of the four classes above.

Normal distributions — the ballyhooed ‘bell curve’ — arise most often in natural & biological measurements & in scenarios with few interacting factors. Height & weight are canonical examples. As such, my first instinct is to reach for the trusty iris dataset. It meets the requirement, but there’s just something underwhelming about n = 50 (the number of observations of a single species of flower in the dataset). I’m thinking bigger.

So let’s load the bigrquery package, & do bigRqueries.

library(bigrquery)
library(magrittr)  # for the %>% pipe used below

Google’s BigQuery makes numerous public datasets of real data available, some quite large, from genomics to patents to Wikipedia article stats.

For our initial purposes, the natality dataset appears sufficiently biological:

project <- "YOUR-PROJECT-ID"

sql <- "SELECT COUNT(*)
FROM `bigquery-public-data.samples.natality`"

query_exec(sql, project = project, use_legacy_sql = FALSE)

(pro tip: there’s lots of data here, & you get charged by the amount of data accessed, but your first TB per month is free. Further, while SELECT * is highly discouraged for obvious cost reasons, SELECT COUNT(*) is actually a free operation, & a good way to scope things out)

Output:

0 bytes processed

        f0_
1 137826763

That’s more like it, 137 million baby records. We don’t quite need all of those, so let’s randomly sample 1% of the baby weights, take the first million results & see what we get:

sql <- "SELECT weight_pounds
FROM `bigquery-public-data.samples.natality`
WHERE RAND() < .01"

natal <- query_exec(sql, project = project, use_legacy_sql = FALSE,
                    max_pages = 100)

Output:

Running job /: 7s:1.0 gigabytes processed

Warning message:

Only first 100 pages of size 10000 retrieved. Use max_pages = Inf to retrieve all.

hist(natal$weight_pounds) yields:

Normal af imo.
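Eyeballing a histogram is a decent first check; a Q-Q plot is a sharper one. A minimal sketch of that check, using a simulated stand-in for natal$weight_pounds (the mean & sd below are assumed values for illustration, not fitted to the BigQuery pull) so it runs without credentials:

```r
# Simulated stand-in for natal$weight_pounds: birth weights roughly
# normal around 7.3 lbs with sd ~1.3 (assumed for illustration)
set.seed(42)
weights <- rnorm(1e5, mean = 7.3, sd = 1.3)

# A Q-Q plot compares sample quantiles to theoretical normal quantiles;
# points hugging the reference line indicate approximate normality
qqnorm(weights)
qqline(weights)
```

Swapping the real natal$weight_pounds in for the simulated vector gives the same diagnostic on the actual data.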

Now to find some multiplicative data with some skew, let’s move beyond the biological, to the sociological.

We’ll look at the New York dataset, which contains all sorts of urban info, including trips in yellow & green cabs.

sql <- "SELECT COUNT(*)
FROM `nyc-tlc.green.trips_2015`"

query_exec(sql, project = project, use_legacy_sql = FALSE)

Output:

0 bytes processed

      f0_
1 9896012

Under 10 million, so let’s grab trip distances for the whole thing:

(this can take a while)

sql <- "SELECT trip_distance FROM `nyc-tlc.green.trips_2015`"

trips <- query_exec(sql, project = project, use_legacy_sql = FALSE)

hist(trips$trip_distance)

-_-

Looks like there are some extreme outliers pulling our x-axis out to 800 miles. Hell of a cab ride. Let’s trim this down to trips of 20 miles or less & take another look:

trips$trip_distance %>% subset(. <= 20) %>% hist()

There we are, & with a telltale sign of a lognormal distribution: the long tail. Let’s check our data for lognormalcy by plotting the histogram of the logs:

trips$trip_distance %>% subset(. <= 20) %>% log() %>% hist()

Normalish for sure, but we overshot the mark a bit & now observe a bit of left-skew. Alas, the real world strikes again. But safe to say that a lognormal distribution wouldn’t be a completely preposterous model to employ here.
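The logic behind that log-transform check, sketched on simulated data (the meanlog & sdlog values are assumptions chosen for illustration, not fits to the trip data): if X is lognormal, then log(X) is normal by definition, so the log-histogram should come out bell-shaped.

```r
# Simulated lognormal "trip distances" (parameters assumed for illustration)
set.seed(1)
d <- rlnorm(1e5, meanlog = 1, sdlog = 0.6)

hist(d, breaks = 50)        # long right tail, like the trimmed trips
hist(log(d), breaks = 50)   # roughly bell-shaped

# Taking log() of a lognormal sample recovers the underlying normal parameters
mean(log(d))  # ~1
sd(log(d))    # ~0.6
```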

Moving on. Let’s find some more heavy-tailed data, this time in a domain closer to my own professional affinities of digital web analytics: the GitHub dataset:

sql <- "SELECT COUNT(*)
FROM `bigquery-public-data.samples.github_nested`"

query_exec(sql, project = project, use_legacy_sql = FALSE)

Output:

0 bytes processed

      f0_
1 2541639

2.5 million rows. Since I’m starting to worry about my local machine’s memory, we’ll employ our random sampler & million-row cap to get a count of “watchers” of each GitHub repository:

sql <- "SELECT repository.watchers
FROM `bigquery-public-data.samples.github_nested`
WHERE RAND() < .5"

github <- query_exec(sql, project = project, use_legacy_sql = FALSE,
                     max_pages = 100)

github$watchers %>% hist()

Again, extreme long tail dynamics, so let’s do some trimming:

github$watchers %>% subset(5 < . & . < 3000) %>% hist()

Exponential af.

But is it lognormal again?

github$watchers %>% subset(5 < . & . < 3000) %>% log() %>% hist()

Nay.

But behold the rare beast: the (approximately) LOGUNIFORM DISTRIBUTION!
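For comparison, a loguniform sample can be simulated directly by exponentiating uniform draws; a sketch with bounds roughly matching the trimmed watcher counts (5 & 3000 are just the trim limits used above):

```r
# A loguniform variable is one whose log is uniform, so exponentiate
# uniform draws between log(5) and log(3000)
set.seed(7)
lu <- exp(runif(1e5, min = log(5), max = log(3000)))

hist(lu, breaks = 50)       # steep, exponential-looking decay
hist(log(lu), breaks = 50)  # flat: the signature of loguniformity
```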

Let’s take one more draw from the big data urn, this time looking at scores of Hacker News posts:

sql <- "SELECT COUNT(*)
FROM `bigquery-public-data.hacker_news.full`"

query_exec(sql, project = project, use_legacy_sql = FALSE)

Output:

0 bytes processed

       f0_
1 16489224

…

sql <- "SELECT score
FROM `bigquery-public-data.hacker_news.full`
WHERE RAND() < .1"

hn <- query_exec(sql, project = project, use_legacy_sql = FALSE,
                 max_pages = 100)

hn$score %>% hist()

…

hn$score %>% subset(10 < . & . <= 300) %>% hist()

A slower decay (once trimmed). And as for the log-transform?

hn$score %>% subset(10 < . & . <= 300) %>% log() %>% hist()

Again roughly loguniform with a rightward decay.

My cursory search for a power law distribution in the wild proved fruitless, which is perhaps not surprising, as they are most often posited in network science (&, even there, appear to be scarcer than initially claimed).

In any event, let’s organize our real-world datasets as we did with our simulated distributions & plot them comparatively. We’ll normalize them & again center them around 100:
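The original code for this step isn’t reproduced here; one way to sketch it, with a hypothetical center_at_100() helper & simulated stand-ins where the real natal, trips, & github vectors pulled above would go:

```r
# Hypothetical helper: rescale a vector so its mean lands at 100
center_at_100 <- function(x) x / mean(x, na.rm = TRUE) * 100

# Simulated stand-ins for the pulled vectors (parameters assumed);
# the real natal$weight_pounds etc. would be dropped in here instead
set.seed(99)
real_data <- list(
  natal  = rnorm(1e4, 7.3, 1.3),
  trips  = rlnorm(1e4, 1, 0.6),
  github = exp(runif(1e4, log(5), log(3000)))
)
real_data <- lapply(real_data, center_at_100)

sapply(real_data, mean)  # each rescaled mean is 100
```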

Now to plot:
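The comparative figure itself is omitted here; a base-R sketch of one way to draw it, overlaying density curves of the centered samples (again using simulated stand-ins, rescaled to mean 100, in place of the real pulls):

```r
# Stand-ins for the centered vectors (assumed parameters), each rescaled
# so its mean lands at 100
set.seed(3)
centered <- lapply(
  list(natal  = rnorm(1e4, 7.3, 1.3),
       trips  = rlnorm(1e4, 1, 0.6),
       github = exp(runif(1e4, log(5), log(3000)))),
  function(x) x / mean(x) * 100
)

# Overlaid densities make the shapes directly comparable
plot(density(centered$natal), xlim = c(0, 400),
     main = "Real-world samples, centered at 100", xlab = "value")
lines(density(centered$trips), col = "red")
lines(density(centered$github), col = "blue")
legend("topright", legend = names(centered),
       col = c("black", "red", "blue"), lty = 1)
```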