All the open source code in GitHub is now available in BigQuery. Go ahead, analyze it all. In this post you’ll find the related resources I know of so far:

Update: I know I said all, but it’s not all. I’m collecting answers to this and other questions at github.com/fhoffa/analyzing_github.

The pipeline mirrors code from projects that have a clear open source license. Forks and un-notable projects are not included.

Nevertheless, it represents terabytes of code.

Official sources:

In-depth analysis

I’m waiting for your contributions — I will add them here:

A series of posts by Robert Kozikowski:

Tips

Don’t analyze the main [bigquery-public-data:github_repos.contents] table: at 1.5 TB, a single full scan will consume your entire monthly free terabyte. Instead, use the official [bigquery-public-data:github_repos.sample_contents] extract (~23 GB), or one of the full language tables I left at [fh-bigquery:github_extracts.contents_*].
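To make the quota math concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the two sizes quoted above and assumes decimal units (1 TB = 1000 GB) and a worst case where a query scans the whole table; the function names are made up for illustration, and real queries that read fewer columns may scan less.

```python
# Sizes quoted in the tip above. Decimal units (1 TB = 1000 GB) are an assumption.
FREE_QUOTA_GB = 1000.0  # the monthly free terabyte

TABLE_SIZES_GB = {
    "bigquery-public-data:github_repos.contents": 1500.0,       # 1.5 TB
    "bigquery-public-data:github_repos.sample_contents": 23.0,  # ~23 GB
}


def quota_fraction(scanned_gb, quota_gb=FREE_QUOTA_GB):
    """Fraction of the free quota used by one scan of `scanned_gb` gigabytes."""
    return scanned_gb / quota_gb


def full_scans_within_quota(scanned_gb, quota_gb=FREE_QUOTA_GB):
    """How many complete scans of that size fit inside the free quota."""
    return int(quota_gb // scanned_gb)


# Worst case: every query reads the whole table.
for table, size_gb in TABLE_SIZES_GB.items():
    print(table, quota_fraction(size_gb), full_scans_within_quota(size_gb))
```

Under these assumptions, one full pass over the main table already exceeds the quota, while the sample extract leaves room for dozens of exploratory queries.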

How about doing a JOIN between this new dataset and the GitHub Archive to find the most starred files and their patterns? Sample code soon, but see how I played with GitHub stars and Hacker News previously.
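Until the real sample code is ready, the shape of that JOIN can be sketched in plain Python with a few toy rows. Everything here is hypothetical: the repository names, the star counts, and the choice to score a file path by summing the stars of every repository that contains it. It is one possible reading of "most starred files", not the actual query.

```python
from collections import defaultdict

# Hypothetical rows standing in for the two datasets; none of these are real results.
# (repo_name, path) pairs, as a listing of files per repository might provide.
files = [
    ("example/alpha", "README.md"),
    ("example/alpha", "setup.py"),
    ("example/beta", "README.md"),
    ("example/gamma", "Makefile"),
]

# (repo_name, stars) pairs, as star counts derived from GitHub Archive events might provide.
stars = [("example/alpha", 120), ("example/beta", 30), ("example/delta", 5)]


def stars_by_path(files, stars):
    """Inner-join files to star counts on repo name, then sum stars per path."""
    stars_for_repo = dict(stars)
    totals = defaultdict(int)
    for repo, path in files:
        if repo in stars_for_repo:  # inner join: skip repos with no star data
            totals[path] += stars_for_repo[repo]
    # Most starred first; ties broken alphabetically by path.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


print(stars_by_path(files, stars))
```

In BigQuery the same idea would be a JOIN on the repository name followed by a GROUP BY on the path, run against the sample or per-language tables rather than the full contents table.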

I’m pretty excited about getting author and committer timezones. We’ll be able to perform some regional analysis here.

Visualizations