The GitHub Archive (on BigQuery) has allowed everyone to track GitHub’s pulse since 2011. Last week it introduced some breaking changes while reprocessing its entire history to conform to the schema GitHub adopted in 2015.

https://github.com/igrigorik/githubarchive.org/issues/112#issuecomment-228460698

From Ilya’s announcement:

all gzip archives have been reprocessed and updated (same location as before).

“daily” dataset now contains the full history (2011-now 🎉 ) for each day.

the top-level schema is the same for all days

“payload” field contains the JSON-encoded payload, if provided by the event.

“other” field contains the JSON-encoded content of any other fields not captured by the schema.

new location: day.events_YYYYMMDD → day.YYYYMMDD (the events_ prefix is no longer necessary). Update your queries and remove the prefix.

“monthly” and “yearly” datasets have been reprocessed on top of the new “daily” data.

githubarchive.org landing page is updated with new queries and information about the schema.

email addresses published by GitHub have been scrubbed and replaced with a hash of them.

(thanks Arfon Smith!)

A lot of people have been using this data for awesome results. For example, “Changelog Nightly” is a newsletter that “unearths the hottest repos on GitHub before they blow up”. This GitHub Archive change broke their process, but the fix was simple enough: drop the prefix from the daily table names.

(The error was “Error: FROM clause with table wildcards matches no table”, since the events_ prefix is no longer needed or present on the daily tables.)
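To make the rename concrete, here is a minimal Python sketch of how a script that builds its queries as strings might adapt. It is not Changelog Nightly's actual fix. The helper names `daily_table` and `strip_events_prefix` are hypothetical, and the `githubarchive` project ID in the usage note is an assumption about how the public dataset is addressed.

```python
import re
from datetime import date


def daily_table(day: date, legacy: bool = False) -> str:
    """Return the daily table name for a given date.

    legacy=True yields the old naming (day.events_YYYYMMDD);
    the default yields the new naming (day.YYYYMMDD).
    """
    prefix = "events_" if legacy else ""
    return f"day.{prefix}{day:%Y%m%d}"


def strip_events_prefix(query: str) -> str:
    """Rewrite references to day.events_* so they point at day.* instead."""
    return re.sub(r"\bday\.events_", "day.", query)
```

For example, `strip_events_prefix("SELECT type FROM [githubarchive:day.events_20160601]")` produces a query against `[githubarchive:day.20160601]`. The same substitution also covers a wildcard-style reference such as `[githubarchive:day.events_]`, which is the kind of pattern that would stop matching any table after the change.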

BigQuery is an awesome tool to analyze private and public data. Stay tuned for more awesome public datasets!

UPDATE: All of GitHub’s contents are now available on BigQuery, too.

https://medium.com/@hoffa/github-on-bigquery-analyze-all-the-code-b3576fd2b150