Analyzing GitHub issues and comments with BigQuery

(Docker, Kubernetes, TensorFlow and other top projects)

Some people have asked “can we analyze GitHub issues on BigQuery?”. Good news: While that data is not part of the new open source GitHub code on BigQuery dataset, you can find this in the classic GitHub Archive dataset.

Let’s quickly find the 10 repos with the most comments in their issues:

SELECT repo.name, COUNT(*) c

FROM [githubarchive:month.201606]

WHERE type IN ( 'IssueCommentEvent')

GROUP BY repo.name

ORDER BY c DESC

LIMIT 10

GitHub repos with the most comments on issues June 2016

Kubernetes, Spark, OpenShift, Docker, … makes sense — except what is that “sauron-demo” project? Let’s look instead at how many different people are commenting on these projects:

SELECT repo.name, COUNT(*) c, COUNT(DISTINCT actor.id) authors

FROM [githubarchive:month.201606]

WHERE type IN ( 'IssueCommentEvent')

GROUP BY repo.name

ORDER BY authors DESC

LIMIT 10

GitHub repos with more authors commenting on issues June 2016

The ranking has changed: Docker was the repo with more people commenting on issues, followed by Font Awesome, React Native, VS Code, Angular, Kubernetes, npm, TensorFlow and Ansible. In this group Kubernetes is the highlight — it’s the one with more comments per user. Let‘s dig deeper into the number of comments by author:

SELECT repo.name,

ROUND(COUNT(*)/COUNT(DISTINCT actor.login),2) comments_per_author,

EXACT_COUNT_DISTINCT(actor.login ) authors,

COUNT(*) comments

FROM [githubarchive:month.201606]

WHERE type IN ('IssueCommentEvent')

GROUP BY 1

HAVING authors>400

ORDER BY 2 DESC

LIMIT 10

GitHub projects with more comments per author within top commented projects

That’s a huge difference! Within projects that have more than 400 active commenters, Kubernetes records ~35 comments per author, while other projects don’t come close.

Turns out the Kubernetes repo has 2 bots that comment a lot. I wonder what happens when we remove every author that posted more than 1000 comments during this period:

SELECT repo.name,

ROUND(COUNT(*)/EXACT_COUNT_DISTINCT(actor.login),2) comments_per_author,

EXACT_COUNT_DISTINCT(actor.login ) authors,

COUNT(*) comments

FROM [githubarchive:month.201606]

WHERE type IN ('IssueCommentEvent')

AND actor.login NOT IN ( SELECT actor.login FROM (

SELECT actor.login, COUNT(*) c

FROM [githubarchive:month.201606]

WHERE type IN ('IssueCommentEvent')

GROUP BY 1

HAVING c>1000

ORDER BY 2 DESC

))

GROUP BY 1

HAVING authors>400

ORDER BY 2 DESC

LIMIT 10

GitHub projects with more comments per author within top commented projects (removing bots)

Kubernetes is still the top one! Now we can safely declare that the projects with the most active authors (for projects with more than 400 people commenting) are Kubernetes, TensorFlow, and Docker.

2017–02 Standard SQL update:

Top 10 big projects on GitHub by user engagement — 2017–02.

#standardSQL

SELECT repo.name,

ROUND(COUNT(*)/COUNT(DISTINCT actor.login),2) comments_per_author,

COUNT(DISTINCT actor.login ) authors,

COUNT(*) comments

FROM `githubarchive.month.201702`

WHERE type IN ('IssueCommentEvent')

AND actor.login NOT IN ( SELECT login FROM (

SELECT actor.login, COUNT(*) c

FROM `githubarchive.month.201702`

WHERE type IN ('IssueCommentEvent')

GROUP BY 1

HAVING c>1000

ORDER BY 2 DESC

))

GROUP BY 1

HAVING authors>400

ORDER BY 2 DESC

LIMIT 10

How efficient are these projects at closing issues?

Let’s look at who’s doing the best job at closing issues within these 3 top projects: