Analyzing GitHub Pull Request Data with BigQuery

Sunday, August 7, 2016

I really enjoyed Felipe Hoffa’s post on Analyzing GitHub issues and comments with BigQuery .

Which got me wondering about my favorite subject ever, The Art of Closing. I wonder what the stats are for the top 15 projects on GitHub in terms of pull requests opened vs. pull requests closed. This post will use the GitHub Archive dataset.

Top 15 repositories with the most pull requests

First let’s find the top 15 repos with the most pull requests from 2015. Let’s make sure to check the payload action is ”opened”.

SELECT repo.name, COUNT(*) c FROM [githubarchive:year.2015] WHERE type IN ( 'PullRequestEvent') AND JSON_EXTRACT(payload, '$.action') IN ('"opened"') GROUP BY repo.name ORDER BY c DESC LIMIT 15

repo_name c openmicroscopy/snoopys-sandbox 11656 brianchandotcom/liferay-portal 10803 Homebrew/homebrew 9519 caskroom/homebrew-cask 6833 apache/spark 6667 saltstack/salt 6636 mozilla-b2g/gaia 6609 jlord/patchwork 6155 GoogleCloudPlatform/kubernetes 5937 jsdelivr/jsdelivr 5747 rust-lang/rust 5559 cms-sw/cmssw 5507 code-dot-org/code-dot-org 5267 docker/docker 5083 NixOS/nixpkgs 4873

Okay that’s a lot of pull requests. Let’s find the projects will the most unique number of pull request authors.

SELECT repo.name, COUNT(*) c, COUNT(DISTINCT actor.id) authors, FROM [githubarchive:year.2015] WHERE type IN ( 'PullRequestEvent') AND JSON_EXTRACT(payload, '$.action') IN ('"opened"') GROUP BY repo.name ORDER BY authors DESC LIMIT 15

repo_name c authors jlord/patchwork 6155 5396 octocat/Spoon-Knife 3966 3741 deadlyvipers/dojo_rules 4847 3076 Homebrew/homebrew 9519 2186 udacity/create-your-own-adventure 2709 2167 caskroom/homebrew-cask 6833 1517 borisyankov/DefinitelyTyped 2694 1127 rails/rails 3100 1012 LarryMad/recipes 1086 989 laravel/framework 2736 891 docker/docker 5083 882 rdpeng/ProgrammingAssignment2 922 866 apache/spark 6667 851 JetBrains/swot 951 836 rust-lang/rust 5559 835

Now let’s see what the merge vs. close numbers look like for those projects.

SELECT repo.name, COUNT(*) c, COUNT(DISTINCT actor.id) authors, SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END) AS merged, SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN 1 ELSE 0 END) AS closed, FROM [githubarchive:year.2015] WHERE type IN ( 'PullRequestEvent') AND JSON_EXTRACT(payload, '$.action') IN ('"closed"') GROUP BY repo.name ORDER BY authors DESC LIMIT 15

repo_name c authors merged closed deadlyvipers/dojo_rules 1636 1022 0 1636 octocat/Spoon-Knife 1103 944 0 1103 jlord/patchwork 6595 705 4905 1690 LarryMad/recipes 588 532 0 588 apache/spark 6653 468 0 6653 Homebrew/homebrew 9548 451 5 9543 udacity/create-your-own-adventure 2765 301 1946 819 rdpeng/ProgrammingAssignment2 341 284 0 341 docker/docker 5250 254 3979 1271 NixOS/nixpkgs 4707 249 3438 1269 odoo/odoo 3412 233 712 2700 borisyankov/DefinitelyTyped 2529 221 2173 356 mozilla-b2g/gaia 7197 215 5251 1946 rails/rails 3254 212 2090 1164 caskroom/homebrew-cask 6928 210 3044 3884

Oh that is super weird. After looking into a few of the repos with 0 merged, it seems they aren’t really using GitHub for merges.

Calculating the merge ratio

So let’s exclude those and try again, this time we can even calculate the merge ratio.

SELECT repo.name, COUNT(*) c, COUNT(DISTINCT actor.id) authors, SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END) AS merged, SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN 1 ELSE 0 END) AS closed, ROUND(100*SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END)/COUNT(*),2) AS merge_ratio FROM [githubarchive:year.2015] WHERE type IN ( 'PullRequestEvent') AND JSON_EXTRACT(payload, '$.action') IN ('"closed"') GROUP BY repo.name HAVING merged > 10 ORDER BY authors DESC LIMIT 15

repo_name c authors merged closed merge_ratio jlord/patchwork 6595 705 4905 1690 74.37 udacity/create-your-own-adventure 2765 301 1946 819 70.38 docker/docker 5250 254 3979 1271 75.79 NixOS/nixpkgs 4707 249 3438 1269 73.04 odoo/odoo 3412 233 712 2700 20.87 borisyankov/DefinitelyTyped 2529 221 2173 356 85.92 mozilla-b2g/gaia 7197 215 5251 1946 72.96 rails/rails 3254 212 2090 1164 64.23 caskroom/homebrew-cask 6928 210 3044 3884 43.94 cms-sw/cmssw 5475 205 4312 1163 78.76 symfony/symfony 2587 185 1387 1200 53.61 facebook/react-native 1563 185 494 1069 31.61 robbyrussell/oh-my-zsh 731 185 307 424 42.0 githubteacher/github-for-developers-sept-2015 404 181 301 103 74.5 nightscout/cgm-remote-monitor 1096 178 419 677 38.23

Using the diff data

Sweet now let’s see on average what the size of the diffs are for these projects’ pull requests.

SELECT repo.name, COUNT(*) c, COUNT(DISTINCT actor.id) authors, SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END) AS merged, SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN 1 ELSE 0 END) AS closed, ROUND(100*SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END)/COUNT(*),2) AS merge_ratio, AVG(JSON_EXTRACT(payload, '$.pull_request.additions')) AS avg_additions, AVG(JSON_EXTRACT(payload, '$.pull_request.deletions')) AS avg_deletions, AVG(JSON_EXTRACT(payload, '$.pull_request.changed_files')) AS avg_changed_files, FROM [githubarchive:year.2015] WHERE type IN ( 'PullRequestEvent') AND JSON_EXTRACT(payload, '$.action') IN ('"closed"') GROUP BY repo.name HAVING merged > 10 ORDER BY authors DESC LIMIT 15

repo_name c authors merged closed merge_ratio avg_additions avg_deletions avg_changed_files jlord/patchwork 6595 705 4905 1690 74.37 47.45595147839272 172.14268385140258 175.20257771038666 udacity/create-your-own-adventure 2765 301 1946 819 70.38 30.39746835443038 13.116455696202532 6.742133815551537 docker/docker 5250 254 3979 1271 75.79 214.36685714285716 115.88342857142857 8.139619047619048 NixOS/nixpkgs 4707 249 3438 1269 73.04 339.9751434034417 40.72678988740174 5.380072232844699 odoo/odoo 3412 233 712 2700 20.87 1626.0741500586166 1907.4182297772568 128.01992966002345 borisyankov/DefinitelyTyped 2529 221 2173 356 85.92 887.0581257413997 864.4827995255041 2.8730723606168445 mozilla-b2g/gaia 7197 215 5251 1946 72.96 415.85396693066554 138.59233013755733 10.55578713352786 rails/rails 3254 212 2090 1164 64.23 54.88414259373079 29.18561770129072 6.880762138905962 caskroom/homebrew-cask 6928 210 3044 3884 43.94 8.448469976905312 4.0329099307159355 3.315675519630485 cms-sw/cmssw 5475 205 4312 1163 78.76 2160.7702283105023 713.1713242009132 37.51086757990868 facebook/react-native 1563 185 494 1069 31.61 189.86756238003838 86.54638515674984 10.595649392194497 robbyrussell/oh-my-zsh 731 185 307 424 42.0 54.0328317373461 11.285909712722297 1.987688098495212 symfony/symfony 2587 185 1387 1200 53.61 142.36722071897952 168.96366447622728 28.32006184770004 githubteacher/github-for-developers-sept-2015 404 181 301 103 74.5 18.217821782178216 0.7574257425742574 2.517326732673267 nightscout/cgm-remote-monitor 1096 178 419 677 38.23 519.7043795620438 246.9434306569343 8.777372262773723

Well that’s not all that interesting…

Can we prove you should always keep your pull requests small?

We know that it is always better to make a small pull request to have it merged. Let’s see if we can prove that with data!

SELECT repo.name, COUNT(*) c, COUNT(DISTINCT actor.id) authors, ROUND(100*SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END)/COUNT(*),2) AS merge_ratio, AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN JSON_EXTRACT(payload, '$.pull_request.additions') END) AS merged_avg_additions, AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN JSON_EXTRACT(payload, '$.pull_request.deletions') END) AS merged_avg_deletions, AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN JSON_EXTRACT(payload, '$.pull_request.changed_files') END) AS merged_avg_changed_files, AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN JSON_EXTRACT(payload, '$.pull_request.additions') END) AS closed_avg_additions, AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN JSON_EXTRACT(payload, '$.pull_request.deletions') END) AS closed_avg_deletions, AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN JSON_EXTRACT(payload, '$.pull_request.changed_files') END) AS closed_avg_changed_files, FROM [githubarchive:year.2015] WHERE type IN ( 'PullRequestEvent') AND JSON_EXTRACT(payload, '$.action') IN ('"closed"') GROUP BY repo.name HAVING merge_ratio > 5 ORDER BY authors DESC LIMIT 15

repo_name c authors merge_ratio merged_avg_additions merged_avg_deletions merged_avg_changed_files closed_avg_additions closed_avg_deletions closed_avg_changed_files jlord/patchwork 6595 705 74.37 9.3565749235474 0.033231396534148826 1.0014271151885832 158.03431952662723 671.6674556213018 680.798224852071 udacity/create-your-own-adventure 2765 301 70.38 7.863309352517986 0.6747173689619733 1.8144912641315518 83.94017094017094 42.67887667887668 18.45054945054945 docker/docker 5250 254 75.79 176.0874591605931 90.80949987434029 5.965317919075145 334.2045633359559 194.38001573564122 14.946498819826909 NixOS/nixpkgs 4707 249 73.04 137.50581733566025 31.841768470040723 2.91564863292612 888.5090622537431 64.79826635145784 12.056737588652481 odoo/odoo 3412 233 20.87 200.9129213483146 195.0870786516854 7.095505617977528 2001.8944444444444 2358.966296296296 159.90814814814814 borisyankov/DefinitelyTyped 2529 221 85.92 390.8085595950299 482.45467096180397 2.1339162448228257 3916.13202247191 3196.3567415730336 7.384831460674158 mozilla-b2g/gaia 7197 215 72.96 398.6246429251571 86.15311369262997 6.51628261283565 462.3448098663926 280.09198355601234 21.45580678314491 rails/rails 3254 212 64.23 23.657416267942583 11.615789473684211 2.6382775119617223 110.95274914089347 60.732817869415804 14.49828178694158 caskroom/homebrew-cask 6928 210 43.94 8.201708278580815 5.042706964520368 3.9244415243101183 8.641864057672503 3.241503604531411 2.8385684860968072 cms-sw/cmssw 5475 205 78.76 994.2810760667903 619.9148886827459 8.133812615955472 6485.7067927773005 1058.9337919174548 146.43078245915734 symfony/symfony 2587 185 53.61 63.3914924297044 72.16582552271089 8.235760634462869 233.65 280.84583333333336 51.534166666666664 facebook/react-native 1563 185 31.61 204.29757085020242 88.43522267206478 8.024291497975709 183.19925163704397 85.67352666043031 11.783910196445277 robbyrussell/oh-my-zsh 731 185 42.0 49.74267100977199 10.824104234527688 1.6612377850162867 57.139150943396224 11.620283018867925 2.224056603773585 githubteacher/github-for-developers-sept-2015 404 181 74.5 4.700996677740863 0.4186046511627907 1.1727574750830565 57.71844660194175 1.7475728155339805 6.446601941747573 nightscout/cgm-remote-monitor 1096 178 38.23 173.24582338902147 58.885441527446304 4.985680190930788 734.1299852289512 363.3338257016248 11.124076809453472

IT IS PROVEN!!!