What countries have more open source developers per capita than the US?

We are going to use GitHub Archive, GHTorrent, population statistics and BigQuery to find the answer.

This post was inspired by Silver Keskkula’s cool post analyzing GitHub Archive with BigQuery and asking “I jumped into this hoping to learn something about the changing geography of git pushes for Teleport, but it looks like the Github Archive guys have not made the location field on push history available […] but wouldn’t it be cool?”. I hope you like the answer Silver!

Top countries by number of pushes Aug 2016

It’s no surprise to see the US as the top country with the number of most pushes into GitHub, but it would be more interesting to normalize this number by country population. Let’s do that next.

country pushes

---------------- ---------

United States 1709062

Hungary 428786

Germany 319787

United Kingdom 304752

China 250708

France 151529

Canada 143815

India 109823

Japan 108934

Brazil 104802

Russia 101331

Australia 90174

Netherlands 75198

Spain 60553

Switzerland 53652

Ukraine 51811

Sweden 51584

Italy 46350

Poland 44280

Estonia 36515

Taiwan 35856

Belgium 32486

Greece 30541

Norway 29659

Mexico 27956

Top countries by number of pushes / population

The US now shows up in 6th place. Hungary, Estonia, Switzerland, Norway, and New Zealand had more pushes per capita than the US. But what if we look at the pushes by unique users instead? That, on the next section.

country pushes pop pushes/pop * 1000

---------------- --------- ----------- -------------------

Hungary 428786 9982000 42.955920660

Estonia 36515 1291170 28.280551750

Switzerland 53652 7581000 7.077166601

Norway 29659 5009150 5.920964635

New Zealand 24077 4252277 5.662142894

United States 1709062 310232863 5.508965051

Luxembourg 2693 497538 5.412651898

Sweden 51584 9555893 5.398134952

Iceland 1645 308910 5.325175617

United Kingdom 304752 62348447 4.887884377

Denmark 24939 5484000 4.547592998

Netherlands 75198 16645000 4.517753079

Canada 143815 33679000 4.270168354

Finland 22097 5244000 4.213768116

Australia 90174 21515754 4.191068554

Germany 319787 81802257 3.909268665

Ireland 14956 4622917 3.235186788

Belgium 32486 10403000 3.122753052

Austria 24734 8205000 3.014503352

Greece 30541 11000000 2.776454545

Czech Republic 25370 10476000 2.421725850

Latvia 5216 2217969 2.351701038

France 151529 64768389 2.339551784

Slovenia 4331 2007000 2.157947185

Israel 14189 7353985 1.106882867

Top countries by number of unique ids / population

Finally we discover that Iceland is the country with more GitHub users per capita contributing to GitHub during the last month. The US goes even further down the ranking, with NZ, Sweden, Switzerland, Norway, Finland, Denmark, Netherlands, Canada, the UK, Australia, Luxembourg, Estonia, and Ireland showing a larger presence of developers per capita.

country population uniques uniques/pop * 10000

---------------- ------------ --------- ---------------------

Iceland 308910 149 4.823411350

New Zealand 4252277 1370 3.221803283

Sweden 9555893 3014 3.154074664

Switzerland 7581000 2322 3.062920459

Norway 5009150 1509 3.012487149

Finland 5244000 1412 2.692601068

Denmark 5484000 1472 2.684172137

Netherlands 16645000 4277 2.569540403

Canada 33679000 7865 2.335283114

United Kingdom 62348447 14361 2.303345262

Australia 21515754 4896 2.275541912

Luxembourg 497538 111 2.230985372

Estonia 1291170 282 2.184065615

Ireland 4622917 1009 2.182604620

United States 310232863 65703 2.117860737

Belgium 10403000 1615 1.552436797

Austria 8205000 1230 1.499085923

Germany 81802257 11770 1.438835606

Czech Republic 10476000 1401 1.337342497

Slovenia 2007000 260 1.295465869

France 64768389 7852 1.212319794

Israel 7353985 814 1.106882867

Lithuania 2944459 298 1.012070469

Bulgaria 7148785 706 0.987580407

Latvia 2217969 219 0.987389814

How-to

Step 1: Find the data

GitHub Archive has a full timeline of GitHub’s events, but it doesn’t show us where the users are. The GitHub on BigQuery dataset snapshots the contents of GitHub, but doesn’t have this data either (though we can see timezones). Turns out GHTorrent has this data, and we just need to load it into BigQuery to get started.

bq load --max_bad_records 10000 --replace \

fh-bigquery:ghtorrent.users \

gs://mybucket/ghtorrent/users.csv \

id,login,company,created_at,type,fake,deleted,long,lat,country_code,state,city,location (33s)

The slowest part was downloading and uncompressing GHTorrent, but as I shared the September snapshot, you can skip this step.

Note that only ~9% of GitHub users have a country code on GHTorrent.

SELECT SUM(country_code!='\\N')/COUNT(*)

FROM [fh-bigquery:ghtorrent.users] b 0.0916

For normalization purposes, I’ll only look at countries with a population larger than 300,000.

Licenses

Note that we are using 2 datasets with an CC SA license, hence our results will need to respect its terms:

Step 2: Count the number of pull events during August by country:



FROM [githubarchive:month.201608] a

JOIN [fh-bigquery:ghtorrent.users] b

# http://ghtorrent.org/faq.html

ON a.actor.login=b.login

WHERE country_code != '\\N'

AND a.type='PushEvent'

GROUP BY 1

ORDER BY 2 DESC SELECT country_code, COUNT(*) pushes,FROM [githubarchive:month.201608] aJOIN [fh-bigquery:ghtorrent.users] bON a.actor.login=b.loginWHERE country_code != '\\N'AND a.type='PushEvent'GROUP BY 1ORDER BY 2 DESC 3.2s elapsed, 832 MB processed

Step 3: Bring in population statistics to normalize



FROM [githubarchive:month.201608] a

JOIN [fh-bigquery:ghtorrent.users] b

# http://ghtorrent.org/faq.html

ON a.actor.login=b.login

JOIN (

SELECT LOWER(iso) iso, population, country

FROM [gdelt-bq:extra.countryinfo]

WHERE population > 300000

) c

# http://download.geonames.org/export/dump/readme.txt

ON c.iso=b.country_code

WHERE country_code != '\\N'

AND a.type='PushEvent'

GROUP BY 1

ORDER BY ratio_pushes DESC

LIMIT 300 SELECT c.country, COUNT(*) pushes, FIRST(c.population) population, 1000*COUNT(*)/FIRST(c.population) ratio_pushesFROM [githubarchive:month.201608] aJOIN [fh-bigquery:ghtorrent.users] bON a.actor.login=b.loginJOIN (SELECT LOWER(iso) iso, population, countryFROM [gdelt-bq:extra.countryinfo]WHERE population > 300000) cON c.iso=b.country_codeWHERE country_code != '\\N'AND a.type='PushEvent'GROUP BY 1ORDER BY ratio_pushes DESCLIMIT 300 3.3s elapsed, 832 MB processed

Step 4: Get the number of unique GitHub users per country too



FROM [githubarchive:month.201608] a

JOIN [fh-bigquery:ghtorrent.users] b

# http://ghtorrent.org/faq.html

ON a.actor.login=b.login

JOIN (

SELECT LOWER(iso) iso, population, country

FROM [gdelt-bq:extra.countryinfo]

WHERE population > 300000

) c

# http://download.geonames.org/export/dump/readme.txt

ON c.iso=b.country_code

WHERE country_code != '\\N'

AND a.type='PushEvent'

GROUP BY 1

ORDER BY ratio_unique_login DESC

LIMIT 300 SELECT c.country, COUNT(*) pushes, FIRST(c.population) population, 1000*COUNT(*)/FIRST(c.population) ratio_pushes, 10000*COUNT(DISTINCT login)/FIRST(c.population) ratio_unique_loginFROM [githubarchive:month.201608] aJOIN [fh-bigquery:ghtorrent.users] bON a.actor.login=b.loginJOIN (SELECT LOWER(iso) iso, population, countryFROM [gdelt-bq:extra.countryinfo]WHERE population > 300000) cON c.iso=b.country_codeWHERE country_code != '\\N'AND a.type='PushEvent'GROUP BY 1ORDER BY ratio_unique_login DESCLIMIT 300 2.9s elapsed, 832 MB processed

Interactive charts

Shared on a Google Sheet.

More?

Want more stories? Check my medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.