Ruby Meets BigQuery: Part 2

This is part of the Playing with Data series.

In my previous post I showed how I used BigQuery to figure out which gems are the most popular based on downloads. I also showed how to figure out which versions of gems were the most popular. But using download data had some drawbacks. One big company with a lot of servers that they frequently update (assuming they don’t vendor gems or use a machine image) could easily skew the numbers. Luckily another source of data is available.

GitHub data in BigQuery

Recently, Google Cloud Platform and GitHub made data from nearly three million open source repositories available on BigQuery. This data provides another way to measure gem popularity. I was excited about this because this gave me another way to measure the popularity of a gem. Instead of just looking at the raw number of times it had been downloaded I could also see how many projects included it in their Gemfile (assuming they had one).

The GitHub Dataset is large (more than 3TB). BigQuery can query the whole dataset quickly. Most of my queries took less than 30 seconds. But with every query I’m examining millions of rows that don’t contain Ruby files unnecessarily. Also, querying the whole dataset can get expensive.

To make my queries slightly faster and significantly cheaper I need to restrict my queries just look at the Gemfiles, Rakefiles, and .rb files. An easy way to do that is to extract just those rows into their own dataset.

Here’s the query that extracts all files named Gemfile, Gemfile.lock, or Rakefile. path I put these in one table in my new dataset.

1 2 3 SELECT * FROM [bigquery-public-data:github_repos.files] WHERE path IN ( ' Gemfile ' , ' Gemfile.lock ' , ' Rakefile ' )

And this query extracts all the .rb files. I’m using the RIGHT command to compare the last three characters of the file path to the string ‘.rb’.

1 2 3 SELECT * FROM [bigquery-public-data:github_repos.files] WHERE RIGHT (path, 3 ) = ' .rb '

These went into another table. I separated the Gemfiles and Rakefiles from the .rb files because their formats are different, so thoy require different queries to analyze.

Querying the dataset

Let’s see how many .rb files were extracted.

1 2 SELECT count (*) num FROM [rb_files]

Count 19861839

So there were 19,861,839 files in the GitHub data set that ended in .rb. Gemfiles are occasionally used by non-ruby projects (for example with Cocoapods) so I expect a sizable number of Gemfiles and Rakefiles as well.

1 2 SELECT count (*) num FROM [gem_rake_contents]

Count 251420

The counts aren’t particularly interesting. I wanted to know what gems were most common. One way to do that is to parse the Gemfiles and extract the gem names. To do this first I need to split the contents of the Gemfile into lines, which can be done with the SPLIT function. After I’ve split the lines I can use REGEXP_EXTRACT to extract the gem names. Finally, because I want to include dependencies as well as specified gems I’m going to run this query on only the Gemfile.lock.

1 2 3 4 5 6 SELECT REGEXP_EXTRACT(line, r " \s *gem \s [' \" ](.*?)[' \" ] " ) as gem FROM ( SELECT SPLIT(content, '

' ) as line FROM github_ruby.gem_rake_contents ) HAVING gem IS NOT NULL

gem sinatra mongoid thin rake bcrypt-ruby rspec rack-test mongoid-rspec simplecov fivemat factory_girl timecop

This gives us the gem names but it doesn’t give us any sort of ranking. It just lists all the gem names as it came to them. To figure out which gems are the most popular I need to do some grouping.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 SELECT gem, COUNT (*) AS n FROM ( SELECT REGEXP_EXTRACT(line, r " \s *gem \s [' \" ](.*?)[' \" ] " ) AS gem FROM ( SELECT SPLIT(content, '

' ) AS line FROM github_ruby.gem_rake_contents ) HAVING gem IS NOT NULL ) GROUP BY gem ORDER BY n DESC LIMIT 10

gem n rails 33227 jquery-rails 19709 uglifier 18369 sass-rails 17857 coffee-rails 16145 rspec 15829 rake 15683 therubyracer 14081 pg 14029 unicorn 13696

Conclusion

Using the Rubygems.org download data the most popular gems were rake, rack, multi_json, json, and bundler.

name count rake 214,152,212 rack 201,911,759 multi_json 200,342,260 json 191,430,173 bundler 186,172,479

Using the GitHub data the most popular gems are rails, jquery-rails, uglifier, and sass-rails.

gem n rails 33227 jquery-rails 19709 uglifier 18369 sass-rails 17857 coffee-rails 16145 rspec 15829 rake 15683 therubyracer 14081 pg 14029 unicorn 13696

The most popular gems aren’t particularly consistent across sources. This is likely because some of the gems (rake and json) are default gems installed with every Ruby install. This will cause them to both have higher downloads than any other gem and also to not appear in Gemfiles since they are assumed to be downloaded already.

This discrepancy is why using two or more data sources is a good idea if you are trying to generalize about what is most popular. During elections seasons people making predictions combine information from several polls and use other data to draw their conclusions. We can do the same thing when determining the most popular Ruby gems by combining data from Rubygems.org and GitHub.

If you want to play with the GitHub data you can access it here.