If you have been following Google’s cloud platform, you are no stranger to BigQuery. In my opinion BigQuery is the most differentiating tool that Google has in its arsenal. With a petabyte scale warehousing capabilities, hundreds of open source data-sets and a familiar SQL interface the entry barrier for developers is really low.

Past week I was browsing through the public data sets those are available in BigQuery when I happen to stumble upon the Github data set. It has commits data, programming languages data, repositories content data and lot of other cool tables.

After going through the entire data set I could think of a lot of analytical queries that I could run on the data set to unlock some genius insights. For example, I could use the repositories content data set to find out what is the most popular Java package, using commits data set I could find out which author has pushed most number of commits or I could use languages data set to find out the top programming languages based on the repository counts ( which happens to be the topic of this post).

On our way to answer the titular question we’ll learn three concepts in BigQuery.

UNNEST() and REPEATED data type Windows and Rank() function Named Sub-queries

Setup

First lets add the data set to our BigQuery console. Navigate to BigQuery in the GCP web console.

On the left panel click on Add Data > Pin a Project.

Enter the project name bigquery-public-data. Click on pin and you should see the project pinned in the pane.

You can also go to the data set page here and click on view data set and that should also add the project to your BigQuery dashboard.

This project contains all the public data sets that BigQuery hosts and you need to navigate to github_repos data set. In it you’ll see 9 tables.

Public Github data set in BigQUery

For the purpose of this post we’ll be using languages table, which has following schema.

Schema for languages table

Lets us query the records to get a feel of the data.

SELECT * FROM `bigquery-public-data.github_repos.languages` LIMIT 1000

You’ll notice that the language field in the table is a nested repeated column of type RECORD (also known as STRUCT) and rightly so because one repository can have code written in multiple languages. You can read more about them here.

This bring us to our first concept of working with repeated (array) types.

UNNEST() and REPEATED data type

In order to use the language field first we’ll need to flatten this column so that instead of getting all the values of that field in one row, we can get each value in separate row. We’ll use UNNEST() function which does exactly what its named, it un-nests the nested records into separate rows.

SELECT

repo_name,

arr.name AS LANGUAGE,

arr.bytes AS language_bytes

FROM

`bigquery-public-data.github_repos.languages`,

UNNEST(LANGUAGE) arr

LIMIT

10

With each language records there is language name and also language bytes which stores that how many bytes of code in a repository is written in any particular language. In this post we’ll tag a repository with the language with maximum number of bytes in code, which brings us to next concept of windows and ranking.

Windows and Rank() function

For each repository we want to find out which language has maximum bytes and we’ll achieve this by ranking languages within each repository ordering them by bytes of code. We’ll be using earlier query as sub query to build our query.

SELECT

repo_name,

LANGUAGE,

RANK() OVER (PARTITION BY t1.repo_name ORDER BY t1.language_bytes DESC) AS rank

FROM (

SELECT

repo_name,

arr.name AS LANGUAGE,

arr.bytes AS language_bytes

FROM

`bigquery-public-data.github_repos.languages`,

UNNEST(LANGUAGE) arr ) AS t1

LIMIT

10

This will add a rank column to our result and we need to select rows with rank=1 as those rows represents the top language for each repository.

SELECT

t2.repo_name,

t2.LANGUAGE

FROM (

SELECT

repo_name,

LANGUAGE,

RANK() OVER (PARTITION BY t1.repo_name ORDER BY t1.language_bytes DESC) AS rank

FROM (

SELECT

repo_name,

arr.name AS LANGUAGE,

arr.bytes AS language_bytes

FROM

`bigquery-public-data.github_repos.languages`,

UNNEST(LANGUAGE) arr ) AS t1 ) AS t2

WHERE

rank = 1

LIMIT

10

Now we have our data set is prepped to find the top most language as per repositories count. We’ll create a named query which can be used in subsequent queries and sub-queries for better readability.

Named Sub-queries

BigQuery supports the WITH keyword to create named queries which help you to SELECT from them. The named query is not materialized and the entire query is computed every time it is reference. You can read more about WITH keyword in the official documentation here.

WITH

repositories AS (

SELECT

t2.repo_name,

t2.LANGUAGE

FROM (

SELECT

repo_name,

LANGUAGE,

RANK() OVER (PARTITION BY t1.repo_name ORDER BY t1.language_bytes DESC) AS rank

FROM (

SELECT

repo_name,

arr.name AS LANGUAGE,

arr.bytes AS language_bytes

FROM

`bigquery-public-data.github_repos.languages`,

UNNEST(LANGUAGE) arr ) AS t1 ) AS t2

WHERE

rank = 1)

If you copy paste this query you’ll get a SYNTAX ERROR, because you can not have a named sub query without using it in any subsequent SELECT statement or expression.

This query will return two columns repository name (repo_name) and its language.

Finding top repositories by count should be a piece of cake from here.

The query should result in following result set.

Top 10 programming languages by repository counts on Github

The result of this query is not shocking, with the emergence of web apps and SPAs JavaScript is topping the charts.

And there you have it. Top 10 programming languages by repository counts.

If you find any bug in the code or have any question feel free to drop a comment below.

Till then happy coding.