Photo by Luca Bravo on Unsplash

The Internet provides an enormous amount of data available for analysis. Sometimes it even comes in an easy-to-use form, like, for example, the collection of all public GitHub repositories in the Google BigQuery datasets. One day I wondered: what are the most used R functions among contributors on GitHub?

The BigQuery dataset of all public GitHub repositories contains over 3 TB of data. Luckily, we don't need everything, only the R files. We will first extract the IDs of R files from bigquery-public-data:github_repos.files, and then use these IDs to extract the files' content from bigquery-public-data:github_repos.contents.

Selecting the IDs of all R files:

SELECT *
FROM [bigquery-public-data:github_repos.files]
WHERE LOWER(RIGHT(path, 2)) = '.r'

Selecting the content of the R files:

SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM [bigquery-github-1383:Github.r_files])

The content is there! We could, of course, continue here, but remember that Google BigQuery is not free, and I have already used around 2 TB of data processing. 1 TB currently costs $5, but you can save money if you register for Google Cloud for the first time and receive $300 of credit for one year. You can also limit the maximum amount of data processed, so a badly written query won't get you into trouble.

Since we will need to pull the data into R anyway, and I will use it later for visualizations, I will switch to R now rather than jump back and forth between BigQuery and R.

The final files amount to about 1.2 GB of data. I could download them, but they would first be placed on Google Cloud Storage and only then become available for download, so instead I will establish a connection from R to BigQuery using the bigrquery package.

library(bigrquery)

big_query_data <- "certain-torus-206419"
sql <- "SELECT content FROM [certain-torus-206419:Github.content_copy]"
destination_table <- "certain-torus-206419:Github.r_2"

content_r <- query_exec(sql, project = big_query_data, useLegacySql = FALSE,
                        destination_table = destination_table, max_pages = Inf)

The package handles authorisation on its own, and it took only about two minutes to download and process the data.

Most used R packages and their functions

Once we've got the content, we can start finding functions. There are two ways: find functions directly in the GitHub files, or find them somewhere else and check how often they appear in the GitHub code.

There is a problem with the first approach: the same name is sometimes used for different functions in different packages, and it might be hard to distinguish between them afterwards. So I will first find all functions together with their packages, and then check their appearance in the GitHub files.

The first idea that came to me was to simply run apropos("") to list all functions. This, however, lists only the objects in the packages you currently have loaded, which might not be complete. So I decided to start with packages. I went to see all CRAN packages. I used Google Sheets and the formula =IMPORTHTML(url, "table", num) to import the table from the webpage listing all packages:
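As a side note, roughly the same list can also be pulled directly from CRAN inside R; a minimal sketch (the mirror URL is just an example):

# List all packages currently available on CRAN (requires internet access)
cran_packages <- available.packages(repos = "https://cloud.r-project.org")
nrow(cran_packages)            # roughly how many packages CRAN currently offers
head(rownames(cran_packages))  # first few package names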

R has more than 12,000 packages! Great for R, not for me. Of course, we could try listing all of their functions, but I would rather optimise at this point. It's not very likely that we will find any of the 100 most popular functions in rarely used packages, so I will limit the search. Let's first find the top 100 R packages and the functions in them. Luckily, this was previously done by Kan Nishida here. I will use a similar approach, however, with more native R functions instead of the exploratory package. I also used slightly different syntax for finding packages, to include cases with whitespace after the bracket. It helped to bring in around 1,500 more appearances of packages in the code.
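For illustration, this is roughly the kind of extraction I mean; the regex below is my own approximation, and I assume the content_r data frame with its content column from the BigQuery pull above:

library(stringr)

# Extract package names from library()/require() calls, allowing optional
# whitespace after the opening bracket and optional quotes around the name
pkg_pattern <- "(?:library|require)\\s*\\(\\s*['\"]?([A-Za-z][A-Za-z0-9.]*)"
pkg_matches <- str_match_all(content_r$content, pkg_pattern)
pkg_counts  <- sort(table(unlist(lapply(pkg_matches, function(m) m[, 2]))),
                    decreasing = TRUE)
head(pkg_counts, 10)  # the most frequently loaded packages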

I found the top 100 packages. It is actually interesting to see how this has changed over the last two years.

Not that much: the top 4 didn't change at all, and I don't see drastic changes for the other packages either.

Once we know the packages, it is time to get all the functions within them. Before that, I need to make sure these packages are installed and loaded in my environment.

Installation is the tricky part. I wanted to be in control of what I install, so I went through the packages one by one instead of installing them all at once with a function. Once this unfortunately manual step is done, I load them into the environment using:

lapply(packages_100$value, require, character.only = TRUE)

And then get all loaded packages and their functions with:

functions <- sapply(search(), ls)

I must say, I didn't manage to install all 100 packages this way. The reason is that some packages are not on CRAN or only support R version 2 (I am using version 3). Luckily, all of the top 50 packages are there, so I think neglecting a few others is not a problem. Moreover, I get a bit suspicious when a package has been removed from CRAN for some reason (like, for example, limma).

Anyway, after removing the global environment and dataset entries, we have 99 packages, which yield over 12 thousand functions. Quite good! I can now filter the original dataset of R file contents to include only these packages.
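For reference, a minimal sketch of that clean-up, using the functions list from sapply(search(), ls) above; the object and column names (functions_df, fun) are my own:

library(dplyr)
library(tidyr)

# Turn the named list into a tidy package/function table, keeping only
# entries that correspond to attached packages
functions_df <- tibble(package = names(functions), fun = functions) %>%
  unnest(fun) %>%
  filter(grepl("^package:", package)) %>%
  mutate(package = sub("^package:", "", package))

n_distinct(functions_df$package)  # around 99 packages
nrow(functions_df)                # over 12 thousand functions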

Identifying functions

Now I have two options. I could either try to detect all words that could look like a function (in R, a function name can contain upper/lowercase letters, underscores, dots and digits; I will exclude operators) and then look those words up in the list, or try to pull only the names that are followed by "(". The first approach will inflate the counts of functions whose names look like ordinary English words. The second one will, however, miss functions passed as arguments to other functions, as in apply(), where a function is used without a bracket. Both approaches might be biased. So I will run both on a smaller subset and see whether one of them is better.
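To make the two options concrete, here is a small sketch; the regular expressions are my own approximations of the idea, not the exact ones used:

library(stringr)

word_pattern <- "[A-Za-z][A-Za-z0-9._]*"             # 1) anything that looks like a name
call_pattern <- "[A-Za-z][A-Za-z0-9._]*(?=\\s*\\()"  # 2) only names followed by "("

sample_code <- "result <- sapply(df, mean); a <- data.frame(x = 1)"
str_extract_all(sample_code, word_pattern)[[1]]  # result, sapply, df, mean, a, data.frame, x
str_extract_all(sample_code, call_pattern)[[1]]  # sapply, data.frame -- note that mean is missed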

I ran the two methods on a subset of 10,000 rows and then pulled up the results where the difference in count was higher than 20%:

The results are in favour of the second method: most functions that rank higher with the first method are either one-letter functions, the article "a", common English words, or share a name with their package (which means they are counted at least twice). The only issue that comes up is the h2o package. It is pretty strange, because most of its functions appear in the standard form with a bracket afterwards. But let's run the same script on the next 10,000 rows:

Everything is fine now, so the problem with the h2o package was only a bias of a non-representative subset. Thus, the second approach yields more accurate results.

Finding most used functions

Another thing is the difference between the base package and the others: when using base functions, one normally does not explicitly load the base package. So I will count them separately.

There's also a performance issue: our dataset is pretty big, around 1.2 GB. Checking against all functions at once is expensive and would slow the process, while checking package by package means less processing.

The process is as follows (a rough sketch in code follows the list):

- Extract all candidate functions from the strings,

- Unnest the column containing the functions,

- Group by package and function,

- Count the appearances of every instance,

- Filter on instances that appear in our list of all functions (using an inner join on package and function names).
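A rough sketch of this pipeline, assuming content_r from the BigQuery pull and the functions_df table of package/function pairs sketched earlier (the additional filtering on packages actually loaded in each file is omitted here for brevity):

library(dplyr)
library(tidyr)
library(stringr)

call_pattern <- "[A-Za-z][A-Za-z0-9._]*(?=\\s*\\()"

function_counts <- content_r %>%
  mutate(fun = str_extract_all(content, call_pattern)) %>%  # extract candidate calls
  unnest(fun) %>%                                           # one row per call
  inner_join(functions_df, by = "fun") %>%                  # keep only known functions
  count(package, fun, sort = TRUE)                          # count per package/function

head(function_counts, 50)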

The result for the top 50 functions is here:

Surprises? Not really, we have already checked that ggplot2 is the most common package.

I will repeat the process for the base functions.

Not a lot of surprises here either.

Assumptions

I need to mention a couple of issues that I did not take into account:

- Functions inside self-made functions are counted only once. This is questionable, but I believe it does not cause a big difference. To account for it, I would need to detect each function(x){} definition, list the functions used inside it, count how often the new function is called in a file, and then add that count minus one to the totals of every inner function. I believe this complexity would not yield better results. Moreover, one can argue that counting only once is more logical, since each inner function is actually written only once.

- I did not account for cases where somebody redefines the behaviour of a function. Normally, you should not do that, and I hope very few people did.

- I relied on the assumption that if someone uses a function from a non-base package, they explicitly load that package. However, this is not always the case. Running another script on the test 10,000 rows without filtering on the package name, and comparing it with the filtered results, shows that the ggplot2 package, for example, was not explicitly loaded in up to 70% of the files that use it (a rough version of this check is sketched below).

Well, not the best practice. I also believe that the distribution of non-loaded packages probably skews toward the most used packages. Thus, the result is likely biased, but it helps to surface less famous packages.
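A rough version of that check, under the same naming assumptions as before; using ggplot( as a crude proxy for ggplot2 usage is my own simplification:

library(stringr)

uses_ggplot2  <- str_detect(content_r$content, "ggplot\\s*\\(")
loads_ggplot2 <- str_detect(content_r$content, "(library|require)\\s*\\(\\s*['\"]?ggplot2")
mean(!loads_ggplot2[uses_ggplot2])  # share of ggplot()-using files that never load ggplot2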

Moreover, it shows that we luckily avoided the problem of mis-attributing popular function names that are used in many packages:

Results

So here we are! The 100 most used R functions on GitHub, across all packages, are:

The absolute majority of them are base functions, which is somewhat expected, with just a handful of functions from other packages.

I have also uploaded a file with the 2,000 most used functions on GitHub (top_2000_functions.csv), so you don't need to run all the code to explore them.

All code is on GitHub.