What are the top 100 (most downloaded) R packages in 2013? Thanks to RStudio's recent release of the log files from their “0-cloud” CRAN mirror (which do not include downloads from the primary CRAN mirror or any of the 88 other CRAN mirrors), we can now answer this question (at least for January through May)!

By relying on the nice code that Felix Schonbrodt recently wrote for tracking package downloads, I have updated my installr R package with functions that enable the user to easily download and visualize the popularity of R packages over time. In this post I will share some nice plots and quick insights that can be drawn from this great data. The code for this analysis is given at the end of this post.
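To give a feel for what "downloading the logs" involves, here is a minimal sketch that only builds the addresses of the daily log files for a date range. It is not the installr implementation. The helper name `cran_log_urls` is hypothetical, and the URL pattern (one gzipped CSV per day under a per-year folder on `cran-logs.rstudio.com`) is an assumption about how the log files are published.

```r
# Build the URLs of the daily log files for a date range.
# ASSUMPTION: one gzipped CSV per day at <base_url>/<year>/<YYYY-MM-DD>.csv.gz
# `cran_log_urls` is a hypothetical helper, not a function from installr.
cran_log_urls <- function(start, end,
                          base_url = "http://cran-logs.rstudio.com") {
  days <- seq(as.Date(start), as.Date(end), by = "day")
  paste0(base_url, "/", format(days, "%Y"), "/",
         format(days, "%Y-%m-%d"), ".csv.gz")
}

urls <- cran_log_urls("2013-01-01", "2013-01-03")
print(urls)

# Downloading is left commented out, since it needs network access:
# for (u in urls) download.file(u, destfile = basename(u), mode = "wb")
# one_day <- read.csv(gzfile(basename(urls[1])), stringsAsFactors = FALSE)
```

Covering the full January to May range this way means 151 files, which is one reason reading the data takes a long time.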

Top 8 most downloaded R packages – downloads over time

Let’s first have a look at the number of downloads per day over these 5 months for the top 8 most downloaded packages:

We can see the strong weekly seasonality of the downloads, with Saturday and Sunday having far fewer downloads than other days. This is not surprising, since the countries that use R the most have these days as rest days (see James Cheshire’s world map of R users). It is also interesting to note how some packages had exceptional peaks on some dates. For example, I wonder what happened on January 23rd 2013 that made the digest package suddenly get so many downloads, or why colorspace started getting more downloads from April 15th 2013.
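The weekday versus weekend gap can be checked numerically as well as by eye. The sketch below uses a tiny made-up log (`toy_log`, not real data) and assumes one row per download with `date` and `package` columns. It counts downloads per day and then compares the average weekday with the average Saturday or Sunday.

```r
# Toy stand-in for the real logs: one row per download (made-up data).
# ASSUMPTION: the log has (at least) the columns `date` and `package`.
toy_log <- data.frame(
  date = as.Date(c("2013-01-04", "2013-01-04", "2013-01-04",  # Friday
                   "2013-01-05",                                # Saturday
                   "2013-01-06",                                # Sunday
                   "2013-01-07", "2013-01-07")),                # Monday
  package = c("plyr", "digest", "plyr", "plyr", "digest", "plyr", "ggplot2"),
  stringsAsFactors = FALSE
)

# Number of downloads per day (all packages together)
per_day <- aggregate(package ~ date, data = toy_log, FUN = length)
names(per_day)[2] <- "downloads"

# POSIXlt weekdays are numbered 0 (Sunday) to 6 (Saturday),
# which avoids depending on the locale's day names
wday <- as.POSIXlt(per_day$date)$wday
per_day$is_weekend <- wday %in% c(0, 6)

# Average daily downloads on weekdays vs. weekends
mean_by_type <- c(
  weekday = mean(per_day$downloads[!per_day$is_weekend]),
  weekend = mean(per_day$downloads[per_day$is_weekend])
)
print(per_day)
print(mean_by_type)
```

On the real data the same aggregation could be done per package (for example `package + date` on the right-hand side of the formula) to look for the package-specific peaks.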

“Family tree” of the top 100 most downloaded R packages

We can extract from this data the top 100 most downloaded R packages. Moreover, we can create a matrix showing, for each package, which of the unique ids (censored IP addresses) downloaded it. Using this indicator matrix, we can think of the “similarity” (or distance) between every two packages, and based on that we can create a hierarchical clustering of the packages, showing which packages “go along” with one another.
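The steps in the previous paragraph can be sketched on a toy example. This is an illustration under stated assumptions rather than the exact code behind the figure: the data are made up, the log is assumed to have `ip_id` and `package` columns, and the binary (Jaccard) distance and the default complete linkage are my choices for the sketch.

```r
# Made-up download log: which (censored) id downloaded which package.
# ASSUMPTION: the log has the columns `ip_id` and `package`.
toy_log <- data.frame(
  ip_id   = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4),
  package = c("ggplot2", "plyr", "scales",
              "ggplot2", "plyr",
              "zoo", "xts",
              "zoo", "xts", "ggplot2"),
  stringsAsFactors = FALSE
)

# Step 1: keep only the N most downloaded packages (N = 100 in the post)
top_n <- 4
pkg_counts <- sort(table(toy_log$package), decreasing = TRUE)
top_pkgs <- names(pkg_counts)[seq_len(min(top_n, length(pkg_counts)))]
top_log <- toy_log[toy_log$package %in% top_pkgs, ]

# Step 2: indicator matrix, rows = packages, columns = ids,
# 1 if that id downloaded the package at least once, 0 otherwise
counts <- table(top_log$package, top_log$ip_id)
package_ip_id <- unclass(counts > 0) * 1

# Step 3: distance between packages. "binary" is the Jaccard distance:
# among ids that downloaded either package, the share that got only one.
d <- dist(package_ip_id, method = "binary")

# Step 4: hierarchical clustering and the resulting dendrogram
hc <- hclust(d)
dend_package_ip_id <- as.dendrogram(hc)
# plot(dend_package_ip_id)  # uncomment to draw the "family tree"

groups <- cutree(hc, k = 2)
print(groups)
```

In this toy data ggplot2 and plyr end up in one branch and xts and zoo in the other, because each pair was downloaded by mostly the same ids.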

With this analysis, you can locate a package on the list that you often use, and then see which other packages are “related” to it. If you don’t know a related package, consider having a look at it, since other R users are clearly finding the two packages to be “of use”.

Such analysis can (and should!) be extended. For example, we can imagine creating a “suggest a package” feature based on this data, using the packages that you use, the OS that you use, and other parameters. But such coding is beyond the scope of this post.

Here is the “family tree” (dendrogram) of related packages:

To make it easier to navigate, here is a table of the top 100 R packages, with their titles and download counts:

| Rank | Package | Title | Downloads |
|---|---|---|---|
| 1 | plyr | Tools for splitting, applying and combining data | 84049 |
| 2 | digest | Create cryptographic hash digests of R objects | 83192 |
| 3 | ggplot2 | An implementation of the Grammar of Graphics | 82768 |
| 4 | colorspace | Color Space Manipulation | 81901 |
| 5 | stringr | Make it easier to work with strings | 77658 |
| 6 | RColorBrewer | ColorBrewer palettes | 66783 |
| 7 | reshape2 | Flexibly reshape data: a reboot of the reshape package | 64911 |
| 8 | zoo | S3 Infrastructure for Regular and Irregular Time Series (Z’s ordered observations) | 60844 |
| 9 | proto | Prototype object-based programming | 59043 |
| 10 | scales | Scale functions for graphics | 58369 |
| 11 | car | Companion to Applied Regression | 57453 |
| 12 | dichromat | Color Schemes for Dichromats | 56624 |
| 13 | gtable | Arrange grobs in tables | 54431 |
| 14 | munsell | Munsell colour system | 53183 |
| 15 | labeling | Axis Labeling | 51877 |
| 16 | Hmisc | Harrell Miscellaneous | 47836 |
| 17 | rJava | Low-level R to Java interface | 47731 |
| 18 | mvtnorm | Multivariate Normal and t Distributions | 46884 |
| 19 | bitops | Bitwise Operations | 45689 |
| 20 | rgl | 3D visualization device system (OpenGL) | 41001 |
| 21 | foreign | Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, .. | 37849 |
| 22 | XML | Tools for parsing and generating XML within R and S-Plus | 37153 |
| 23 | lattice | Lattice Graphics | 36597 |
| 24 | e1071 | Misc Functions of the Department of Statistics (e1071), TU Wien | 35180 |
| 25 | gtools | Various R programming tools | 35028 |
| 26 | sp | classes and methods for spatial data | 34786 |
| 27 | gdata | Various R programming tools for data manipulation | 34262 |
| 28 | Rcpp | Seamless R and C++ Integration | 33929 |
| 29 | MASS | Support Functions and Datasets for Venables and Ripley’s MASS | 33667 |
| 30 | Matrix | Sparse and Dense Matrix Classes and Methods | 30740 |
| 31 | lmtest | Testing Linear Regression Models | 30319 |
| 32 | survival | Survival Analysis | 30186 |
| 33 | caTools | Tools: moving window statistics, GIF, Base64, ROC AUC, etc | 29945 |
| 34 | multcomp | Simultaneous Inference in General Parametric Models | 29871 |
| 35 | RCurl | General network (HTTP/FTP/…) client interface for R | 28866 |
| 36 | knitr | A general-purpose package for dynamic report generation in R | 28104 |
| 37 | xtable | Export tables to LaTeX or HTML | 28091 |
| 38 | xts | eXtensible Time Series | 28058 |
| 39 | rpart | Recursive Partitioning | 27812 |
| 40 | evaluate | Parsing and evaluation tools that provide more details than the default | 27617 |
| 41 | RODBC | ODBC Database Access | 26131 |
| 42 | quadprog | Functions to solve Quadratic Programming Problems | 25433 |
| 43 | tseries | Time series analysis and computational finance | 25144 |
| 44 | DBI | R Database Interface | 24793 |
| 45 | nlme | Linear and Nonlinear Mixed Effects Models | 24360 |
| 46 | lme4 | Linear mixed-effects models using S4 classes | 24199 |
| 47 | reshape | Flexibly reshape data | 24118 |
| 48 | sandwich | Robust Covariance Matrix Estimators | 24016 |
| 49 | leaps | regression subset selection | 23666 |
| 50 | gplots | Various R programming tools for plotting data | 23251 |
| 51 | abind | Combine multi-dimensional arrays | 22758 |
| 52 | randomForest | Breiman and Cutler’s random forests for classification and regression | 22401 |
| 53 | Rcmdr | R Commander | 22131 |
| 54 | coda | Output analysis and diagnostics for MCMC | 21900 |
| 55 | maps | Draw Geographical Maps | 21550 |
| 56 | igraph | Network analysis and visualization | 21423 |
| 57 | formatR | Format R Code Automatically | 21049 |
| 58 | maptools | Tools for reading and handling spatial objects | 20957 |
| 59 | RSQLite | SQLite interface for R | 19671 |
| 60 | psych | Procedures for Psychological, Psychometric, and Personality Research | 19545 |
| 61 | KernSmooth | Functions for kernel smoothing for Wand & Jones (1995) | 19166 |
| 62 | rgdal | Bindings for the Geospatial Data Abstraction Library | 19064 |
| 63 | RcppArmadillo | Rcpp integration for Armadillo templated linear algebra library | 18899 |
| 64 | effects | Effect Displays for Linear, Generalized Linear, Multinomial-Logit, Proportional-Odds Logit Models and Mixed-Effects Models | 18843 |
| 65 | sem | Structural Equation Models | 18711 |
| 66 | vcd | Visualizing Categorical Data | 18589 |
| 67 | XLConnect | Excel Connector for R | 18230 |
| 68 | markdown | Markdown rendering for R | 18211 |
| 69 | timeSeries | Rmetrics – Financial Time Series Objects | 17932 |
| 70 | timeDate | Rmetrics – Chronological and Calendar Objects | 17838 |
| 71 | RJSONIO | Serialize R objects to JSON, JavaScript Object Notation | 17801 |
| 72 | cluster | Cluster Analysis Extended Rousseeuw et al | 17136 |
| 73 | scatterplot3d | 3D Scatter Plot | 17110 |
| 74 | nnet | Feed-forward Neural Networks and Multinomial Log-Linear Models | 17074 |
| 75 | fBasics | Rmetrics – Markets and Basic Statistics | 16278 |
| 76 | forecast | Forecasting functions for time series and linear models | 15638 |
| 77 | quantreg | Quantile Regression | 15509 |
| 78 | foreach | Foreach looping construct for R | 15405 |
| 79 | chron | Chronological objects which can handle dates and times | 15226 |
| 80 | plotrix | Various plotting functions | 15142 |
| 81 | matrixcalc | Collection of functions for matrix calculations | 15107 |
| 82 | aplpack | Another Plot PACKage: stem.leaf, bagplot, faces, spin3R, and some slider functions | 14654 |
| 83 | strucchange | Testing, Monitoring, and Dating Structural Changes | 14503 |
| 84 | iterators | Iterator construct for R | 14449 |
| 85 | mgcv | Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation | 14186 |
| 86 | kernlab | Kernel-based Machine Learning Lab | 14135 |
| 87 | SparseM | Sparse Linear Algebra | 13921 |
| 88 | tree | Classification and regression trees | 13871 |
| 89 | robustbase | Basic Robust Statistics | 13778 |
| 90 | vegan | Community Ecology Package | 13686 |
| 91 | devtools | Tools to make developing R code easier | 13488 |
| 92 | latticeExtra | Extra Graphical Utilities Based on Lattice | 13253 |
| 93 | modeltools | Tools and Classes for Statistical Models | 13233 |
| 94 | xlsx | Read, write, format Excel 2007 and Excel 97/2000/XP/2003 files | 13097 |
| 95 | slam | Sparse Lightweight Arrays and Matrices | 13060 |
| 96 | TTR | Technical Trading Rules | 12894 |
| 97 | quantmod | Quantitative Financial Modelling Framework | 12892 |
| 98 | relimp | Relative Contribution of Effects in a Regression Model | 12692 |
| 99 | akima | Interpolation of irregularly spaced data | 12680 |
| 100 | memoise | Memoise functions | 12600 |

R code

I hope you found this post useful, and that you will find new ways of using this interesting dataset. Note that there are open questions about how well these numbers represent the “truth” (they cover only one of many CRAN mirrors), but for now they are the most interesting estimate of it that I know of.

    # get the latest installr package:
    if (!require('devtools')) install.packages('devtools'); require('devtools')
    install_github('installr', 'talgalili')
    require(installr)

    # read the data (this will take a LOOOONG time)
    RStudio_CRAN_data_folder <- ...
    ...
    ... > 0)
    mode(package_ip_id) <- "numeric"
    dend_package_ip_id <- ...

P.S.: This post is a follow-up to my discovering, two days ago, how many people use my R package.