Update, September 2018: The R Journal recently published an interesting review of the state of data structures and collections in R: Timothy Barry (2018), “Collections in R: Review and Proposal”, vol 10 no. 1, pages 455-471.

R is the lingua franca of academic statistics. Many papers introducing new statistical methods are accompanied by a package posted on CRAN, R’s repository of useful packages and tools. Undergraduate programs often teach R and use it throughout their courses, as do most graduate programs—most PhD students I know are implementing their work in R as they do their research. R’s popularity is exploding, and even Microsoft has its own R distribution these days.

But R is an unusual language. It was designed by statisticians for statisticians, and it makes some unusual design decisions—for example, there are no scalars. The number 17 is just a vector of length one, and the + operator adds arbitrary vectors elementwise by dispatching to fast C code. On the one hand, this means I can write foo + bar for two large vectors and get an efficient sum; on the other hand, it means 1 + 2 has to go through the same vectorized machinery instead of being compiled down to a single fast addition.
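A quick illustration of this behavior in base R:

```r
# "Scalars" are just vectors of length one
length(17)      # 1
is.vector(17)   # TRUE

# + dispatches to vectorized C code and operates elementwise
foo <- c(1, 2, 3)
bar <- c(10, 20, 30)
foo + bar       # 11 22 33

# Even 1 + 2 is a length-one-vector plus length-one-vector operation
1 + 2           # 3
```

Every arithmetic operation, even on two "scalars", takes the same vectorized path.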

There are other curious features of R, like Ross Ihaka’s famous example of a function in which a variable x is randomly local or global.
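As I recall it, the example looks roughly like this (reconstructed from memory, so the details may differ from Ihaka’s original):

```r
x <- 0  # a global x

f <- function() {
  if (runif(1) > 0.5)
    x <- 10  # this branch creates a *local* x
  x          # local if the branch ran; otherwise the global x
}

f()  # randomly returns 10 or 0
```

Whether x inside f refers to a local or a global variable is decided at runtime by a coin flip.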

Ihaka says “No sensible language would allow this,” and misfeatures like this seriously limit R’s performance by making R code difficult to optimize. (How can an optimizer deal with a variable that is randomly local or global?)

But the performance of a program depends on more than just how quickly the interpreter can execute the instructions. For a program doing vectorized operations on large arrays of numbers, fast vectorized primitives may be all you need—but what about a program that needs different data structures, like sets, graphs, or trees? What about a hash table? Do we have the tools to efficiently store and search data in appropriate data structures?

Lists: powerful but inflexible

Lists are flexible and widely used in R. They’re the basis of S3 classes, they can store heterogeneous and deeply nested data, and the default vectorization machinery (like apply and lapply) produces lists. There’s extended list subsetting syntax with the [ and [[ operators, along with $, and useful features like record types (or structs) can be emulated by simply using a list. But there is a problem with R’s approach to lists: the names may only be strings. In Python (or Racket, or many other languages), on the other hand, I could use any immutable type as a key:

```python
foo = {(1, 2): 7, "bar": "baz"}
foo[(1, 2)]  # 7
```

Any immutable type can be hashed, so there’s no reason it can’t be used as a key in a dictionary. But R only supports strings as list keys. This may seem like a minor niggle, but it turns out that arbitrary keys are amazingly useful for O(1) lookups of things other than strings. They’re also useful for storing sets: collections in which every item appears only once, and which support efficient unions and intersections. I often make use of sets in my own code:

- Finding duplicates in sets of data other than strings.

- Storing sets; for example, my Conway’s Game of Life code (in Racket) keeps track of which cells are live by storing the current set of live cells, in terms of their (x, y) coordinates. (This way, I don’t need to store a matrix of all the grid cells—I can just store the live ones, and hence support an infinitely large grid. This trick came from Chris Genovese.)

But in R, I couldn’t use this trick: there’s no way to look up whether a cell is live without an O(n) search, and I’d have to convert the coordinates to “x,y” strings instead of using more natural c(x, y) vectors. Code that tries to get around these restrictions ends up with convoluted workarounds. The sets package, for example, offers native set data structures in R, but stores them as sorted lists. (Sorting makes it easy to check whether two sets are equal.) For sets containing non-numeric elements, the elements are sorted by their string representation, so every object has to be converted to a string. To perform set intersections, or other operations that would require O(1) access, the list is converted to a hash table by converting all its elements to strings and storing them in an R environment. All these convolutions cost efficiency and add complexity. Of course, the user of the sets package need not worry about the details, but if they care about performance, they’ll notice the cost.
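To make that workaround concrete, here is a minimal sketch of the environment-as-hash-table trick in base R. The helper names are my own, not from the sets package; the point is that O(1) lookup is possible, but only by flattening the coordinates into string keys:

```r
# An environment used as a hash table to store a set of live cells.
# Keys must be strings, so (x, y) coordinates are pasted into "x,y" form.
live <- new.env(hash = TRUE)

cell_key <- function(x, y) paste(x, y, sep = ",")

add_cell <- function(x, y) assign(cell_key(x, y), TRUE, envir = live)
is_live  <- function(x, y) exists(cell_key(x, y), envir = live, inherits = FALSE)

add_cell(3, 4)
is_live(3, 4)  # TRUE  -- an O(1) lookup, but only via string keys
is_live(5, 6)  # FALSE
```

This works, but every lookup pays the cost of building a string, and a natural key like c(3, 4) cannot be used directly.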