I’ve been an R user for a few years now, and the data.table package has been my staple for most of that time. In this post I want to talk about why almost every script and RMarkdown report I write starts with `library(data.table)`.

My memory issues

I started working on my licentiate thesis (the Argentine equivalent of a Master’s degree) around mid-2016. I had been using R for school work and fun for some time, and I knew that I wanted to perform all my analyses in R and write my thesis in RMarkdown. In the end, I did, but in the process I had to learn new tools and also create my own (which materialised in the metR package).

The big problem I encountered early on was how to store and manipulate data. My main sources of data were the outputs of atmospheric models, which are usually stored on regularly spaced grids. The most natural way to store that kind of data is in a multidimensional array, like this:

file <- "~/DATOS/NCEP Reanalysis/air.mon.mean.nc"
subset <- list(level = 1000:800,
               time = c("1979-01-01", "2018-12-01"))
temperature <- metR::ReadNetCDF(file, subset = subset, out = "array")[[1]]
str(temperature)

## num [1:144, 1:73, 1:3, 1:473] -30.5 -30.5 -30.5 -30.5 -30.5 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ lon  : chr [1:144] "0" "2.5" "5" "7.5" ...
##   ..$ lat  : chr [1:73] "90" "87.5" "85" "82.5" ...
##   ..$ level: chr [1:3] "1000" "925" "850"
##   ..$ time : chr [1:473] "1979-01-01" "1979-02-01" "1979-03-01" "1979-04-01" ...

This is very memory-efficient, but it doesn’t play well with a tidy data framework. Subsetting, filtering and operating on groups using arrays is rather awkward (not to mention that dimension names can only be characters!). Furthermore, I had to transform the array into a data frame every time I wanted to plot it with ggplot2. What I needed was something more like this:
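That array-to-data-frame conversion can be done by hand. Here’s a minimal sketch using base R’s as.data.frame.table() on a tiny made-up array standing in for the real temperature data (reshape2::melt() is another common route):

```r
# A tiny stand-in for the temperature array above:
# a named array with character dimnames, just like str() showed.
arr <- array(c(-30.5, -30.2, -29.8, -29.5, -28.9, -28.1),
             dim = c(2, 3),
             dimnames = list(lon = c("0", "2.5"),
                             lat = c("90", "87.5", "85")))

# Flatten it into a long data frame: one column per dimension
# plus a column for the values themselves.
df <- as.data.frame.table(arr, responseName = "air",
                          stringsAsFactors = FALSE)

# The dimnames come out as characters, so coordinates need
# converting back to numbers before plotting.
df$lon <- as.numeric(df$lon)
df$lat <- as.numeric(df$lat)

str(df)
```

Doing this dance for every plot (and holding both representations in memory at once) is exactly the friction described above.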

temperature <- metR::ReadNetCDF(file, subset = subset)
str(temperature)

## Classes 'data.table' and 'data.frame': 14916528 obs. of 5 variables:
##  $ level: num 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
##  $ lat  : num 90 90 90 90 90 90 90 90 90 90 ...
##  $ lon  : num 0 2.5 5 7.5 10 12.5 15 17.5 20 22.5 ...
##  $ air  : num -30.5 -30.5 -30.5 -30.5 -30.5 ...
##  $ time : POSIXct, format: "1979-01-01" "1979-01-01" ...
##  - attr(*, ".internal.selfref")=<externalptr>

The problem is that this representation is much less memory-efficient, and my aging laptop couldn’t handle it. While R would eventually read the data, even the simplest operation would crash my session. This is because R loooves to copy on modify, which is deadly when you’re dealing with data that fits in your memory, but just barely.
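You can watch copy-on-modify happen with base R’s tracemem(), which prints a message every time an object is duplicated (a minimal sketch; it requires an R build with memory profiling enabled, which the standard binaries have):

```r
df <- data.frame(x = rnorm(1e6))

tracemem(df)       # start reporting whenever df is copied

# Even a simple column assignment duplicates the data frame,
# so peak memory use is roughly twice the size of the data.
df$y <- df$x * 2   # tracemem prints a "copying" message here

untracemem(df)     # stop tracing
```

With a data set that already fills most of your RAM, that temporary doubling is what brings the session down.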

Enter data.table and its modify-by-reference functionality. Unlike regular data.frames or tibbles, data.table objects can be modified without copying the entire object! This means you can safely work with objects that take up more than half of your available RAM.
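A minimal sketch of the difference: data.table’s := operator adds or modifies columns in place, and data.table::address() confirms that the object never moves in memory:

```r
library(data.table)

dt <- data.table(x = rnorm(1e6))
before <- address(dt)   # memory address of the object

dt[, y := x * 2]        # add a column by reference: no copy made

identical(before, address(dt))  # TRUE: same object, modified in place
```

Because no copy is ever made, memory use stays at roughly the size of the new column rather than the whole table.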

For this reason I often say that without data.table I wouldn’t have gotten my degree!