Whether used in academia, industry or journalism, working with R involves importing and exporting a lot of data. While the basic functions to read and write files are known to all users, different methods have been developed over the years to optimise this process.

In this article, we’ll have a look at the most efficient ways to read and write permanent files (i.e. in plain-text formats such as CSV), and to save and load binary files, a solution often overlooked by R users but much better suited to regular analysis of a given dataset.

Setting up our benchmark

We’ll be using functions from four different packages (readr, data.table, feather, and fst), and comparing their performance using the microbenchmark package.

install.packages(c("microbenchmark", "readr", "data.table", "feather", "fst"))
library(microbenchmark)
library(readr)
library(data.table)
library(feather)
library(fst)

The dataset we’ll be using as an example contains random data over 20 columns and 500,000 rows, for a reasonable size of 115 MB in CSV format. The 20 variables are a mix of integers, real numbers, dates and strings.

filename <- "dataset.csv"
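The code that produced dataset.csv is not shown in this article. As a rough sketch only, a file with a similar mix of column types could be generated in base R as follows. The column names, value ranges, and the example_data and example_file names are assumptions made for illustration, and n is kept small so that the example runs quickly (the benchmarks below use 500,000 rows).

```r
# Hypothetical generator for a dataset shaped like the one benchmarked here:
# 20 columns mixing integers, real numbers, dates and strings.
set.seed(42)
n <- 1000  # the article's file has 500,000 rows

# Build one block of four columns, one per data type.
make_block <- function(i, n) {
  block <- data.frame(
    int  = sample.int(1e6, n, replace = TRUE),
    num  = rnorm(n),
    date = as.Date("2000-01-01") + sample.int(6000, n, replace = TRUE),
    str  = vapply(
      seq_len(n),
      function(k) paste(sample(letters, 8, replace = TRUE), collapse = ""),
      character(1)
    ),
    stringsAsFactors = FALSE
  )
  names(block) <- paste0(names(block), "_", i)
  block
}

# Five blocks of four columns give 20 columns in total.
example_data <- do.call(cbind, lapply(1:5, make_block, n = n))

# Write to a temporary location rather than overwriting a real file.
example_file <- file.path(tempdir(), "example_dataset.csv")
write.csv(example_data, example_file, row.names = FALSE)
```

Raising n to 500000 should give a file of the same shape as the one used here, although its exact size on disk will depend on the values generated.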

Permanent input/output

For the first part of this analysis, we’ll look at permanent input/output, i.e. reading and writing files in common formats in data science, especially when files are shared between people. In other words, what’s the most efficient way to open a CSV file you received or downloaded; and what’s the most efficient way of outputting your own file to share with somebody else?

Reading a plain-text file

Three functions are available to us:

read.csv from the read.* series of functions in base R;

read_csv from the read_* series of functions in the readr package;

fread from the data.table package.

Let’s use microbenchmark to import our file using those three functions; the microbenchmark() function will execute each expression 10 times and report summary statistics of the elapsed times (minimum, quartiles, mean, median and maximum).

microbenchmark(
  data <- read.csv(filename),
  data <- read_csv(filename),
  data <- fread(filename),
  times = 10, unit = "s"
)
## Unit: seconds
##                        expr       min        lq      mean    median        uq       max neval
##  data <- read.csv(filename) 25.662873 26.829712 27.915110 27.344555 27.681976 35.389826    10
##  data <- read_csv(filename)  2.303900  2.502276  2.971335  3.020295  3.329171  3.618546    10
##     data <- fread(filename)  3.273398  3.467531  3.707234  3.777934  3.938012  4.146957    10

The read_csv and fread functions imported our file in roughly 3 to 4 seconds on average, against 28 seconds for the read.csv function. This improvement is mostly due to the way those two functions identify the type of each column: they guess it from a sample of values.

More generally, read_csv and fread tend to assume that your file is quite “clean”, more than read.csv does: import functions in base R offer a lot of optional arguments to deal with comments, missing values, trailing spaces in strings, etc.

But if the file we’re working with has been generated in a clean way, read_csv and fread should deal with it without any error and guess the correct data types, much faster than read.csv.
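As an aside on type detection, base R can also skip the guessing step when the column types are declared up front through the colClasses argument of read.csv. The sketch below uses a tiny hypothetical file (the column names and values are invented for illustration); how much time this would save on the benchmark file was not measured here.

```r
# Write a small hypothetical CSV file to a temporary location.
typed_file <- tempfile(fileext = ".csv")
small <- data.frame(
  id    = 1:5,
  value = c(0.5, 1.5, 2.5, 3.5, 4.5),
  day   = as.character(as.Date("2018-01-01") + 0:4),
  label = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
write.csv(small, typed_file, row.names = FALSE)

# Declare the class of every column instead of letting read.csv infer it.
typed <- read.csv(
  typed_file,
  colClasses = c(id = "integer", value = "numeric",
                 day = "Date", label = "character")
)

# Inspect the resulting column types.
str(typed)
```

Declaring types this way also guards against a column being read as the wrong type, at the cost of having to know the file layout in advance.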

Writing a plain-text file

We’ll now test the three equivalent functions to write the same file instead of reading it:

write.csv from the write.* series of functions in base R;

write_csv from the write_* series of functions in the readr package;

fwrite from the data.table package (introduced in 2016).

microbenchmark(
  write.csv(data, "baseR_file.csv", row.names = F),
  write_csv(data, "readr_file.csv"),
  fwrite(data, "datatable_file.csv"),
  times = 10, unit = "s"
)
## Unit: seconds
##                                              expr        min         lq       mean     median         uq        max neval
##  write.csv(data, "baseR_file.csv", row.names = F) 13.8066424 13.8248250 13.9118324 13.8776993 13.9269675 14.3241311    10
##                 write_csv(data, "readr_file.csv")  3.6742610  3.7999409  3.8572456  3.8690681  3.8991995  4.0637453    10
##                fwrite(data, "datatable_file.csv")  0.3976728  0.4014872  0.4097876  0.4061506  0.4159007  0.4355469    10

The results are impressive: readr improved our writing time from 14 seconds in base R to 4 seconds with write_csv, but fwrite improved this performance again by a factor of almost 10, writing the file in only 0.4 seconds!

Note that both write_csv and fwrite include an “automatic” mode for quotes: values are only quoted if necessary, for instance if they contain the separator. By contrast, write.csv quotes every character or factor value by default. In datasets with many string columns, the automatic mode can therefore save space compared to the base R process:

##                File Size.MB
##      baseR_file.csv   123.0
##      readr_file.csv   115.0
##  datatable_file.csv   112.5
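The difference in quoting can be seen on a two-row example. This is a minimal sketch with invented data: the base R part runs anywhere, while the fwrite part only runs if data.table is installed and relies on its default quoting behaviour.

```r
# Hypothetical data: one value contains the separator, the other does not.
cities <- data.frame(
  id   = 1:2,
  city = c("Paris", "Washington, DC"),
  stringsAsFactors = FALSE
)

# Base R quotes every string value (and the header) by default.
base_file <- tempfile(fileext = ".csv")
write.csv(cities, base_file, row.names = FALSE)
base_lines <- readLines(base_file)
writeLines(base_lines)
# "id","city"
# 1,"Paris"
# 2,"Washington, DC"

# With data.table, print the file written with default settings to compare.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt_file <- tempfile(fileext = ".csv")
  data.table::fwrite(cities, dt_file)
  writeLines(readLines(dt_file))
}
```

On a file with many string columns, the two extra quote characters per value add up, which is consistent with the size differences reported above.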

Efficient storage for analysis

The optimisations described above are known to most users who deal with plain-text files almost every day. However, many are unaware of the solutions that exist to optimise the frequent loading of the same files for analysis. This is particularly useful for people who work on one (or several) specific datasets for an extended period of time (weeks or even months), and regularly close and open their R session. In this context, importing plain-text files every time can be slow and frustrating, even with optimised functions such as fread.

Fortunately, R offers many ways to store R objects (including data frames) in a binary format, reducing the time needed to load those objects back into the environment later:

One of the better known formats is RDATA, included in base R: it allows the user to save an object or a whole environment into a binary, compressed file, and quickly re-load the objects into memory. Saving and loading a data frame with RDATA thus recreates the exact same data frame, with the same name.

The RDS format also comes in base R and stores the data in the same way as RDATA, but it holds a single object without its name: the object is returned when the file is read, which allows the user to reimport it under a different name.

Finally, the feather and fst packages aim at improving on those formats, by creating even faster saving and loading solutions.
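The naming difference between the two base R formats can be sketched as follows. The scores data frame and the file paths are invented for illustration.

```r
# A small hypothetical data frame to store.
scores <- data.frame(id = 1:3, score = c(9.5, 7.2, 8.8))
original <- scores  # keep a copy to compare against later

rdata_file <- tempfile(fileext = ".rdata")
rds_file   <- tempfile(fileext = ".rds")

# RDATA stores the object together with its name.
save(scores, file = rdata_file)

# RDS stores the object only.
saveRDS(scores, rds_file)

# Remove the object, then bring it back in both ways.
rm(scores)
load(rdata_file)               # recreates an object called `scores`
restored <- readRDS(rds_file)  # the caller chooses the name

# Both round trips should give back the same data.
all.equal(scores, original)
all.equal(restored, original)
```

Because load() silently creates or overwrites objects in the current environment, readRDS() is often the safer choice when the name of the stored object is not known in advance.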

Saving R objects

Let’s now compare all of those solutions.

A couple of notes:

By default the saveRDS() function uses compression, but it’s possible to disable it with compress = FALSE, so we’ll include this possibility as well.

The write_fst() function can take an argument compress = N where N is a value in the range 0 to 100, indicating the amount of compression to use. The default is 50, but we’ll test both extremes (0 and 100).

microbenchmark(
  save(list = "data", file = "RDATA_file.rdata"),
  saveRDS(data, "baseRDS_comp_file.rds", compress = T),
  saveRDS(data, "baseRDS_noncomp_file.rds", compress = F),
  write_feather(data, "feather_file.feather"),
  write_fst(data, "fst_comp0_file.fst", compress = 0),
  write_fst(data, "fst_comp100_file.fst", compress = 100),
  times = 10, unit = "s"
)
## Unit: seconds
##                                                     expr        min         lq       mean     median         uq       max neval
##           save(list = "data", file = "RDATA_file.rdata") 7.96338920 8.01461160 8.05888978 8.04906326 8.10588731 8.1770513    10
##     saveRDS(data, "baseRDS_comp_file.rds", compress = T) 7.88657332 7.96057057 8.11137097 8.05012876 8.08823067 8.7744078    10
##  saveRDS(data, "baseRDS_noncomp_file.rds", compress = F) 0.36618762 0.37159785 0.38247426 0.38253665 0.38849595 0.4012216    10
##              write_feather(data, "feather_file.feather") 0.11037106 0.11065625 0.11337884 0.11259769 0.11543701 0.1191594    10
##      write_fst(data, "fst_comp0_file.fst", compress = 0) 0.08293022 0.08501644 0.08889665 0.08691963 0.09289635 0.1002668    10
##  write_fst(data, "fst_comp100_file.fst", compress = 100) 2.16989012 2.19063069 2.24112161 2.24799947 2.27968396 2.3116543    10

##                       File Size.MB
## 1         RDATA_file.rdata    37.2
## 2    baseRDS_comp_file.rds    37.2
## 3 baseRDS_noncomp_file.rds    70.3
## 5     feather_file.feather    65.1
## 6       fst_comp0_file.fst    65.3
## 7     fst_comp100_file.fst    32.9

Those results clearly reflect the different implementations behind those binary formats:

Among the compressed files, the RDATA and RDS functions in base R create files that are much smaller than their uncompressed equivalent (37 MB against 70 MB), but they are much slower to write (8 seconds). The compress = 100 version of fst is even more compressed (33 MB) and only took 2.2 seconds to write to disk!

When compression isn’t required, all implementations generate a file of around 65-70 MB. saveRDS took about 0.38 seconds, write_feather only 0.11 seconds, and write_fst with compress = 0 only 0.09 seconds.
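The code behind the file size tables is not shown in this article. A minimal sketch of how such a comparison could be made in base R, using file.size() on a compressed and an uncompressed RDS file, is given below; the sample data and file names are invented, and the sizes obtained will differ from those reported above.

```r
# Hypothetical, highly repetitive data so that compression has a visible effect.
sample_data <- data.frame(
  id    = 1:10000,
  group = rep(c("alpha", "beta"), times = 5000),
  stringsAsFactors = FALSE
)

comp_file   <- tempfile(fileext = ".rds")
noncomp_file <- tempfile(fileext = ".rds")

# Same object, saved with and without compression.
saveRDS(sample_data, comp_file, compress = TRUE)
saveRDS(sample_data, noncomp_file, compress = FALSE)

# file.size() returns sizes in bytes; convert them to kilobytes.
sizes <- data.frame(
  File    = c("compressed", "uncompressed"),
  Size.KB = round(file.size(c(comp_file, noncomp_file)) / 1024, 1)
)
print(sizes)
```

The compression ratio depends heavily on the data: random values, like those in the benchmark dataset, compress far less well than the repetitive values used in this sketch.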

Loading R objects

We can then compare the performance of each equivalent reading function:

microbenchmark(
  load("RDATA_file.rdata"),
  readRDS("baseRDS_comp_file.rds"),
  readRDS("baseRDS_noncomp_file.rds"),
  read_feather("feather_file.feather"),
  read_fst("fst_comp0_file.fst"),
  read_fst("fst_comp100_file.fst"),
  times = 10, unit = "s"
)
## Unit: seconds
##                                  expr       min        lq      mean    median        uq       max neval
##              load("RDATA_file.rdata") 0.8775526 0.8892134 0.9321619 0.9253322 0.9480082 1.0236830    10
##      readRDS("baseRDS_comp_file.rds") 0.8749464 0.8875014 0.9349717 0.9151411 0.9591819 1.0452654    10
##   readRDS("baseRDS_noncomp_file.rds") 0.5509024 0.5629992 0.6108047 0.5701881 0.7095914 0.7261663    10
##  read_feather("feather_file.feather") 0.1927875 0.1976274 0.2614764 0.2484733 0.3249248 0.3466373    10
##        read_fst("fst_comp0_file.fst") 0.2129975 0.2145049 0.2475619 0.2207756 0.2458871 0.3549952    10
##      read_fst("fst_comp100_file.fst") 0.2835336 0.2865808 0.3434378 0.3261273 0.3608169 0.4697726    10

Here we see that compression and time-to-load are not entirely correlated: non-compressed files are generally loaded faster (~0.6 seconds for non-compressed RDS, ~0.25 seconds for feather and fst with compress = 0), while compressed versions take longer (~0.93 seconds for RDS and RDATA). However, the fst version with compress = 100 only took ~0.34 seconds to load, which is much faster than the other compressed files, and not that much longer than the uncompressed solutions!

Ultimately, the trade-off must be judged by each user for each situation, while bearing in mind that space is generally cheaper than time: if storing the original CSV file is possible, then storing a smaller binary file alongside should rarely be a problem; and saving significant time on each data import will be much more valuable.

Overall, fst seems to achieve a great balance: write_fst offers flexibility (letting you choose your own compression value anywhere between 0 and 100), while read_fst still loads highly compressed files extremely fast.
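As a quick check that the compression setting only affects size and speed, not content, the following sketch writes the same invented data frame with both extreme values of compress and reads it back. It only runs if the fst package is installed; the readings data frame, file names and helper variable are hypothetical.

```r
# Hypothetical data frame to store.
readings <- data.frame(
  id     = 1:1000,
  value  = seq(0, 1, length.out = 1000),
  sensor = rep(c("north", "south"), times = 500),
  stringsAsFactors = FALSE
)

round_trip_ok <- NA  # stays NA when fst is not available

if (requireNamespace("fst", quietly = TRUE)) {
  fast_file  <- tempfile(fileext = ".fst")
  small_file <- tempfile(fileext = ".fst")

  # compress = 0 favours speed, compress = 100 favours file size.
  fst::write_fst(readings, fast_file, compress = 0)
  fst::write_fst(readings, small_file, compress = 100)

  # Both files should give back the original data.
  round_trip_ok <-
    isTRUE(all.equal(fst::read_fst(fast_file), readings)) &&
    isTRUE(all.equal(fst::read_fst(small_file), readings))

  print(file.size(c(fast_file, small_file)))
}

print(round_trip_ok)
```

Any intermediate value can be used in the same way, so the setting can be tuned per dataset once the saving and loading times have been measured.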

Verdict

Multiple solutions coexist to offer very efficient data input/output and storage in R. However, as of January 2018, the best solutions for most users and files are: