To start working with large data sets in R, the first question is how to load the data for further analysis. Our test data consisted of large CSV files containing numeric matrices, which we needed for subsequent correlation calculations. The options for loading such data into R are:

>> read.csv

>> use fread function of package data.table

>> use read.big.matrix function of package bigmemory
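The three options above can be sketched as follows. This is a minimal, self-contained example: it writes a small temporary CSV (a stand-in for the multi-gigabyte test files) and loads it with each approach. The file path and matrix size are illustrative, not from the original benchmark.

```r
library(data.table)
library(bigmemory)

# Create a small sample CSV so the example is self-contained
# (a placeholder for the large test files).
tmp <- tempfile(fileext = ".csv")
write.csv(matrix(runif(20), nrow = 4), tmp, row.names = FALSE)

# Option 1: base R read.csv
m1 <- as.matrix(read.csv(tmp, header = TRUE))

# Option 2: data.table::fread (fast, multi-threaded parser)
m2 <- as.matrix(fread(tmp))

# Option 3: bigmemory::read.big.matrix (can be file-backed,
# so very large matrices need not fit entirely in RAM)
m3 <- read.big.matrix(tmp, header = TRUE, type = "double")
```

All three yield a 4 x 5 matrix here; they differ mainly in parsing speed and in how much of the data must reside in memory at once.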

To test the load performance of each option, we used the following machine with R (64-bit) installed, loading a matrix with data stored as bits:

vCPU: 4 (High Frequency Intel Xeon E5-2670)

Memory: 15 GB

Storage: SSD

Here is a summary of what we found:



Our Observations –



>> Clearly, the fread function of the data.table package performs best, by far.

>> With 15 GB RAM, the options other than fread could only load files of around 3.5 GB, while fread could load files of around 7 GB. This helps us select the right hardware going forward as we continue the correlation calculations on larger matrices.
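A comparison like the one above can be reproduced with system.time(). The sketch below times read.csv against fread on a generated CSV; the file size here is deliberately small so the example runs anywhere, whereas the original tests used multi-gigabyte files.

```r
library(data.table)

# Generate a modest CSV (a placeholder for the large benchmark files)
tmp <- tempfile(fileext = ".csv")
fwrite(as.data.table(matrix(runif(1e5), ncol = 10)), tmp)

# Elapsed wall-clock seconds for each loader
t_base  <- system.time(read.csv(tmp))["elapsed"]
t_fread <- system.time(fread(tmp))["elapsed"]

cat(sprintf("read.csv: %.3fs  fread: %.3fs\n", t_base, t_fread))
```

On small files the gap is modest; fread's advantage grows with file size because its parser is multi-threaded and memory-maps the input.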

NOTE – We will cover our findings on using R with distributed file systems (DFS) such as HDFS, and loading CSVs from them, in upcoming blogs.