I currently have the need for machine learning tools that can deal with observations of the order of 10 millions in the context of binary classification. That kind of data is a few GBs in size and it fits comfortably nowadays in the RAM of a decent single machine. It is a trivial task for linear models, as there are plenty of open source tools that can train a logistic regression with this amount of data on a single machine in a few seconds, even while using only 1 processor core (many of these tools are single-threaded). Linear models are also the gold standard of large-scale machine learning that can run on clusters, processing very large distributed datasets.

However, many problems would profit from modeling non-linearities. The state of the art in predictive accuracy are machine learning algorithms such as random forests, boosting, neural networks (and more recently deep learning) and (non-linear) support vector machines. In this post we’ll take a look at random forests, perhaps the easiest to train of the above (as there are few “knobs” to set) and often one of the most accurate algorithms. In fact, they are regularly used as a starting point in Kaggle competitions. I’m planning to look at the other methods as well, so more posts will follow.

Random forests have several commonly known implementations in R packages, Python scikit-learn, Weka, H2O, Spark MLLib, Mahout, Revo ScaleR, among others. For the purposes of this post, I am interested in which tools can deal with 10 million observations and train a random forest in a reasonable time (i.e. a few hours at most). Therefore, I’m analyzing the scalability, speed and accuracy of various random forest implementations.

In terms of scalability, I’m studying if the algorithms are able to complete (in decent time) for the given data sizes with given memory (RAM) constraints. Note that the largest dataset in this study is less than 1GB, so scaling out to multiple machines should not be necessary even if the tool can run in a distributed mode, therefore it is not the focus in this study. Actually, some machine learning algorithms perform relatively poorly in the multi-node setting, where communication is over the network rather than via updating shared memory. While we are certainly living in a “big data” craze, more and more voices are recognizing the value of choosing simpler solutions and not just using distributed computing unnecessarily because of the hype. — Speed (in the single node setting) is determined by the computational complexity of the algorithm, but also if the implementation can use multiple processor cores or not. Accuracy is measured by the area under the ROC curve (AUC).

Training datasets of sizes 10K, 100K, 1M, 10M are generated from the well-known airline dataset, using data from years 2005 and 2006. A test set of size 100K is generated from the same dataset using year 2007. The task is to predict whether a flight will be delayed by more than 15 minutes. The tests have been carried out on a Amazon EC2 c3.8xlarge instance (32 cores, 60GB RAM). For some of the models that ran out of memory with the larger data sizes a r3.8xlarge instance (32 cores, 250GB RAM) has been used occasionally.

The models have been trained with default values for the (hyper-)parameters. For each algo/tool and each size n we observe the following: training time, maximum memory usage during training, CPU usage on the cores, and AUC as a measure for predictive accuracy. Times to read the data, pre-process the data, score the test data are also observed but not reported (e.g. they were not the bottleneck).

More details on this setup and the entire study, along with all the code used to get the results, can be found at this Github repository. I’d like to emphasize this is not aimed to be a comprehensive benchmark. It started from my need at work to find a tool that can deal with datasets of similar size and structure. A more comprehensive benchmark would study different datasets of different structure (e.g. dense, sparse, etc.), how results are changing as a function of hyper parameter values, the scaling out to multiple nodes for the distributed systems, larger datasets, and the like.

I also want to emphasize the importance of testing the tools on your data: your specific data structure, data size, etc. It is very common lately (especially in a startup environment) to go with the hype and use a tool primarily because of its social media prowess without trying out less flashy alternatives which may better fit your needs. I have seen too many people falling into the “big data” trap and using these “big data” tools on data sets that don’t really require such an approach (but this way adding extra costs in resources needed, efficiency or usability). Those who may be falling into this trap are often overheard saying, “We are building a Hadoop cluster. Our data is about 100 GB total. Our largest table is a few GBs.” Therefore, please don’t stop your inquiry after reading a few blog posts on the topic or even this benchmark. Instead, think about your tasks and try out several tools to find out what works for your problem.

In our experiments, random forests with 500 trees have been trained in each tool with default hyper-parameter values. The training times and AUC as a function of the dataset size are plotted in the figures below (with more details available on Github).





The R implementation (randomForest package) is slow and inefficient in memory use. It cannot cope by default with a large number of categories, therefore the data had to be one-hot encoded. The implementation uses 1 processor core, but with 2 lines of extra code it is easy to build the trees in parallel using all the cores and combine them at the end. However, it runs out of memory already for n = 1M. I have to emphasize this has nothing to do with R per se, as it is instead the particular (C and Fortran) RF implementation used by the randomForest package that is inefficient. (Note that I still stand by my argument that R is the best data science platform especially when it comes to data munging and visualization.)

The Python (scikit-learn) implementation is faster, more memory efficient and uses all the cores. Variables needed to be one-hot encoded (which is more involved than for R) and for n = 10M doing this exhausted all the memory. Even if using a larger machine with 250GB of memory (and 140GB free for RF after transforming all the data) the Python implementation runs out of memory and crashes for this larger size. The algo finished successfully though when run on the larger box with simple integer encoding (which for some datasets/cases might be actually a good approximation/choice).

The H2O implementation is fast, memory efficient and uses all cores. It deals with categorical variables automatically. It is also more accurate than R/Python, which may be because of dealing properly with the categorical variables, i.e. internally in the algo rather than working from a previously 1-hot encoded dataset (where the link between the dummies belonging to the same original variable is lost).

Spark (MLlib) implementation is somewhat slower, provides the lowest accuracy and it crashes already at n = 1M due to inefficient memory handling. With 250G of RAM it finishes for n = 1M, but runs out of memory for n = 10M. However, as Spark can run on a cluster one can throw in even more RAM by using more nodes. Alternatively, on a single machine, it is possible to train random forests with a smaller number of trees (but then accuracy decreases). I also tried to provide the categorical variables encoded simply as integers and passing the categoricalFeaturesInfo parameter, but that made training much slower. A convenience issue, reading the data is more than one line of code and Spark does not provide a one-hot encoder for the categorical data (therefore I used R for that). Note again the low prediction accuracy vs the other methods. One can improve a bit by increasing the maximum depth of trees (but only to Spark’s limit of 30), but then training slows down further and AUC is still lower than with the other methods. Finding the reason for the lower AUC would need more investigation (the reason might be that predict for Spark decision trees returns 0/1 and not probability scores therefore the random forest prediction is based on voting not probability averaging, or different stopping criteria, or just an algorithm that uses some approximations that hurts accuracy). Update: See some improvements in the comments here, here and here.

I also tried xgboost, a popular library for boosting which is capable to build random forests as well. It is fast, memory efficient and of high accuracy. Note the different shapes of the AUC and runtime vs dataset sizes for H2O and xgboost, however.

In addition to the above, several other random forest implementations have been tested (Weka, Revo ScaleR, Rborist R package, Mahout) but all of them proved slow and/or unable to scale to the larger sizes. While not the focus of this study, there are signs that running the distributed random forests implementations (e.g. H2O) on multiple nodes does not provide the speed benefit one would hope for (because of the high cost of shipping the histograms at each split over the network).

Besides random forests, I also trained various implementations of logistic regression (i.e. linear models). Their AUC as a function of dataset size is plotted below. One can see that the linear models’ accuracy increases only a little from 100K to 1M and it is virtually the same for 1M and 10M. This is because a simple linear structure can be extracted from a smaller dataset and having more data points will not change the classification boundary significantly. On the other hand, more complex models such as random forests can further improve with increasing data size by further adjusting the classification boundary. However, one needs to pay a price in increased computational time for these more complex models. For example, if using H2O it takes 4000 seconds to train a random forest on the largest size and only 5 seconds to train a linear model.

An interesting point to note is that the AUC for the random forest trained on 100K observations is better than the AUC on a linear model trained on 10M observations. Therefore the answer to the question “more data or better algorithms?” often posed nowadays is that it depends (rather than the often argued “more data usually beats better algorithms”).

To summarize, the commonly used R and Python random forest implementations have serious difficulties in dealing with training sets of tens of millions of observations. H2O or xgboost can deal with these datasets on a single machine (using memory and multiple cores efficiently). Most data scientists are using R or Python because of their excellent coverage of data munging, visualization and modeling (e.g. machine learning), their high-level APIs and the environments for interactive/exploratory data analysis they provide. (Note: this author argues that R is a better choice between the two). There is still a heated debate though about what to do when your data or computation is beyond what these two tools can handle. In fact, both R and Python can easily handle datasets of 10 million rows for aggregates, joins and other common data munging (i.e. “small/simple analytics”) operations taking at most a few seconds (e.g. see this benchmark), therefore still supporting a nice interactive workflow. For machine learning (i.e. “big/complex analytics”) on such datasets, with respect to algorithms such as random forests for example, one can use H2O or xgboost right from within R or Python almost seamlessly. Overall, I cannot help to observe that non-linear machine learning on such data sizes looks more like a high-performance computing task rather than a “big data” one.

Essentially, data scientists should ask themselves if their analytics tasks really need “big data” tools before choosing tools that add extra complexity (sometimes along with efficiency or reliability issues) and that are still largely missing the multitude of statistical and data manipulation libraries and high-level APIs that make using R or Python so productive. Without taking the time to understand the tradeoffs and make an informed decision, many will spend a large part of their time fighting the tools instead of gaining value and insights from their data.