Data analysis with RStudio is great, apart from R's famously poor performance. What if AWS could save your day without changing your usual workflow?

I have been using R for almost ten years now; I like it, I love it. I have been pushed to switch to Python, but to me RStudio remains an unbeatable, state-of-the-art IDE for data analysis and research as a whole.

While getting working R code is quite straightforward, getting high-performance R code can become a headache. All the usual tricks (matrix calculations, *apply functions, the compiler package, Rcpp) may not bring a sufficient speedup. You may still have to wait minutes (or, for heavy statistical simulations, hours) for your computation to finish: performance becomes a bottleneck. But aren't statistical simulations just different trials of the same thing? What if you ran those trials in parallel and achieved a scalable speedup? What if you did it on AWS?
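To make the first of those tricks concrete, here is a minimal illustration (the exact timings will vary with your machine) of how a vectorized built-in beats an explicit loop:

```r
# Naive loop: accumulate a sum element by element
slow_sum <- function(x) {
  s <- 0
  for (v in x) s <- s + v
  s
}

x <- rnorm(1e6)

# Both compute the same result, but the vectorized built-in
# runs in optimized C code and is far faster
t_loop <- system.time(slow_sum(x))["elapsed"]
t_vec  <- system.time(sum(x))["elapsed"]
```

Vectorization alone often gives an order-of-magnitude speedup, but as the post argues, it is not always enough.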

In this post I want to share my experience of getting a working RStudio on AWS, with your own files and as many CPUs as you need, without any painful or complicated devops tasks, and of distributing loops over those CPUs. All in 5 to 10 minutes.

Let us define a toy example as an illustration. Consider that one has a prior over n data samples, so that the likelihood of the data looks something like this:

n <- 1e7
X <- rnorm(n)
model <- function(x) prod(dnorm(x, mean = X, sd = abs(X)))

A single computation would take:

system.time(model(rnorm(n)))
##    user  system elapsed 
##   1.648   0.054   1.708

so any statistical operation (optimization, evidence estimation, etc.) with this model would probably take minutes or hours, since it requires hundreds or thousands of calls to model. For instance:

system.time(replicate(10, model(rnorm(n))))
##    user  system elapsed 
##  16.501   0.527  17.151

The total run time is directly proportional to the number of calls to model. Distributing these computations over several independent cores would directly reduce the wall-clock time, i.e. the time you have to wait for your computation to finish, and hence directly increase the performance of your code. If you don't want to buy an expensive high-performance multicore laptop, cloud computing is your best bet.

RStudio on an Amazon AWS machine

Thanks to Louis Aslett's Amazon AWS Machine Images (AMIs), this is as easy as a click. For the technical details on how to do the setup yourself, you can follow this tutorial.

Let us stick to the basics:

go to Amazon AWS page and create a free account

click on Services/EC2

select Launch instance

on the left panel, select Community AMI

check out the AMI key on this website (there are several, depending on your region) and paste it. You can also directly click on the link there

search for the key and select the corresponding image

click Review and launch

click Edit security group

select type: HTTP

click Review and launch

proceed without a key pair and launch

click on View instance to go back to the dashboard

From this point we can see that Amazon is starting our machine on the AWS cloud. If you proceeded exactly as above with the free tier option selected, this machine is a single-core one and won't give you any performance improvement. Usually, you would go for a better machine with a spot request. But for now, it is still worth having a look at the settings before paying for bigger (up to 32-core) machines.

When the instance state changes to Running, copy/paste the Public DNS from the Description tab into your browser. It should open a page similar to this one. The default credentials are:

Username: rstudio

Password: ID of the launched instance

Validate and see the magic happen. A Welcome.R file is already open and gives you some useful information about how to change the credentials and how to link your Dropbox to get your files right in the Files panel. It also mentions some interesting but lesser-known RStudio integrations: Python, Julia, and TensorFlow.

From that moment on, you may be driving a high-performance Ferrari from your modest single-core computer. One last thing to get introduced to: distributed loops in R.

Using foreach for running code in parallel

Now that you have a multicore machine, you want to use all of its cores to run as many processes as possible in parallel. For an embarrassingly parallel problem, i.e. for example when you loop through a variable to repeat the same task with different randomness:

# with a basic for loop
for (i in 1:100) rnorm(1)

# with R's standard *apply functions
sapply(1:100, function(i) rnorm(1))
replicate(100, rnorm(1))

you can expect to divide the computing time by the number of cores of your machine. Let’s do it.

For that purpose I will use the foreach package together with doMC. These packages have great vignettes; here I go directly to a single setting well suited for your Amazon AWS cloud machine.

install.packages(c('doMC', 'foreach'))
library(foreach)
library(doMC)
doMC::registerDoMC(cores = detectCores())

We are almost done. A few words about what happened:

the foreach package defines a new foreach loop construct that is able to run in parallel. Parallelism is not required: you will probably find it very useful even for sequential computing

the doMC package actually does the job of running tasks in parallel.

Here I override the default cores parameter to get as many workers as there are cores on your Amazon AWS cloud machine:

detectCores()
## [1] 4
foreach::getDoParWorkers()
## [1] 4

You are now in position to run loops in parallel with a script like this one:

foreach(i = 1:100) %dopar% {
  rnorm(1)
}
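For comparison, foreach also provides a sequential %do% operator, which is handy for debugging a loop before switching it to %dopar%. A minimal sketch, assuming foreach and doMC are installed:

```r
library(foreach)
library(doMC)
registerDoMC(cores = 2)

# Same loop body, sequential vs parallel execution
res_seq <- foreach(i = 1:4, .combine = 'c') %do%    i^2
res_par <- foreach(i = 1:4, .combine = 'c') %dopar% i^2
# Both return c(1, 4, 9, 16)
```

Because %do% and %dopar% share the same syntax, moving from a working sequential loop to a parallel one is a one-token change.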

The foreach documentation is great; don't forget to have a look if you want to go deeper into distributed computing. At least notice that:

the foreach function is like an improved lapply: with %dopar% it does not perform any global assignment (which would hardly make sense across parallel workers anyway)

like with the lapply function, you have to save the output in a variable:

res <- foreach(...) ...

unlike lapply or sapply, you can specify the shape of the output with the .combine parameter:

(res_default <- foreach(i=1:5) %dopar% rnorm(1))
## [[1]]
## [1] -0.7804437
## 
## [[2]]
## [1] 1.389149
## 
## [[3]]
## [1] 0.660726
## 
## [[4]]
## [1] 0.6330952
## 
## [[5]]
## [1] 1.087294

(res_vector <- foreach(i=1:5, .combine = 'c') %dopar% rnorm(1))
## [1]  0.3198138  3.0455799 -0.3342982 -1.0266328 -1.3111736

(res_rbind <- foreach(i=1:5, .combine = 'rbind') %dopar% rnorm(1))
##                   [,1]
## result.1  0.1310123392
## result.2  0.3611695875
## result.3 -0.0006377836
## result.4  0.7869791290
## result.5 -1.5703346777

etc.
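To see the point about global assignment concretely, here is a small demonstration, assuming doMC workers are registered: each worker operates on its own copy of the environment, so side effects inside %dopar% never reach the master session, and results must come back through the loop's return value:

```r
library(foreach)
library(doMC)
registerDoMC(cores = 2)

counter <- 0
res <- foreach(i = 1:10, .combine = '+') %dopar% {
  counter <- counter + 1  # modifies a worker-local copy only
  i                       # this return value is what travels back
}
counter  # still 0: the master's variable is untouched
res      # 55: the sum of 1:10, combined with '+'
```

This is why the "save the output in a variable" advice above matters: the return value is the only channel back from the workers.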

Illustration

So now what about a benchmark on our previous model:

(time_rep <- system.time(
  replicate(10, model(rnorm(n)))
)[3])
## elapsed 
##  18.049 

(time_foreach <- system.time(
  foreach(i=1:10) %dopar% model(rnorm(n))
)[3])
## elapsed 
##  10.741

Here, with 4 cores, we divide the computing time by approximately 1.7 only. This is not quite the speedup we expected: because of the overhead of distributing the computation, you don't simply divide the computation time by the number of cores. This overhead becomes negligible as individual tasks get longer; from my experience, for heavy statistical simulations you barely notice it. The documentation on parallel computing may also help you optimize this, for example using the iterators package.

In any case, you are now set up to save heaps of time with high-performance computing on AWS. Not mentioned here is another real advantage: accessing and running your code from anywhere. So what about switching to RStudio's R Notebooks for your analyses, to generate documents like this one? They let you mix plain markdown with R, Python, or Julia code, all in one file. You can even share data between code chunks of different languages using feather. Amazon AWS also lets you request GPUs. What about starting big data and machine learning projects online with TensorFlow, Keras, and Spark from this remote AWS machine? Stay posted! (and click follow-me just below)

Acknowledgement

If you ever read this blog post, once again a huge thanks to Louis Aslett for providing these AMIs. They saved my life once and got me started with distributed computing.

Do you need data science services for your business? Do you want to apply for a job at Sicara? Feel free to contact me.

Thanks to Tristan Roussel and Adil Baaj.