Introduction

So you’ve created your AWS account and spun up your first Hadoop cluster. You’ve got your key pairs, and maybe you’ve run some PySpark code before, but now you’re looking to run SparkR on an Amazon EMR cluster.



Here are some instructions I put together that helped me:



1. Log onto AWS, go to EMR, and create a cluster. Go to Advanced Options, run through the prompts, and follow the instructions at https://aws.amazon.com/blogs/big-data/running-sparklyr-rstudios-r-interface-to-spark-on-amazon-emr/
2. Click Create Cluster.
3. When the cluster is up, check that SSH access is open (or script it; see the sketch after this list):
   - Go to the EC2 console and click Instances on the left.
   - Select your instance.
   - In the Description tab, locate Security Groups and click the available group link.
   - Click the Edit button on the Inbound tab.
   - Click Add Rule and select SSH for Type, Port Range 22, and Source Anywhere.
4. Connect with PuTTY :)
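If you’d rather script that inbound SSH rule than click through the console, here is a minimal sketch using the third-party paws package (an AWS SDK for R). The security group ID below is a hypothetical placeholder, and this assumes your AWS credentials are already configured:

# A sketch of the console steps above, using the third-party 'paws' package.
# Run install.packages("paws") first; credentials come from your usual AWS config.
library(paws)

ec2 <- ec2()

# Open port 22 (SSH) from anywhere, mirroring Type = SSH, Port Range = 22,
# and Source = Anywhere in the console
ec2$authorize_security_group_ingress(
  GroupId = "sg-0123456789abcdef0",  # hypothetical placeholder: use your master's group
  IpPermissions = list(
    list(
      IpProtocol = "tcp",
      FromPort = 22L,
      ToPort = 22L,
      IpRanges = list(list(CidrIp = "0.0.0.0/0"))
    )
  )
)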

Now go back to your cluster, click “Enable Web Connection”, and follow the instructions. Once you’ve connected to EMR via PuTTY, revisit step 3 of the AWS blog post linked above, which tells you to go to http://<EMR master instance>:8787. You must do this through Firefox after FoxyProxy has been installed and configured properly. Now you can log into RStudio via Firefox with “hadoop” as both the username and password. Before running any SparkR code, run the following commands:

# Set the path for the R libraries you would like to use.
# You may need to modify this if you have custom R libraries.
.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))

# Set the SPARK_HOME environment variable to the location of Spark on EMR
Sys.setenv(SPARK_HOME = '/usr/lib/spark')

# Load the SparkR library into R
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# Initiate a Spark session and identify where the master is located.
# "local" is used here because the RStudio server was installed on the master node.
sc <- sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))

# In Spark 2.x, sparkR.session() also initializes the SQL functionality,
# so there is no need to create a separate sqlContext
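To sanity-check the session before moving on, here is a minimal sketch that converts a built-in R dataset to a SparkDataFrame and peeks at it:

# Quick smoke test: convert a local R data frame to a SparkDataFrame
df <- as.DataFrame(faithful)

# Inspect the first few rows (this triggers an actual Spark job)
head(df)

# Check the schema that Spark inferred
printSchema(df)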

Now you can run SparkR on the cluster! Here is some example code:

irisDF <- suppressWarnings(createDataFrame(iris))

# Fit a generalized linear model of family "gaussian" with spark.glm
gaussianGLM <- spark.glm(irisDF, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")

summary(gaussianGLM)
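Once the model is fit, you can also score data with predict(), which returns another SparkDataFrame. A minimal sketch reusing the irisDF frame from above:

# Score the training data with the fitted model
predictions <- predict(gaussianGLM, irisDF)

# Bring a few rows back to the driver to compare actuals with predictions
head(select(predictions, "Sepal_Length", "prediction"))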

Here is an excellent video from the 2016 AWS Big Data Meetup in San Francisco that gives a better visual of what I’m talking about:





Hope this post helps people, and happy modeling!! :-D