What is Spark and Why?

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop. In the past, I wrote an intro on how to install Spark on GCE, and since then I've wanted to do a follow-up on the topic with a more real-world example of installing a cluster. Luckily for me, a reader of the blog did the work! So after getting his approval, I wanted to share his script with you.

Installing a Spark Cluster

In order to install the cluster, you just need to:

1. Install gcutil and authenticate your project.

2. Open a terminal and get the git repository with the python script in it.

$ git clone https://github.com/sigmoidanalytics/spark_gce.git

$ cd spark_gce

$ python spark_gce.py



You will need to create a new project in the Google Developers Console before running ‘spark_gce.py’, and make sure to pass all the parameters.

Here is an example:

spark_gce.py project-name slaves slave-type master-type identity-file zone cluster-name

project-name: One of the hardest things in software… choosing good names. Here we want a good name for our project.

slaves: How many machines we will have in the cluster as slaves.

slave-type: Instance type for the slaves. For example: n1-standard-1

master-type: Instance type for the master. Choose something powerful (e.g. n1-standard-1 and above).

identity-file: Identity file to authenticate with. It will be around ~/.ssh/google_compute_engine once you authenticate using gcutil.

zone: The zone where you are going to launch the cluster. For example: us-central1-a

cluster-name: Name for the Spark cluster.
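Putting it all together, a full invocation might look like the sketch below. The project name, cluster size, and cluster name are placeholder values I made up for illustration; substitute your own:

```shell
# Example launch of a Spark cluster on GCE (all values are placeholders):
#   my-gce-project          -> your Google Developers Console project name
#   5                       -> number of slave machines
#   n1-standard-1           -> instance type for the slaves
#   n1-standard-2           -> instance type for the master
#   ~/.ssh/google_compute_engine -> identity file created by gcutil auth
#   us-central1-a           -> zone to launch in
#   spark-cluster           -> name for the Spark cluster
python spark_gce.py my-gce-project 5 n1-standard-1 n1-standard-2 \
    ~/.ssh/google_compute_engine us-central1-a spark-cluster
```

Once the script finishes, the master's web UI address should be printed, and you can SSH into the master to start submitting jobs.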

That’s it.

Misc