The goal here is to show you how to set up your machine to run Spark using notebooks.

My motivation for writing this is that, a while ago, I didn’t know much about any of this, and when I first started learning Spark I found it hard to set things up just to run a simple “hello world” example. I still don’t know that much, but now I feel comfortable enough to share this post, and hopefully I can help someone get started more smoothly than I did in those first days :)

Getting Spark working for Python

The first step is to install the pip packages ‘pyspark’ and ‘jupyter’ on the machine.

sudo pip install pyspark jupyter

After that, we need to set the `SPARK_HOME` environment variable in our “~/.bashrc” to the path where the Spark installation lives. In my case that is `/usr/local/lib/python2.7/dist-packages/pyspark`, so the line looks like this:

export SPARK_HOME=/usr/local/lib/python2.7/dist-packages/pyspark

Also, as a last step, we need to configure two other PySpark environment variables, so we can change the default behavior of the ‘pyspark’ command.

export PYSPARK_DRIVER_PYTHON="jupyter"

export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
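To see what these two variables actually do, here is a small sketch of how the driver command gets assembled. This is a simplified assumption based on the logic in Spark’s `bin/pyspark` launcher script; `resolve_driver_command` is a hypothetical helper for illustration, not part of PySpark:

```python
def resolve_driver_command(env):
    # Simplified sketch of the launcher's behavior: if PYSPARK_DRIVER_PYTHON
    # is set, that program (plus any PYSPARK_DRIVER_PYTHON_OPTS) is used to
    # start the driver instead of the default Python interpreter.
    driver = env.get("PYSPARK_DRIVER_PYTHON",
                     env.get("PYSPARK_PYTHON", "python"))
    opts = env.get("PYSPARK_DRIVER_PYTHON_OPTS", "")
    return (driver + " " + opts).strip()

# Without the exports: a plain Python shell drives the session
print(resolve_driver_command({}))  # -> python

# With the exports above: Jupyter Notebook is launched instead
print(resolve_driver_command({
    "PYSPARK_DRIVER_PYTHON": "jupyter",
    "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
}))  # -> jupyter notebook
```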

That way, when we type ‘pyspark’ on the command line and hit ENTER, Jupyter Notebook will open instead of the Spark shell (just like that).

Getting Spark working for Scala

The easiest way I’ve found is using Spark Notebook, via the Docker approach.

First I tried the Docker image with the most recent version of all the components:

docker pull andypetrella/spark-notebook:0.7.0-scala-2.11.8-spark-2.1.1-hadoop-2.8.0-with-hive

It didn’t work. So I tried another version and then, bang! It worked just fine.

docker pull andypetrella/spark-notebook:0.7.0-scala-2.10.6-spark-2.1.0-hadoop-2.7.2-with-hive

Then, to use the image, you just need to run the following. The `-p 9001:9001` flag exposes the notebook UI on port 9001, and `-v` mounts a local notebooks directory into the container so your work survives container restarts:

docker run -p 9001:9001 -v /home/wesley/notebooks/:/opt/docker/notebooks/host andypetrella/spark-notebook:0.7.0-scala-2.10.6-spark-2.1.0-hadoop-2.7.2-with-hive