When working in a multi-node environment like a Spark/Hadoop cluster, Docker lowers the barrier to entry. By barrier to entry, I mean the need to keep an EMR cluster running while you are still in the development phase. With Docker, you can quickly set up a 4-5 node cluster on a single machine and start coding your Spark job. You can learn what Docker is and why you would use it from these links.

Benefits

You can very easily version control your environment

The barrier to entry for working with clusters (Spark/Hadoop, etc.) drops significantly. You no longer need access to an EMR cluster, which has a cost associated with it.
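As a sketch of the version-control benefit: your entire environment can live in a Dockerfile checked into git alongside your code. The package list below is purely illustrative, not a required setup:

```dockerfile
# Dockerfile - the whole environment, versioned with your code
FROM ubuntu:16.04

# Pin the tools your job needs; changing this file is a reviewable commit
RUN apt-get update && apt-get install -y \
        openjdk-8-jdk \
        python2.7 \
    && rm -rf /var/lib/apt/lists/*
```

Rebuilding from the same commit reproduces the same environment, e.g. `docker build -t my-env:v1 .`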

Installing Docker

Follow this official guide

Manual

For Ubuntu, the quick steps are:

sudo apt-get update
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) \
    stable"
sudo apt-get update
sudo apt-get install docker-ce
sudo docker run hello-world

With Ansible

You can use the following command to have Ansible install Docker for you:

sudo python2.7 -m pip install ansible \
  && sudo ansible-galaxy install --force angstwad.docker_ubuntu \
  && echo '- hosts: all
  roles:
    - angstwad.docker_ubuntu
' > /tmp/docker_ubuntu.yml \
  && sudo ansible-playbook /tmp/docker_ubuntu.yml -c local -i 'localhost,'

Setting up a cluster

Follow this post.

You will be able to run a local Spark cluster with 4 commands.

Quick overview:

mkdir spark_cluster; cd spark_cluster
echo 'version: "2"
services:
  master:
    image: singularities/spark
    command: start-spark master
    hostname: master
    ports:
      - "6066:6066"
      - "7070:7070"
      - "8080:8080"
      - "50070:50070"
  worker:
    image: singularities/spark
    command: start-spark worker master
    environment:
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 2g
    links:
      - master
' > docker-compose.yml
sudo docker-compose up -d
# sudo docker-compose scale worker=2
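Once the containers are up, a quick sanity check looks like this (a sketch: it assumes the compose file above, with the master UI mapped to port 8080 on the host):

```shell
sudo docker-compose ps      # master and worker services should show "Up"
# The Spark master UI should list the registered workers
curl -s http://localhost:8080 | grep -i worker
```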

Extending other images

With Docker you can build on top of someone else’s image. For example, here I will extend the singularities/spark image, make my custom Spark configuration changes, and push the final version to my own Docker Hub repository.
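As an alternative to committing a container interactively, the same extension can be sketched as a Dockerfile. Note that `spark-defaults.conf` and the destination path are assumptions for illustration, not something the base image documents:

```dockerfile
# Dockerfile - extend the base image with custom Spark configuration
FROM singularities/spark

# Overlay my configuration changes (the path is an assumption; adjust it
# to wherever Spark is installed in the base image)
COPY spark-defaults.conf /usr/local/spark/conf/spark-defaults.conf
```

With this, `sudo docker build -t chaudhary/my-repo-name .` followed by `sudo docker push chaudhary/my-repo-name` replaces the manual commit/tag steps, and the change itself is version controlled.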

Pushing your changes to Docker Hub

To create a fork of a base image (singularities/spark), the steps are:

sudo docker run -it singularities/spark   # Run the base image. This will open a shell
# Make your changes to the image in this container
sudo docker login --username chaudhary --password lol
sudo docker commit <container ID from docker ps> chaudhary/my-repo-name   # Commit changes
sudo docker tag <image ID from docker images> chaudhary/my-repo-name      # Tag for pull to work properly
sudo docker push chaudhary/my-repo-name

Now that you have pushed this image, you can start a new container from this image as shown below:

sudo docker run -it chaudhary/my-repo-name

Resources

For more information read the official getting started guide.