This post details how to deploy Spark on a Kubernetes cluster.

Dependencies:

Docker v19.03.8

Minikube v1.12.1

Spark v3.0.0

Hadoop v3.3.0

Contents

Minikube

Minikube is a tool used to run a single-node Kubernetes cluster locally.

Follow the official Install Minikube guide to install it along with a Hypervisor (like VirtualBox or HyperKit), to manage virtual machines, and Kubectl, to deploy and manage apps on Kubernetes.

By default, the Minikube VM is configured to use 1GB of memory and 2 CPU cores. This is not sufficient for Spark jobs, so be sure to increase the memory in your Docker client (for HyperKit) or directly in VirtualBox. Then, when you start Minikube, pass the memory and CPU options to it:

$ minikube start --vm-driver = hyperkit --memory 8192 --cpus 4 or $ minikube start --memory 8192 --cpus 4

Docker

Next, let's build a custom Docker image for Spark 3.0.0, designed for Spark Standalone mode.

Dockerfile:

# base image FROM java:openjdk-8-jdk # define spark and hadoop versions ENV SPARK_VERSION = 3 .0.0 ENV HADOOP_VERSION = 3 .3.0 # download and install hadoop RUN mkdir -p /opt && \ cd /opt && \ curl http://archive.apache.org/dist/hadoop/common/hadoop- ${ HADOOP_VERSION } /hadoop- ${ HADOOP_VERSION } .tar.gz | \ tar -zx hadoop- ${ HADOOP_VERSION } /lib/native && \ ln -s hadoop- ${ HADOOP_VERSION } hadoop && \ echo Hadoop ${ HADOOP_VERSION } native libraries installed in /opt/hadoop/lib/native # download and install spark RUN mkdir -p /opt && \ cd /opt && \ curl http://archive.apache.org/dist/spark/spark- ${ SPARK_VERSION } /spark- ${ SPARK_VERSION } -bin-hadoop2.7.tgz | \ tar -zx && \ ln -s spark- ${ SPARK_VERSION } -bin-hadoop2.7 spark && \ echo Spark ${ SPARK_VERSION } installed in /opt # add scripts and update spark default config ADD common.sh spark-master spark-worker / ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf ENV PATH $PATH :/opt/spark/bin

You can find the above Dockerfile along with the Spark config file and scripts in the spark-kubernetes repo on GitHub.

Build the image:

$ eval $( minikube docker-env ) $ docker build -f docker/Dockerfile -t spark-hadoop:3.0.0 ./docker

If you don't want to spend the time building the image locally, feel free to use my pre-built Spark image from Docker Hub: mjhea0/spark-hadoop:3.0.0.

View:

$ docker image ls spark-hadoop REPOSITORY TAG IMAGE ID CREATED SIZE spark-hadoop 3 .0.0 8f3ccdadd795 11 minutes ago 911MB

Spark Master

spark-master-deployment.yaml:

kind : Deployment apiVersion : apps/v1 metadata : name : spark-master spec : replicas : 1 selector : matchLabels : component : spark-master template : metadata : labels : component : spark-master spec : containers : - name : spark-master image : spark-hadoop:3.0.0 command : [ "/spark-master" ] ports : - containerPort : 7077 - containerPort : 8080 resources : requests : cpu : 100m

spark-master-service.yaml:

kind : Service apiVersion : v1 metadata : name : spark-master spec : ports : - name : webui port : 8080 targetPort : 8080 - name : spark port : 7077 targetPort : 7077 selector : component : spark-master

Create the Spark master Deployment and start the Services:

$ kubectl create -f ./kubernetes/spark-master-deployment.yaml $ kubectl create -f ./kubernetes/spark-master-service.yaml

Verify:

$ kubectl get deployments NAME READY UP-TO-DATE AVAILABLE AGE spark-master 1 /1 1 1 12s $ kubectl get pods NAME READY STATUS RESTARTS AGE spark-master-6c4469fdb6-rs642 1 /1 Running 0 6s

Spark Workers

spark-worker-deployment.yaml:

kind : Deployment apiVersion : apps/v1 metadata : name : spark-worker spec : replicas : 2 selector : matchLabels : component : spark-worker template : metadata : labels : component : spark-worker spec : containers : - name : spark-worker image : spark-hadoop:3.0.0 command : [ "/spark-worker" ] ports : - containerPort : 8081 resources : requests : cpu : 100m

Create the Spark worker Deployment:

$ kubectl create -f ./kubernetes/spark-worker-deployment.yaml

Verify:

$ kubectl get deployments NAME READY UP-TO-DATE AVAILABLE AGE spark-master 1 /1 1 1 92s spark-worker 2 /2 2 2 6s $ kubectl get pods NAME READY STATUS RESTARTS AGE spark-master-6c4469fdb6-rs642 1 /1 Running 0 114s spark-worker-5d4bdd44db-p2q8v 1 /1 Running 0 28s spark-worker-5d4bdd44db-v4d84 1 /1 Running 0 28s

Ingress

Did you notice that we exposed the Spark web UI on port 8080? In order to access it outside the cluster, let's configure an Ingress object.

minikube-ingress.yaml:

apiVersion : networking.k8s.io/v1beta1 kind : Ingress metadata : name : minikube-ingress annotations : spec : rules : - host : spark-kubernetes http : paths : - path : / backend : serviceName : spark-master servicePort : 8080

Enable the Ingress addon:

$ minikube addons enable ingress

Create the Ingress object:

$ kubectl apply -f ./kubernetes/minikube-ingress.yaml

Next, you need to update your /etc/hosts file to route requests from the host we defined, spark-kubernetes , to the Minikube instance.

Add an entry to /etc/hosts:

$ echo " $( minikube ip ) spark-kubernetes" | sudo tee -a /etc/hosts

Test it out in the browser at http://spark-kubernetes/:

Test

To test, run the PySpark shell from the the master container:

$ kubectl get pods NAME READY STATUS RESTARTS AGE spark-master-6c4469fdb6-rs642 1 /1 Running 0 10m spark-worker-5d4bdd44db-p2q8v 1 /1 Running 0 8m42s spark-worker-5d4bdd44db-v4d84 1 /1 Running 0 8m42s $ kubectl exec spark-master-6c4469fdb6-rs642 -it -- pyspark

Then run the following code after the PySpark prompt appears:

words = 'the quick brown fox jumps over the \ lazy dog the quick brown fox jumps over the lazy dog' sc = SparkContext () seq = words . split () data = sc . parallelize ( seq ) counts = data . map ( lambda word : ( word , 1 )) . reduceByKey ( lambda a , b : a + b ) . collect () dict ( counts ) sc . stop ()

You should see:

{ 'brown' : 2 , 'lazy' : 2 , 'over' : 2 , 'fox' : 2 , 'dog' : 2 , 'quick' : 2 , 'the' : 4 , 'jumps' : 2 }

That's it!

You can find the scripts in the spark-kubernetes repo on GitHub. Cheers!