spark-submit

The bin/spark-submit script in the uncompressed Spark download is what initiates the work of submitting Spark apps. If you take a look at the spark-submit script, you’ll notice that it doesn’t do much aside from setting up the Spark environment; it then calls the bin/spark-class script, which does most of the heavy lifting to actually run apps. If you take a look at the spark-class script, you’ll notice it performs the following:

Discovers Java

Discovers the Spark jar files

Adds the launcher build dir to the classpath

Runs the build command, which launches Spark
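Below is a simplified sketch of that logic, assuming SPARK_HOME points at the uncompressed download; the real bin/spark-class includes more error handling and Scala version detection:

#!/usr/bin/env bash
# Simplified sketch of bin/spark-class (not the verbatim script).

# 1. Discover Java: prefer JAVA_HOME, otherwise use java from the PATH.
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  RUNNER="java"
fi

# 2. Discover the Spark jar files shipped with the distribution.
SPARK_JARS_DIR="${SPARK_HOME}/jars"
LAUNCH_CLASSPATH="${SPARK_JARS_DIR}/*"

# 3. Add the launcher build dir to the classpath (relevant when running
#    from a source build rather than a release tarball).
if [ -n "${SPARK_PREPEND_CLASSES}" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-2.11/classes:${LAUNCH_CLASSPATH}"
fi

# 4. Run the launcher to build the final java command, then execute it;
#    the launcher prints the command as null-delimited arguments.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("${RUNNER}" -Xmx128m -cp "${LAUNCH_CLASSPATH}" org.apache.spark.launcher.Main "$@")
exec "${CMD[@]}"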

If you pass the correct Kubernetes-related arguments to spark-submit, you can direct your Spark app to run in pods on a Kubernetes cluster. Figure 1.0 illustrates what happens when spark-submit is run against a Kubernetes cluster.

Figure 1.0: The spark-submit workflow against a Kubernetes cluster

As you can see in Figure 1.0, there’s a basic workflow when spark-submit is run: the Spark app is submitted to the kube-apiserver and then scheduled by kube-scheduler. When the scheduler deploys the pods, it first creates a driver pod, which in turn calls back to the kube-apiserver to create and schedule the executor pods. The Spark app’s work is then coordinated across the executor pods. Once the app is complete, the executor pods are terminated, while the driver pod remains until it’s manually deleted or garbage collected.
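If you’d like to watch this workflow happen on a live cluster, a couple of standard kubectl commands do the trick (spark-pi-driver below assumes the spark.kubernetes.driver.pod.name value set in the next section’s script):

# Watch the driver pod start, followed by the executor pods it requests:
kubectl get pods -w

# Tail the driver log to follow the app through to completion:
kubectl logs -f spark-pi-driver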

Configuration

In order to configure the use of spark-submit, it’s a good idea to put the command and flags into a shell script for easy reference. Below is an example of a script that calls spark-submit and passes the minimum flags to deliver the SparkPi app over 5 executor instances (pods) to a Kubernetes cluster. In the example below, the master is an AWS ELB; if you’re using Minikube, you would substitute in your cluster’s address, which can easily be obtained by running kubectl cluster-info.

#!/usr/bin/env bash

bin/spark-submit \
  --master k8s://https://ksceptre-master-ext-1377063874.us-west-2.elb.amazonaws.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=gcr.io/cloud-solutions-images/spark:v2.3.0-gcs \
  --conf spark.kubernetes.driver.pod.name=spark-pi-driver \
  local:///opt/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar
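To find the address to put after the k8s:// prefix, ask the cluster itself. On Minikube, for example:

kubectl cluster-info
# Prints something like:
#   Kubernetes master is running at https://192.168.99.100:8443
# which would make the flag: --master k8s://https://192.168.99.100:8443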

SparkPi

If you’re wondering what SparkPi is, there’s a decent explanation of what it does over at Hortonworks. In short, it’s a simple example app that estimates the value of π by randomly sampling points and counting how many fall inside a circle.

If you’re interested, the SparkPi sources can be found in the uncompressed Spark tarball under examples/src/main/{code_type}...
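If you’d like to try SparkPi outside of Kubernetes first, the same example jar can be run locally from the root of the uncompressed tarball (the trailing argument, 100 here, sets the number of sample partitions and is just an illustrative value):

bin/spark-submit \
  --master local[2] \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.11-2.3.0.jar 100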

Dockerfile

When it comes to providing a Spark Docker image, there are many available across the various registries out there (e.g. Docker Hub, GCR). There’s also a Dockerfile within the uncompressed tarball (kubernetes/dockerfiles/spark/Dockerfile) that can be used for creating a Spark image, which you can then upload to a registry and reference in your spark-submit configuration via the spark.kubernetes.container.image property.
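As a sketch, building and pushing an image from that Dockerfile could look like the following; the registry name and tag are placeholders to substitute with your own:

# The Dockerfile expects to be built from the distribution root so it can
# copy in the Spark jars, bin, and entrypoint scripts:
cd /opt/spark-2.3.0-bin-hadoop2.7

docker build -t <your-registry>/spark:v2.3.0 \
  -f kubernetes/dockerfiles/spark/Dockerfile .

# Push it so the Kubernetes nodes can pull it:
docker push <your-registry>/spark:v2.3.0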

Figure 1.1 shows an example of a scratch Kubernetes cluster where the above SparkPi app has been submitted. You can see the Spark service, the Spark driver pod, and the 5 Spark executor pods. In this paradigm, the Spark service exists so that the Spark driver pod can be reached by org.apache.spark.deploy.k8s.submit.Client; the Spark driver pod then creates and instructs the Spark executor pods, as previously illustrated in Figure 1.0.

Figure 1.1: The Spark service, driver pod, and 5 executor pods on the cluster
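To get a view like Figure 1.1 from the command line, the pods can be filtered by role; the spark-role labels below are applied by Spark’s Kubernetes scheduler backend, though it’s worth verifying the label names against your Spark version:

# The driver pod:
kubectl get pods -l spark-role=driver

# The 5 executor pods:
kubectl get pods -l spark-role=executor

# The headless service fronting the driver:
kubectl get services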

As previously stated, the above example which supports Figure 1.1 includes the minimum requirements for getting a Spark app to run on a Kubernetes cluster. That said, there are a large number of configuration options for Spark in general, and then there are those specific to Spark on Kubernetes. Some of the more notable Kubernetes-specific options include the following:

spark.kubernetes.namespace

spark.kubernetes.container.image

spark.kubernetes.driver.label.[LabelName]

spark.kubernetes.executor.label.[LabelName]

spark.kubernetes.node.selector.[labelKey]

spark.kubernetes.driver.limit.cores

spark.kubernetes.executor.limit.cores

spark.kubernetes.authenticate.driver.serviceAccountName

Of course there could be several other configurations for your specific cluster and app, but those are just some important ones that deal mostly with resource allocation and identification; a sketch of how they slot into the submit script follows.
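Here’s the earlier SparkPi script extended with those options; the namespace, label, node selector, core limit, and service account values are all hypothetical placeholders to adapt to your cluster:

bin/spark-submit \
  --master k8s://https://<api-server-address>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=gcr.io/cloud-solutions-images/spark:v2.3.0-gcs \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.driver.label.app=spark-pi \
  --conf spark.kubernetes.executor.label.app=spark-pi \
  --conf spark.kubernetes.node.selector.disktype=ssd \
  --conf spark.kubernetes.driver.limit.cores=1 \
  --conf spark.kubernetes.executor.limit.cores=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar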

It’s also been stated that future versions of Spark will include the ability to configure node affinity, which should be a very welcome addition, allowing node selection through ordered preference. I would assume that once Kubernetes’ affinity/anti-affinity features are out of beta, an update to Spark will be out shortly thereafter.