By Jayden Soni

As a software engineering intern at Civis this summer, I worked on a project supporting our Identity Resolution product. Under the hood, it relies heavily on Apache's distributed processing framework, Spark, to run our proprietary person-matching algorithm. While we maintain a Kubernetes cluster to handle the jobs, scripts, and notebooks run in Civis Platform, our Spark applications currently run directly on Amazon EMR. Spark 2.3, however, made it possible to run Spark applications on Kubernetes instead. To help streamline our data pipeline infrastructure and support the possibility of a more cloud-agnostic future, I set out to explore running Spark on our Kubernetes infrastructure.

As a framework for this exploration and evaluation, I identified four main goals for my implementation:

1. Comparing Configurations: I decided to compare different configuration possibilities for submitting Spark applications to Kubernetes and run an example job locally.
2. Local Cluster: I set up a local cluster with robust logging and persistence capabilities to support the debugging, caching, and checkpointing requirements of our matching processes.
3. Testing: To test these capabilities, I planned to run a part of our matching algorithm on my local cluster and develop a way to programmatically track its state for implementation in a larger pipeline.
4. Deployment and Implementation: Finally, I wanted to deploy and test my implementation on Civis's staging Kubernetes clusters.

These goals shifted with my findings and priorities as I proceeded with the project, with my work occurring in four main phases.

Phase 1: Ramping up with Spark and Kubernetes

I started by familiarizing myself with Kubernetes, Spark, and the relationship between them. After learning the basics of Kubernetes nodes, pods, and many other objects, as well as the Spark driver-executor model, I found that spark-submit was the best way forward for running Spark on Kubernetes. As seen below, this command spins up a Spark driver pod to run a prepackaged application, with configuration flags handling key properties like the number of executor pods requested and their required CPU and memory resources.

Figure 1: Submission mechanism for deploying Spark applications on Kubernetes
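
For a concrete sense of the mechanism, here is a minimal sketch of such a submission, wrapped in Python since the rest of our tooling is scripted that way. The API server address, container image, and resource values are placeholders rather than our actual configuration; the flags shown are standard Spark 2.4 Kubernetes properties.

```python
import subprocess

# A minimal spark-submit invocation against a Kubernetes master.
# The API server address, image, and resource values are placeholders.
cmd = [
    "spark-submit",
    "--master", "k8s://https://<k8s-apiserver-host>:443",
    "--deploy-mode", "cluster",
    "--name", "spark-pi",
    "--class", "org.apache.spark.examples.SparkPi",
    "--conf", "spark.kubernetes.namespace=spark-test",
    "--conf", "spark.kubernetes.container.image=<your-spark-image>",
    "--conf", "spark.executor.instances=3",  # number of executor pods
    "--conf", "spark.executor.cores=1",      # CPU requested per executor
    "--conf", "spark.executor.memory=2g",    # memory requested per executor
    # Path to the prepackaged application inside the container image.
    "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar",
]
subprocess.run(cmd, check=True)
```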

While alternatives like the spark-on-k8s-operator currently offer more functionality, they also present unnecessary complexity and are not under as much active development as spark-submit (which is maintained as a key part of Apache Spark and has already replicated much of the Spark Operator's appealing functionality). My primary complaints with spark-submit are that its command-line paradigm, with a large number of configuration flags, is clunky compared to Kubernetes YAML documents, and that certain tooling, such as configuring local disk through Kubernetes volumes, is not available in version 2.4.0. Solutions to both of these complaints, however, appear to be on the way.

Phase 2: Configuring a Local Minikube Cluster

The setup I configured in Phase 1 could successfully run the simple example Spark applications that ship with the distribution, but to meet the larger demands of Civis's matching algorithm I needed to implement a few more features. The first was basic logging and monitoring. Since most of the computation associated with a Spark application happens in the Kubernetes cluster, the client by default receives little information about why the application succeeded or failed upon termination. More detail can be teased out through the Kubernetes logs associated with the Spark driver pods, which persist after the process terminates while everything else is cleaned up, or by monitoring the process using kubectl or the Kubernetes dashboard. However, many underlying details are only accessible through Spark's own API and UI.
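
As a sketch of pulling those driver logs programmatically rather than with kubectl, the official Kubernetes Python client exposes the same information; the pod and namespace names here are hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()  # read the local kubeconfig (e.g., minikube)
core = client.CoreV1Api()

# Driver pods persist after termination, so their logs stay queryable.
# Both names below are hypothetical; Spark derives the driver pod's
# name from the application name passed to spark-submit.
logs = core.read_namespaced_pod_log(name="spark-pi-driver",
                                    namespace="spark-test")
print(logs)
```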

Figure 2: The minikube dashboard for the spark-test namespace

Figure 3: The Spark History UI for a completed Golden Table run

The Spark UI is spun up on the driver pod in the Kubernetes cluster, so accessing it during a running process is as simple as forwarding that port to the client machine. This resource, however, is cleaned up along with the rest of the Spark resources upon termination of the process. To access data from completed processes, it was necessary to set up a Spark history server on Kubernetes and to configure Spark to log events to an HDFS directory in the same cluster. Using a NodePort service, I made this history server available to my local machine. Other storage options, such as Amazon S3, may be simpler in many cases, but I had already set up an HDFS configuration for my next task: persistence.
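
Concretely, pointing Spark at that HDFS event-log directory is just a couple of extra properties at submission time; a minimal sketch, with a placeholder namenode address:

```python
# Extra --conf flags, appended to the spark-submit command from earlier,
# so a completed application's events can be replayed by the history
# server. The namenode service name, port, and path are placeholders.
history_conf = [
    "--conf", "spark.eventLog.enabled=true",
    "--conf", "spark.eventLog.dir=hdfs://hdfs-namenode:8020/spark-history",
]
# The history server itself reads the same directory via its own
# spark.history.fs.logDirectory setting.
```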

In Spark, persistence is used during a running application both for caching the results of completed computations to avoid recomputation and for checkpointing to assist with fault tolerance and avoid long lineages of RDDs and DataFrames. For caching, it is preferable to store intermediate results in memory, but to avoid OOM errors, Spark has the option to spill any overflow to disk. On Kubernetes, the default Spark configuration uses emptyDir volumes for this task, without many tooling options as of version 2.4.0. The inability to directly manage disk resources as we can with CPU and memory is one limitation of this version of Spark, but the ability to configure hostPath and PersistentVolume mounts for Spark local storage has very recently been added.
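
In PySpark, opting into that spill-to-disk behavior is a one-line choice of storage level. A minimal sketch, with a trivial DataFrame standing in for an expensive intermediate result:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()
df = spark.range(10**7)  # stand-in for an expensive intermediate result

# Keep partitions in memory when they fit, and spill the overflow to
# the executors' local disk (emptyDir volumes by default on Kubernetes).
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # an action, which materializes the cache
```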

Checkpointing is a little more complicated. Spark executor storage (the local disk we can use for caching) is considered unreliable, since it can be lost completely with the failure of that executor node. Local checkpointing is fine for truncating lineages, but to ensure fault tolerance, reliable storage such as HDFS or Amazon S3 is required. I decided to implement HDFS to preserve data locality and avoid any issues with S3's eventual consistency paradigm. Since our processes only required persisting data across the lifetime of a given Spark application, I prioritized the speed, flexibility, and predictability of HDFS's namenode-datanode setup over the long-term availability and durability S3 achieves by replicating data across multiple systems. Setting this all up did, however, require a fair amount of legwork, even with the help of previous work in the area.
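
The checkpointing side looks something like the following in PySpark, with a placeholder HDFS path standing in for whatever the namenode service exposes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-example").getOrCreate()

# Reliable checkpoints are written to HDFS (placeholder path below) and
# survive executor loss, unlike localCheckpoint(), which only truncates
# lineage using the executors' unreliable local storage.
spark.sparkContext.setCheckpointDir("hdfs://hdfs-namenode:8020/checkpoints")

df = spark.range(10**6)
df = df.checkpoint()  # eager by default: writes out the data immediately
```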

Figure 4: Kubernetes Infrastructure of HDFS alongside running Spark applications

The general idea was to establish an HDFS namenode pod on the same node as the Spark driver pod, as well as one datanode pod per node, so files can be stored on the same machines as the executor pods that use them. Datanode-to-executor pod relationships can then be mapped through the namenode using a service. More complex configurations, such as a high-availability setup in which two namenodes (one active and one standing by) are maintained for higher reliability, are also possible for production environments.
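
As a rough sketch of the one-datanode-per-node piece, a Kubernetes DaemonSet is the natural fit; the image, names, and namespace below are illustrative placeholders, not the manifests I actually used.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# A DaemonSet schedules exactly one datanode pod on each cluster node,
# so executors can read and write HDFS blocks on their own machine.
labels = {"app": "hdfs-datanode"}
datanode = client.V1DaemonSet(
    api_version="apps/v1",
    kind="DaemonSet",
    metadata=client.V1ObjectMeta(name="hdfs-datanode", namespace="spark-test"),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[
                client.V1Container(name="datanode",
                                   image="<hdfs-datanode-image>")
            ]),
        ),
    ),
)
apps.create_namespaced_daemon_set(namespace="spark-test", body=datanode)
```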

Finally, to close the loop on the lifecycle of a Spark application, we needed to track its state (running, succeeded, or failed) so we could integrate Spark processes into larger, more complicated pipelines. This was not too difficult to accomplish with the help of the Kubernetes API and its associated Python client, which let us watch running driver pods and query terminated processes for a "succeeded" or "failed" status.
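
A minimal version of that status check might look like the following; the driver pod name is hypothetical, since Spark derives it from the application name.

```python
from kubernetes import client, config, watch

config.load_kube_config()
core = client.CoreV1Api()

# Stream pod updates in the namespace and react when the driver's
# phase reaches a terminal state.
w = watch.Watch()
for event in w.stream(core.list_namespaced_pod, namespace="spark-test"):
    pod = event["object"]
    if pod.metadata.name == "spark-pi-driver":  # hypothetical driver name
        phase = pod.status.phase
        if phase in ("Succeeded", "Failed"):
            print(f"Driver terminated with status: {phase}")
            w.stop()
```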

Phase 3: A Pivot to Writing Code in PySpark

Halfway through my internship, I had accomplished two of my four main goals, having learned the basics of spark-submit and set up a local Kubernetes cluster with more advanced capabilities. Reflecting on my experience, I realized that, without having written much code using PySpark myself, I had gaps in my understanding of Spark's effectiveness as a distributed processing framework. At this point, I was lucky to be able to shift my third and fourth goals to include writing a new piece of our matching algorithms that I could use to test my Spark implementation.

As a team, Identity Resolution’s current focus is a feature expansion to our pipeline called the Golden Table. Our product currently matches records from disparate datasets to identify which ones provide data on the same subjects. With the Golden Table, we will not only provide clusters of records that represent the same information, but also select the best of that information for inclusion in one golden table. In production, this will greatly reduce client workload, as well as make the product even more accessible to non-technical users.

I focused on implementing required back-end selection logic and DataFrame transformations. Through this process, I was able to better grasp the ways in which Spark partitions data for parallel processing, as well as how to write robust code in the context of its lazy-evaluation paradigm. I gained experience with both unit and integration testing, and ended up with a piece of production Spark code to test on my Kubernetes implementation.
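
The actual selection logic is proprietary, but a toy version of the pattern (choosing one best record per matched cluster with a window function) illustrates the kind of DataFrame transformation involved; every column name and value here is invented.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("golden-table-sketch").getOrCreate()

# Toy input: matched records with a cluster id and an invented score.
records = spark.createDataFrame(
    [(1, "a@x.com", 0.9), (1, "a@y.com", 0.7), (2, "b@x.com", 0.8)],
    ["cluster_id", "email", "quality_score"],
)

# Rank records within each cluster and keep only the highest-scoring
# one. Nothing executes until an action is called (lazy evaluation).
w = Window.partitionBy("cluster_id").orderBy(F.desc("quality_score"))
golden = (
    records.withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") == 1)
    .drop("rank")
)
golden.show()  # the action that triggers computation
```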

Phase 4: Running a Golden Table Job on Kubernetes

The final step of my project consisted of testing my Spark on Kubernetes configuration with the Golden Table code I had written. I ran four different spark-submit configurations to separately test each phase of my work. First, I submitted the Spark Pi example that ships with Spark 2.4.0 to my local Kubernetes cluster and made sure it ran as expected. Second, I supplied a path to an application that did not exist and made sure that command failed as expected. Third, I submitted a modified version of the Spark Pi example that required persistence, both for caching to disk and checkpointing to HDFS. Finally, I submitted a Golden Table job with an HDFS path as my output path and examined the outputs to ensure that the results were as expected.
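
Those success and failure checks boil down to exit-code assertions around spark-submit. A minimal sketch of the first two tests, reusing the placeholder flags from earlier:

```python
import subprocess

# Placeholder flags, as in the earlier submission sketch.
SUBMIT_FLAGS = [
    "spark-submit",
    "--master", "k8s://https://<k8s-apiserver-host>:443",
    "--deploy-mode", "cluster",
    "--class", "org.apache.spark.examples.SparkPi",
    "--conf", "spark.kubernetes.namespace=spark-test",
    "--conf", "spark.kubernetes.container.image=<your-spark-image>",
]

def submit(app_path):
    """Submit an application and report whether spark-submit succeeded."""
    return subprocess.run(SUBMIT_FLAGS + [app_path]).returncode == 0

# The bundled Spark Pi example should run to completion...
assert submit("local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar")
# ...while a nonexistent application should make the command fail.
assert not submit("local:///opt/spark/does-not-exist.jar")
```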

In all of my tests, I was also able to access the Spark UI properly while the applications were running, and the Spark History Server after they completed. I found the Kubernetes logs of the Spark driver pods to be especially helpful for diagnosing and fixing the bugs that arose in each test. To wrap up, I handed off to my team working Python scripts that replicate my local environment from scratch and run the same tests, as a starting point for any future research in this area.

Takeaways

In my view, further research is warranted. There is a substantial amount of legwork required to set up a Kubernetes cluster to run Spark applications, but once things are set up, managing these aspects of the cluster is not time-intensive. While certain resource tooling, such as managing the disk resources of Spark pods through volumes, is not currently available, new versions of Spark will soon provide this. All in all, Spark on Kubernetes is an exciting area of development right now, and I expect a lot of growth and polish in the next couple of years.

Beyond this initial proof of concept, however, there are still questions that need answers before this setup could be used in production. I did not have time to set up a similar configuration in a multi-node staging cluster, so it remains to be seen how its performance would compare with managed tools like Amazon EMR, and what the cost savings of such a setup might be. This change would also have an impact on pipeline design for processes that require Spark, potentially helping to streamline them, but also potentially requiring substantial legwork, not to mention the work required to scale my setup to multiple nodes. It is unclear whether those costs outweigh the potential benefits, but the possibilities of Spark on Kubernetes are nonetheless exciting.

This work would not have been possible without the help of my manager Salil Gupta or my mentor Coleman Smith. A big thank you in particular to Salil for his help with Kubernetes, as well as Coleman and Mike Heilman for vastly improving my understanding of Spark. Thanks so much as well to the rest of the IDR team — Virginia Fu, Will Raphaelson, Heidi Fleck, and Skipper Seabold — for their indispensable feedback, expertise, and support throughout my internship.

Inspired by Jayden’s work here at Civis? Applications for our 2020 summer internships are open. Learn more about our internship program and apply here.

Jayden Soni is a senior studying Computer Science at Northwestern University in Evanston, IL. He completed a software engineering internship at Civis Analytics during the summer of 2019.