We present dask-yarn, a library for deploying Dask on Apache YARN. We discuss the status of this tool and possibilities for future work.

These tools let users run Dask for data-engineering tasks on Hadoop clusters, a field traditionally occupied by Spark and other "big data" tools. If you use a Hadoop cluster and have been wanting to try Dask, I hope you'll give dask-yarn a try.

Apache YARN is the resource management and job scheduling framework native to Hadoop clusters. Many data-processing frameworks, such as Spark and Flink, support YARN as a deployment option. As a contributor to Dask, I sought to improve our YARN support. This work resulted in two new libraries, including dask-yarn, described below.

Usage

Dask-Yarn provides an implementation of Dask's Cluster interface. This is the same interface provided by other Dask deployment libraries like dask-kubernetes and dask-jobqueue. It provides methods for starting, stopping, and scaling a Dask cluster on YARN, all from within Python.
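As a sketch of what this interface looks like in practice (the environment filename and resource values here are illustrative assumptions, not requirements):

```python
from dask_yarn import YarnCluster
from dask.distributed import Client

# Start a Dask cluster running in YARN containers. The environment
# archive ("environment.tar.gz") is a hypothetical file created with
# conda-pack; vcores and memory are per-worker resource requests.
cluster = YarnCluster(
    environment="environment.tar.gz",
    worker_vcores=2,
    worker_memory="4GiB",
)

# Scale the cluster to 4 workers and connect a client to it.
cluster.scale(4)
client = Client(cluster)

# ... run Dask computations through `client` ...

# Tear down the YARN application when finished.
cluster.shutdown()
```

Because `YarnCluster` follows the same `Cluster` interface as dask-kubernetes and dask-jobqueue, code written against one deployment backend transfers to the others with only the cluster-construction step changing.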

The library is currently intended to be used from an edge node: the user's driving code (whether a script or an interactive session) runs on the edge node, while Dask's scheduler and workers run in YARN containers. For comparison, this is similar to Spark's client mode for YARN deployment. In the future, a dask-yarn submit command may be developed to allow the driving code to also run in a container as part of the application (similar to spark-submit in cluster mode).

Dask-Yarn is agnostic to how Python environments are managed, but provides special support for distributing Conda environments packaged using conda-pack. Users who prefer an alternative method can provide their own application specification instead. Please see Distributing Python Environments in the dask-yarn documentation for more information.
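To give a feel for the conda-pack workflow, here is a minimal sketch of packaging the active Conda environment into a relocatable archive (the output filename is an assumption; this must be run from inside a Conda environment with conda-pack installed):

```python
# Hypothetical sketch: archive the currently active Conda environment
# with conda-pack, producing a tarball that dask-yarn can ship to
# YARN containers alongside the application.
import conda_pack

# Pack the active environment into a self-contained tarball.
conda_pack.pack(output="environment.tar.gz")
```

The resulting archive is then passed to dask-yarn when constructing the cluster, so that every YARN container unpacks an identical Python environment regardless of what is installed on the worker nodes.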