Knit launches YARN applications from Python.

Knit provides Python methods to quickly launch, monitor, and destroy distributed programs running on a YARN cluster, such as is found in traditional Hadoop environments.

Knit was originally designed to deploy Dask applications on YARN, but can deploy more general, non-Dask, applications as well.

Using conda, Knit can also deploy fully-featured Python environments within YARN containers, sending along useful libraries like NumPy, Pandas, and Scikit-Learn to all of the containers in the YARN cluster.

Install¶ Use pip or conda to install: conda install knit - c conda - forge # or pip install knit

Launch Generic Applications¶ Instantiate knit with valid ResourceManager/Namenode IP/Ports and create a command string to run in all YARN containers from knit import Knit k = Knit ( autodetect = True ) # autodetect IP/Ports for YARN/HADOOP Create a software environment with necessary packages env = k . create_env ( 'my-environment' , packages = [ 'python=3.5' , 'scikit-learn' , 'pandas' ], channels = [ 'conda-forge' ]) # specify anaconda.org channels Run a command within that environment on multiple containers cmd = 'python -c "import sys; print(sys.version_info);"' app_id = k . start ( cmd , num_containers = 2 , env = env ) # start application The start method also takes parameters to define the resource requirements of the application like num_containers= , memory= , virtual_cores= , env= , and files= .

Launch Dask¶ Knit makes it easy to launch Dask on Yarn: from knit.dask_yarn import DaskYARNCluster env = k . create_env ( 'my-environment' , packages = [ 'python=3.6' , 'scikit-learn' , 'pandas' , 'dask' ], channels = [ 'conda-forge' ]) # specify anaconda.org channels cluster = DaskYARNCluster ( env = env ) cluster . start ( nworkers = 10 , memory = 4096 , cpus = 2 ) from dask.distributed import Client client = Client ( cluster ) # Connect local Dask client to cluster If you want to connect to the remote Dask cluster from your local computer as is done in the last line then your local and remote environments should be similar.

Packaged Environments¶ Yarn clusters typically lack strong Python environments with common libraries like NumPy, Pandas, and Scikit Learn. To resolve this, Knit creates redeployable conda environments that can be shipped along with your Yarn job, effectively bringing a fully-featured Python software environment to your Yarn cluster. To achieve this Knit uses redeployable conda environments. Every time you create a new environment Knit will use conda locally to manage and download dependencies, and then will wrap those packages into a self-contained zip file that can be shipped to Yarn applications.