Executing the IPython Notebook on the cluster

Once I've made sure everything functions correctly on my local machine, perhaps on a smaller set of data, I want to run the full, long job on the cluster, going massively parallel. For that I first copy (scp) the notebook to the cluster and then use ssh to connect to it. From here on out, everything happens on the cluster.

Most clusters run some scheduler that queues and distributes your job. The Brown cluster runs SLURM but others are configured in a similar way so it shouldn't be hard to adapt these scripts.

You have to write a script with some special comments that instruct the scheduler how many cores you want (-n), how long the job is supposed to run (--time), etc.:

#!/bin/sh
#SBATCH -J ipython
#SBATCH -n 64
#SBATCH --time=48:00:00

Next, in our script we want to launch the IPython parallel cluster. First, we launch the IPython controller, which distributes the jobs. We also specify that it should accept connections from external IPs (--ip='*') instead of being restricted to localhost:

echo "Launching controller"
ipcontroller --ip='*' &
sleep 10

Next we need to launch our engines. Most schedulers have a command to run a shell command on all of your reserved nodes. SLURM has srun, which will execute the command 64 times (as specified above). You could also use mpirun here.

srun ipengine &
sleep 25

Finally, we want to run our IPyNB on the cluster. For this we'd like to execute an IPyNB just like a Python script. Luckily, there are scripts that do exactly that; for our purposes, checkipnb.py does the job. You'll have to download it and save it to the same directory.
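If you're curious what such a runner does under the hood, here is a simplified sketch (not the actual checkipnb.py; it assumes the modern nbformat-4 JSON layout and skips output capture and error handling). The core idea is just to read the notebook JSON and exec each code cell, in order, in a shared namespace:

```python
import json
import sys

def run_notebook(path):
    """Execute every code cell of a notebook (nbformat-4 JSON) in order."""
    with open(path) as f:
        nb = json.load(f)
    ns = {}  # shared namespace, so later cells see earlier definitions
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            # a cell's source is stored as a list of lines
            exec("".join(cell["source"]), ns)
    return ns

if __name__ == "__main__" and len(sys.argv) > 1:
    run_notebook(sys.argv[1])
```

The real script is more careful (older notebook format versions, cell outputs, tracebacks), but this is the essential mechanism that lets `python checkipnb.py MyIPyNB.ipynb` behave like running a plain script.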

echo "Launching job"
python checkipnb.py $1
echo "Done!"

The $1 refers to the command-line argument of the cluster script, so that I can decide which IPyNB to run when I launch the job. That's the script; now I only need to copy the data and the IPyNB to the cluster and submit my job via:

sbatch submit_ipython_notebook.sh MyIPyNB.ipynb

(Other schedulers have different commands to submit jobs, like qsub.)

What happens next is that SLURM will schedule my job in the queue. Once it gets started with access to 64 cores, the shell script is executed and launches ipcontroller and 64 instances of ipengine, which automatically know how to connect and register with the controller. checkipnb.py then executes each cell in MyIPyNB.ipynb consecutively. The %%px cells bring all the engines into the correct state, while the map_sync() call distributes the work. Once the computation is done, the results are saved to the pickle output file, which I can then copy back (using scp) to my local setup to analyse.
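For readers unfamiliar with the pattern, the parallel part of such a notebook tends to have the shape sketched below. This is a local stand-in, not my actual notebook: process_chunk and results.pkl are hypothetical names, and on the cluster the map call would go through an ipyparallel view's map_sync() (after connecting via ipyparallel's Client) rather than the builtin map used here:

```python
import pickle

def process_chunk(x):
    # hypothetical stand-in for the real per-task computation
    return x ** 2

# On the cluster this would be view.map_sync(process_chunk, tasks), where
# `view` spans all 64 registered engines; the builtin map below just
# illustrates the same call shape locally.
results = list(map(process_chunk, range(8)))

# Save the results to a pickle file so they can be scp'd back for analysis.
with open("results.pkl", "wb") as f:
    pickle.dump(results, f)
```

The nice property is that swapping the builtin map for map_sync is essentially the only change between the local test run and the 64-core cluster run.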

What is nice is that I never have to touch the cluster script again. I simply write an IPyNB that computes my analyses in parallel and if the computations become too big I can just copy them to the cluster and launch them there with minimal overhead.

The full script is as follows (note that I'm running a local Anaconda install; this really helps as you normally don't have admin access on the cluster and don't want to have to keep asking them to install custom packages):