There are a number of resources out there on the internet for setting up IPython Parallel, but they all seemed a bit out of date and/or not quite accurate. The IPython/Jupyter split does not help things either. Eventually I decided to just RTFM and learn to set up a cluster from scratch.

This setup has three parts: Jupyter, ipengine, and ipcontroller. Jupyter runs the notebook component in which you actually write your Python code. ipengine is a remote IPython engine that executes your code, potentially on a different machine than the one you are writing it on. ipcontroller is what each engine connects to; it manages the cluster's resources. I'll go through my recommendations for configuring each one. I ended up on Ubuntu 16, which now uses systemd, so those using upstart will need to adapt my scripts.

Jupyter and ipcontroller will run on a master node, and ipengine will run on your slaves. Here’s the general idea:

Setup your network

It's perfectly acceptable to use the existing default network created for you in GCE. For an extra layer of security, you'll want only your master node to be accessible from the internet. Create a firewall rule that allows traffic on port 8888, but only to instances with a custom tag.
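As a sketch of that firewall rule with the gcloud CLI (the rule name and the jupyter-master tag are my own placeholders; use whatever tag you apply to your master):

```shell
# allow inbound TCP 8888, but only to instances carrying the "jupyter-master" tag
# (rule and tag names here are example placeholders)
gcloud compute firewall-rules create allow-jupyter \
    --network default \
    --allow tcp:8888 \
    --target-tags jupyter-master
```

Everything else stays closed to the outside world, since the engines talk to the controller over the internal network.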

Setting up the master node

Begin by creating a VM for the Jupyter/ipcontroller processes, and remember to apply your tag. I went with an n1-standard-1 for this as I don't use it for any computation. You could certainly run ipengine instances on the master VM as well, but then you'd have to configure those instances slightly differently, which I think creates more work than is necessary. Here's how I get Python installed and ready on this VM.

$ # running as the ubuntu user
$ ssh-keygen -t rsa
$ # save in the standard ~/.ssh/id_rsa with no password
$ wget https://repo.continuum.io/miniconda/Miniconda2-4.1.11-Linux-x86_64.sh
$ chmod +x Miniconda2-4.1.11-Linux-x86_64.sh
$ ./Miniconda2-4.1.11-Linux-x86_64.sh
$ # go through installation, say yes when asked to add to bash
$ source ~/.bashrc
$ conda install jupyter ipyparallel
$ ipython profile create --parallel --profile=default
$ ipcontroller --reuse --ip='*'
$ # ctrl-c to quit

Running ipcontroller generates two connection files that should look similar to this:

$ cat ~/.ipython/profile_default/security/ipcontroller-engine.json
{
  "control": 39048,
  "task": 51689,
  "hb_ping": 53171,
  "mux": 56738,
  "pack": "json",
  "hb_pong": 40851,
  "ssh": "",
  "key": "...",
  "registration": 54503,
  "interface": "tcp://*",
  "iopub": 60933,
  "signature_scheme": "hmac-sha256",
  "unpack": "json",
  "location": "10.128.0.2"
}

$ cat ~/.ipython/profile_default/security/ipcontroller-client.json
{
  "control": 59727,
  "task": 45957,
  "notification": 34435,
  "task_scheme": "leastload",
  "mux": 49227,
  "iopub": 33081,
  "ssh": "",
  "key": "...",
  "registration": 54503,
  "interface": "tcp://*",
  "signature_scheme": "hmac-sha256",
  "pack": "json",
  "unpack": "json",
  "location": "10.128.0.2"
}

Eventually we'll copy these files to our slaves; they give the slaves the information they need to connect to the master. To make this more reliable, I'd recommend changing the location field from an IP address to the hostname of your master instance. Next we need to initialize and configure Jupyter; follow the official instructions for running a public notebook server. After Jupyter is set up, the next step is to create systemd config files to run it and ipcontroller at startup.
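That location change can be scripted; a minimal sketch, assuming your master's hostname is cluster-master (a placeholder, substitute your own):

```shell
# rewrite the "location" field in both connection files from an IP
# to the master's hostname ("cluster-master" is a placeholder)
sed -i 's/"location": "[0-9.]*"/"location": "cluster-master"/' \
    ~/.ipython/profile_default/security/ipcontroller-*.json
```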

$ cat /lib/systemd/system/jupyter.service

[Unit]
Description=Jupyter notebook

[Service]
Type=simple
PIDFile=/var/run/jupyter-notebook.pid
ExecStartPre=/bin/bash -c "/bin/systemctl set-environment PATH=/home/ubuntu/miniconda2/bin:$PATH && /bin/systemctl set-environment HOME=/home/ubuntu"
ExecStart=/home/ubuntu/miniconda2/bin/jupyter notebook --config /home/ubuntu/.jupyter/jupyter_notebook_config.py
WorkingDirectory=/home/ubuntu/notebooks
User=ubuntu
Group=ubuntu
PermissionsStartOnly=true

[Install]
WantedBy=multi-user.target

$ cat /lib/systemd/system/ipcontroller.service

[Unit]
Description=ipcontroller

[Service]
Type=simple
PIDFile=/var/run/ipcontroller.pid
ExecStartPre=/bin/bash -c "/bin/systemctl set-environment PATH=/home/ubuntu/miniconda2/bin:$PATH && /bin/systemctl set-environment HOME=/home/ubuntu"
ExecStart=/home/ubuntu/miniconda2/bin/ipcontroller --reuse
User=ubuntu
Group=ubuntu
PermissionsStartOnly=true

[Install]
WantedBy=multi-user.target

Now get these systemd scripts running:

$ sudo systemctl daemon-reload
$ sudo systemctl enable jupyter.service
$ sudo systemctl enable ipcontroller.service
$ sudo systemctl start jupyter
$ sudo systemctl start ipcontroller
$ # if you'd like to see the log output:
$ systemctl status jupyter.service
$ systemctl status ipcontroller.service

Setting up the slave nodes

Spin up a new VM to act as the slave template. I chose an n1-standard-32 for this purpose. First, follow the same steps as on the master to get Jupyter and ipyparallel installed, although you don't need to run ipcontroller. Next, add the contents of the master's public key, located in ~/.ssh/id_rsa.pub, to ~/.ssh/authorized_keys on the slave. This lets you perform a passwordless ssh login from the master to the slave, which is very handy down the road. Then scp the contents of ~/.ipython/profile_default/security from the master to the same location on the slave, so the slave's ipengine instances know how to connect to the master's ipcontroller. Finally, set up the ipengine systemd script. You'll need to install GNU Parallel as well. Please note that I've hardcoded the number of instances to 32; set this to the number of cores on your machine.
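Those copy steps might look like this when run from the master, assuming a slave named cluster-1 (a placeholder hostname):

```shell
# append the master's public key to the slave's authorized_keys
# ("cluster-1" is a placeholder; if password auth is disabled on your
# image, paste the key into ~/.ssh/authorized_keys by hand instead)
ssh-copy-id -i ~/.ssh/id_rsa.pub ubuntu@cluster-1

# copy the controller connection files over
scp ~/.ipython/profile_default/security/ipcontroller-*.json \
    ubuntu@cluster-1:~/.ipython/profile_default/security/
```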

$ sudo apt-get install parallel

$ cat /lib/systemd/system/ipengine.service

[Unit]
Description=ipengine

[Service]
Type=simple
PIDFile=/var/run/ipengine.pid
ExecStartPre=/bin/bash -c "/bin/systemctl set-environment PATH=/home/ubuntu/miniconda2/bin:$PATH && /bin/systemctl set-environment HOME=/home/ubuntu"
ExecStart=/bin/bash -c "/usr/bin/parallel -n0 -j 32 /home/ubuntu/miniconda2/bin/ipengine ::: {1..32}"
User=ubuntu
Group=ubuntu
PermissionsStartOnly=true

[Install]
WantedBy=multi-user.target

And run them:

$ sudo systemctl daemon-reload
$ sudo systemctl enable ipengine.service
$ sudo systemctl start ipengine
$ # if you'd like to see the log output:
$ systemctl status ipengine.service

That's it! Now your master node will launch Jupyter and ipcontroller at startup, and each slave will launch N instances of ipengine at startup, which will connect to the master. Create an image and launch as many slaves as your heart desires!
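To sanity-check that the engines actually registered, you can ask the client how many it sees, either from a notebook or from the master's shell; this assumes the default profile and a running cluster:

```shell
# count the engines registered with the controller (default profile assumed)
python -c "import ipyparallel as ipp; print(len(ipp.Client().ids))"
```

If the number doesn't match your slave count times cores-per-slave, check the ipengine logs via systemctl status on the slaves.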

Keeping python libraries in sync

Whenever you need to install a Python dependency, you must do it on the master and all slave nodes. I recommend Ansible for this task. Set up a hosts file that looks something like this:

$ cat ~/cluster-hosts

localhost

cluster-1

cluster-2

cluster-3

Now you can install dependencies on the entire cluster with one command:

$ ansible all -i cluster-hosts -a "/home/ubuntu/miniconda2/bin/pip install scikit-learn"
