In the previous post I played with AWS Lightsail, a simple PaaS providing virtual machines. Having years of experience in the administration of HPC systems, I thought about configuring a Slurm [1] cluster based on Lightsail.

Slurm cluster in AWS Lightsail

Automatic node provisioning is already available in Slurm [2]; it is even called "elastic computing", which brings to mind the AWS EC2 service. To configure it we have to provide scripts responsible for node creation and removal; for this, Slurm reuses the suspend/resume code initially designed for the power saving mechanism.

The key lines in slurm.conf [3] are:

NodeName=aws-com[02-05] Procs=1 State=CLOUD
SuspendTime=60
ResumeTimeout=150
SuspendProgram=/usr/sbin/slurm_suspend
ResumeProgram=/usr/sbin/slurm_resume

Basically it means that nodes idle for 60 seconds are going to be suspended by executing the /usr/sbin/slurm_suspend script, while nodes needed for jobs waiting in the queue are going to be created by /usr/sbin/slurm_resume. Both scripts are executed by slurmctld with one argument: the name of the node. Nodes in state CLOUD are not displayed on the list of available nodes; however, Slurm still allocates memory for them in its cluster bitmaps when slurmctld starts, it simply keeps the information about these nodes invisible to the end user.
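One detail worth knowing: the argument handed to these scripts is in Slurm's hostlist format, so it may describe several nodes at once (e.g. aws-com[02-03]). A minimal expansion pattern, shown here as my own illustration rather than a quote from the gists:

# $1 is the node list passed by slurmctld, e.g. "aws-com[02-03]"
for node in $(scontrol show hostnames "$1"); do
    echo "handling ${node}"   # a real script would call the aws cli here
done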

This has an important implication: the number of CPUs on each cloud node has to be configured in slurm.conf if one would like to use the consumable resources select plugin. This is not an issue; it's just important to remember that a node with a given name should always have the same number of CPUs, and that this number should be aligned with slurm.conf.
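For illustration, the corresponding slurm.conf fragment could look roughly like the sketch below; the select plugin parameters are my assumption, and the node definition mirrors the one above:

# consumable resources: individual CPUs are the schedulable unit
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
# every aws-com node must always come up with exactly this CPU count
NodeName=aws-com[02-05] Procs=1 State=CLOUD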

What is characteristic of such a setup is that the IP address of a node may be unknown before the node is created and may change from time to time; the suspend-resume process simply starts a new VM, which may receive a new IP from the cloud provider.

Having slurm+munge configured on my VM, I created a snapshot of it with the aws cli command:

aws lightsail create-instance-snapshot --instance-snapshot-name aws-slurm-snapshot --instance-name aws-com01

For those playing with Lightsail on the free tier, it's important to note that you're going to be charged per GB of the snapshot; the initial grant covers computing resources only.
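If you want to keep an eye on how much snapshot storage you are being billed for, listing the snapshots with their sizes is enough; this is a convenience example of mine, assuming the sizeInGb field returned by the Lightsail API:

aws lightsail get-instance-snapshots --query 'instanceSnapshots[].[name,sizeInGb]' --output table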

Let's take a look at the resume [4] and suspend [5] scripts. In both we simply set the AWS keys as variables; comparing the aws commands to the versions I showed in the previous post, you may notice the --region option, which I added since the region was not specified in the configuration.
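The suspend side is the simpler of the two: every node handed over by slurmctld just has its instance deleted. A rough sketch of that logic, with placeholder credentials and region (the actual script is in gist [5]):

#!/bin/bash
# placeholder credentials; the real script sets these as variables
export AWS_ACCESS_KEY_ID="<access-key>"
export AWS_SECRET_ACCESS_KEY="<secret-key>"

# expand the hostlist passed by slurmctld and delete each instance
for node in $(scontrol show hostnames "$1"); do
    aws lightsail delete-instance --region <region> --instance-name "${node}"
done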

You may also notice a few interesting lines in the resume script, namely:

TMPFILE=$(mktemp)
cat <<END > $TMPFILE
#Start "user defined" user data
hostname ${1}.cinekCluster
END

and --user-data file://${TMPFILE} passed to aws lightsail create-instances-from-snapshot. This file is intended to be a cloud-init config file, so you may ask why it doesn't have an appropriate shebang line. In fact this was the most problematic part for me: I tried the #cloud-config syntax, but it didn't work. Checking the cloud-init scripts on the created node revealed that Lightsail already makes use of cloud-init and that my user data is simply appended at the end of a bash script.
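Putting the pieces together, the resume script presumably ends with a call along these lines; the snapshot name and ${TMPFILE} come from the snippets above, while the region, availability zone and bundle are illustrative placeholders of mine (the real command is in gist [4]):

# create the node from the snapshot, injecting the generated user data
aws lightsail create-instances-from-snapshot --region <region> \
    --availability-zone <zone> \
    --instance-snapshot-name aws-slurm-snapshot \
    --instance-names "${1}" \
    --bundle-id nano_2_0 \
    --user-data file://${TMPFILE}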

The only missing part now is the IP address update before slurmd is started. I did this by adding a few lines to the startall function in the slurm init script [6]:

if [[ ${prog} == slurmd ]]
then
    # configure the node's IP address in slurmctld
    IPADDRESS=$(ip a s eth0 | grep "inet " | awk '{split($2,a,"/");print(a[1])}')
    scontrol update nodename=$(hostname -s) nodeaddr=$IPADDRESS nodehostname=$(hostname -s)
fi
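After the node comes back up, a quick sanity check (my own suggestion, not part of the original scripts) is to confirm on the controller that the new address was registered:

# run on the controller node; should print the freshly assigned private IP
scontrol show node aws-com02 | grep -i nodeaddr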

The big lesson learned came to me by mistake: while writing this article I accidentally published my AWS keys in a gist (a copy-paste from the console). I received an email notification about it from Amazon within a few seconds. My Lightsail limits were reduced to 0, so no one who had seen my gist was able to create instances on my account, and SSH access to my running VM was locked. Once I created new keys and removed the compromised pair, everything was restored. Impressive!

[1] https://slurm.schedmd.com/

[2] https://slurm.schedmd.com/elastic_computing.html

[3] https://gist.github.com/cinek810/069344914df08491390027623086f10f

[4] https://gist.github.com/cinek810/6e2d068a8501f40bb120b1adbf173ad8

[5] https://gist.github.com/cinek810/4dd6da644a71a7425d2d1bef6d8578e3

[6] https://gist.github.com/cinek810/d4bfa2d660c51ba0bb17bd0204fd7606