** Guide Updated **

This guide is an evolution of the original guide. Unless the Kerrighed Team comes up with a substantially different version, this is the only update I will ever make to this guide, as the steps are essentially the same for all svn versions I have tested.



In this version:

-- Added changes for the latest Kerrighed svn 5586

-- Fixed some steps to make them more readable and error-free.

-- Added simple MPI example to see how your program interacts with the cluster.

-- Added troubleshooting section for some situations in which the nodes do not receive the image from the controller.

Thank you all for your previous comments and emails.

Rodrigo Sarpi

------------
| internet |
------------
      |
   router1
      |
      v
+-------------------------------------------------------+
| eth1 --- controller: 192.168.1.106 (given by router1) |
| eth0 --- controller: 10.11.12.1    (manually set)     |
+-------------------------------------------------------+
      |
   router2
    |   |
    |   +--> eth0 -- node1: 10.11.12.101 (static IP address)
    |
    v
  eth0 -- node2: 10.11.12.102 (static IP address)

--------------------------------------------------------

Debian Lenny with default kernel 2.6.26-2-686

All steps done as root on the controller

==

Step 1:

-- dhcp server will provide ip addresses to the nodes.

-- tftpd-hpa will deliver the image to the nodes

-- portmap converts RPC (Remote Procedure Call) program numbers into port numbers. NFS uses it to make RPC calls.

-- syslinux is a boot loader for Linux which simplifies first-time installs

-- nfs will be used to export directory structures to the nodes

When installing these packages accept the default settings presented for dhcp3 and TFTP.

#apt-get install dhcp3-server tftpd-hpa portmap syslinux nfs-kernel-server nfs-common

These packages are for MPI (see under TESTING below). You can install them on the controller to compile your MPI programs, then move the binaries to any of the nodes and start them from there; or you can create, compile, and execute your MPI programs on any of the nodes. Whichever option you choose, the node needs these packages to execute your MPI code:

#apt-get install openmpi-bin openmpi-common libopenmpi1 libopenmpi-dev

==

Step 2:

Identify ethernet interfaces which will be used by the dhcp server.

For this setup, "eth0" is the network card feeding the nodes of the internal network.

#nano /etc/default/dhcp3-server

INTERFACES="eth0"

==

Step 3:

General configuration for the DHCP server.

Make a backup of the original configuration file in case you want to use it as a reference later on.

#cp -p /etc/dhcp3/dhcpd.conf /etc/dhcp3/dhcpd.conf.bkp

#nano /etc/dhcp3/dhcpd.conf

# General options

option dhcp-max-message-size 2048;

use-host-decl-names on;

deny unknown-clients;

deny bootp;

# DNS settings

option domain-name "nibiru_system"; # any name will do

option domain-name-servers 10.11.12.1; # server’s IP address: dhcp and tftp

# network

subnet 10.11.12.0 netmask 255.255.255.0 {

option routers 10.11.12.1; # server IP as above.

option broadcast-address 10.11.12.255; # broadcast address

}

# ip addresses for nodes

group {

filename "pxelinux.0"; # PXE bootloader in /var/lib/tftpboot

option root-path "10.11.12.1:/nfsroot/kerrighed"; # bootable system

#the other laptop

host node1 {

fixed-address 10.11.12.101; # first node

hardware ethernet 00:0B:DB:1B:E3:89;

}

#desktop

host node2 {

fixed-address 10.11.12.102;

hardware ethernet 00:16:76:C1:F7:D4;

}

server-name "nibiru_headnode"; # Any name will do

next-server 10.11.12.1; # Server IP where the image is. For this network it's the same machine

}
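The broadcast-address in the subnet declaration above is just the network address with all host bits set to one. A quick sanity check in the shell, using the addresses from this subnet:

```shell
# Broadcast address = network address OR-ed with the inverted netmask,
# computed per octet (255 - m is the 8-bit bitwise NOT of m).
net="10.11.12.0"; mask="255.255.255.0"
bcast=$(echo "$net $mask" | {
    IFS='. ' read n1 n2 n3 n4 m1 m2 m3 m4
    printf '%d.%d.%d.%d' \
        $(( n1 | (255 - m1) )) $(( n2 | (255 - m2) )) \
        $(( n3 | (255 - m3) )) $(( n4 | (255 - m4) ))
})
echo "$bcast"
# prints 10.11.12.255
```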

==

Step 4:

Configure the TFTP server.

#nano /etc/default/tftpd-hpa

RUN_DAEMON="yes"

OPTIONS="-l -s /var/lib/tftpboot"

==

Step 5:

Configure inetd for TFTP server.

nano /etc/inetd.conf

tftp dgram udp wait root /usr/sbin/in.tftpd /usr/sbin/in.tftpd -s /var/lib/tftpboot

==

Step 6:

This directory will hold the image for the nodes to boot from.

#mkdir /var/lib/tftpboot/pxelinux.cfg

==

Step 7:

Copy PXE bootloader to the TFTP server.

#cp -p /usr/lib/syslinux/pxelinux.0 /var/lib/tftpboot/

==

Step 8:

Fallback configuration. If the TFTP client cannot find a PXE bootloader configuration

for a specific node, it will fall back to this one.

#nano /var/lib/tftpboot/pxelinux.cfg/default

LABEL linux

KERNEL vmlinuz-2.6.20-krg

APPEND console=tty1 root=/dev/nfs nfsroot=10.11.12.1:/nfsroot/kerrighed ip=dhcp rw session_id=1

==

Step 9:

This step is optional but recommended.

In /var/lib/tftpboot/pxelinux.cfg create separate files for *each* node.

The filename should be the node's IP address written as eight uppercase HEX digits, two per octet.

Example: 10 --> 0A; 11 --> 0B; 12 --> 0C; 101 --> 65.

So for 10.11.12.101 it should be 0A0B0C65.
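The decimal-to-hex conversion can be done with printf, which prints each octet as two uppercase hex digits:

```shell
# Each octet of 10.11.12.101 as two uppercase hex digits:
printf '%02X%02X%02X%02X\n' 10 11 12 101
# prints 0A0B0C65
```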

#nano /var/lib/tftpboot/pxelinux.cfg/0A0B0C65

LABEL linux

KERNEL vmlinuz-2.6.20-krg

APPEND console=tty1 root=/dev/nfs nfsroot=10.11.12.1:/nfsroot/kerrighed ip=10.11.12.101 rw session_id=1
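With only two nodes this is quick by hand, but the per-node files can also be generated with a small script. A sketch, assuming the node IPs and kernel name used throughout this guide:

```shell
# Write one PXE config file per node into the given directory
# (in this setup: /var/lib/tftpboot/pxelinux.cfg).
gen_pxe_cfg() {
    target=$1
    for ip in 10.11.12.101 10.11.12.102; do
        # Filename: the node's IP address as eight uppercase hex digits.
        name=$(echo "$ip" | { IFS=. read a b c d
                              printf '%02X%02X%02X%02X' "$a" "$b" "$c" "$d"; })
        cat > "$target/$name" <<EOF
LABEL linux
KERNEL vmlinuz-2.6.20-krg
APPEND console=tty1 root=/dev/nfs nfsroot=10.11.12.1:/nfsroot/kerrighed ip=$ip rw session_id=1
EOF
    done
}
# gen_pxe_cfg /var/lib/tftpboot/pxelinux.cfg
```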

==

Step 10:

Future node system. This directory will have the node’s bootable files, etc.

#mkdir -p /nfsroot/kerrighed

==

Step 11:

Tell NFS what to export

#nano /etc/exports

/nfsroot/kerrighed 10.11.12.0/255.255.255.0(rw,no_subtree_check,async,no_root_squash)

==

Step 12:

Tell NFS to export the above file system

#exportfs -avr

==

Step 13:

Create the bootable system.

Some developers reported that they needed the trailing "/" after "kerrighed",

as in: debootstrap --arch i386 lenny /nfsroot/kerrighed/ http://ftp.us.debian.org/debian

#apt-get install debootstrap

debootstrap --arch i386 lenny /nfsroot/kerrighed http://ftp.us.debian.org/debian

You should get this output:

I: Retrieving Release

I: Retrieving Packages

I: Validating Packages

I: Resolving dependencies of required packages...

I: Resolving dependencies of base packages...

I: Checking component main on http://ftp.us.debian.org/debian...

I: Retrieving libacl1

I: Validating libacl1

[..]

I: Configuring tasksel-data...

I: Configuring tasksel...

I: Base system installed successfully.

==

Step 14:

Chroot into the node system to configure Kerrighed.

#chroot /nfsroot/kerrighed

==

Step 15:

Set root password for isolated system

#passwd

Enter new UNIX password: (nibirucluster)

Retype new UNIX password: (nibirucluster)

passwd: password updated successfully

==

Step 16:

Mount a proc filesystem inside the chroot so that tools which need /proc work on the node system.



mount -t proc none /proc



==

Step 17:

You might get Perl related errors when installing packages on to the node. To suppress those errors, type in the console:



nano .profile



export LC_ALL=C

or just copy and paste into the console:



export LC_ALL=C



==

Step 18:

Add basic packages needed by the node to communicate with the controller



nano /etc/apt/sources.list



deb http://ftp.us.debian.org/debian/ lenny main non-free contrib

deb-src http://ftp.us.debian.org/debian/ lenny main non-free contrib

deb http://security.debian.org/ lenny/updates main

deb-src http://security.debian.org/ lenny/updates main



apt-get update

apt-get install automake autoconf libtool pkg-config gawk rsync bzip2 libncurses5 libncurses5-dev wget lsb-release xmlto patchutils xutils-dev build-essential subversion dhcp3-common nfs-common nfsbooted openssh-server



You need these packages on the node to compile and execute your MPI code (see under TESTING below).



apt-get install openmpi-bin openmpi-common libopenmpi1 libopenmpi-dev



libopenmpi-dev may not be required if you only want to execute your code on the node. However, it is needed if you want to compile your program on the node itself.

==

Step 19:

Prepare the mount point for configfs.



mkdir /config



==

Step 20:

Set mount points



nano /etc/fstab



# UNCONFIGURED FSTAB FOR BASE SYSTEM

proc /proc proc defaults 0 0

/dev/nfs / nfs defaults 0 0

configfs /config configfs defaults 0 0

==

Step 21:

Set up host name lookups



nano /etc/hosts



127.0.0.1 localhost

10.11.12.1 nibiru_headnode

10.11.12.101 node1

10.11.12.102 node2

==

Step 22:

Create a symlink to automount the bootable filesystem.



ln -sf /etc/network/if-up.d/mountnfs /etc/rcS.d/S34mountnfs



==

Step 23:

Configure network interfaces



nano /etc/network/interfaces



auto lo

iface lo inet loopback

iface eth0 inet manual

==

Step 24:

Create the user you will be using to connect to the node.



adduser (clusteruser)



Adding user `clusteruser' ...

Adding new group `clusteruser' (1000) ...

Adding new user `clusteruser' (1000) with group `clusteruser' ...

Creating home directory `/home/clusteruser' ...

Copying files from `/etc/skel' ...

Enter new UNIX password: (nodepasswd)

Retype new UNIX password: (nodepasswd)

passwd: password updated successfully

Changing the user information for clusteruser

Enter the new value, or press ENTER for the default

Full Name []:

Room Number []:

Work Phone []:

Home Phone []:

Other []:

Is the information correct? [Y/n] y

==

Step 25:

Get the latest svn version, 5586 as of this writing.



svn checkout svn://scm.gforge.inria.fr/svn/kerrighed/trunk /usr/src/kerrighed -r 5586



[..]

A /usr/src/kerrighed/NEWS

A /usr/src/kerrighed/linux_version.sh

U /usr/src/kerrighed

Checked out revision 5586.

==

Step 26:

Kerrighed is built against Linux 2.6.20 and ignores any other version.



wget -O /usr/src/linux-2.6.20.tar.bz2 http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.20.tar.bz2 && tar jxf /usr/src/linux-2.6.20.tar.bz2 && cd /usr/src/kerrighed && ./autogen.sh && ./configure && cd kernel && make defconfig



==

Step 27:

Make sure these settings are in place. By default, b), c), and d) are enabled, but it wouldn't hurt to double-check. For a), you have to pick the drivers for your nodes' network cards and make sure they are built into the kernel (* not M).

a. Device Drivers -> Network device support --> Ethernet (10 or 100Mbit)

b. File systems -> Network File Systems and enabling NFS file system support,

NFS server support, and Root file system on NFS. Make sure that the NFSv3 options

are also enabled, and again, make sure they are part of the kernel and not loadable

modules (asterisks and not Ms).

c. To enable the scheduler framework, select “Cluster support” --> “Kerrighed

support for global scheduling” --> “Run-time configurable scheduler framework”

(CONFIG_KRG_SCHED_CONFIG). You should also enable the “Compile components needed

to emulate the old hard-coded scheduler” option to mimic the legacy scheduler

(CONFIG_KRG_SCHED_COMPAT). This last option will compile scheduler components

(kernel modules) together with the main Kerrighed module, that can be used to

rebuild the legacy scheduler, as shown below.

d. To let the scheduler framework automatically load components’ modules,

select “Loadable module support” --> “Automatic kernel module loading”

(CONFIG_KMOD). Otherwise, components’ modules must be manually loaded

on each node before components that they provide can be configured.




make menuconfig



==

Step 28:

Kernel compilation with Kerrighed support



cd .. && make kernel && make && make kernel-install && make install && ldconfig



==

Step 29:

Configuring Kerrighed



nano /etc/kerrighed_nodes



session=1 #Value can be 1 -- 254

nbmin=2 #2 nodes starting up with the Kerrighed kernel.

10.11.12.101:1:eth0

10.11.12.102:2:eth0



nano /etc/default/kerrighed



# Start kerrighed cluster

ENABLE=true

#ENABLE=false

# Enable/Disable legacy scheduler behaviour

LEGACY_SCHED=true

#LEGACY_SCHED=false
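The session value in /etc/kerrighed_nodes must match the session_id passed on the kernel command line in Steps 8 and 9 (1 in this guide), or the nodes will not join the same cluster. A small hedged consistency check, with the file paths from this guide passed in as arguments:

```shell
# Succeeds if session= in kerrighed_nodes equals session_id= in the
# PXE config; the filenames are arguments so it can run anywhere.
check_session() {
    krg_conf=$1   # e.g. /nfsroot/kerrighed/etc/kerrighed_nodes
    pxe_conf=$2   # e.g. /var/lib/tftpboot/pxelinux.cfg/default
    a=$(sed -n 's/^session=\([0-9]*\).*/\1/p' "$krg_conf")
    b=$(sed -n 's/.*session_id=\([0-9]*\).*/\1/p' "$pxe_conf")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
# check_session /nfsroot/kerrighed/etc/kerrighed_nodes \
#               /var/lib/tftpboot/pxelinux.cfg/default && echo "session ids match"
```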

==

Step 30:

Exit the chrooted system



exit



==

Step 31:

Outside the chroot, copy the bootable kernel to the TFTP directory.



cp -p /nfsroot/kerrighed/boot/vmlinuz-2.6.20-krg /var/lib/tftpboot/
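If a node later fails to boot this kernel, one thing worth ruling out is a corrupted copy. A small sketch that compares checksums of any two files:

```shell
# Succeeds only when the two files have identical contents.
same_file() {
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}
# same_file /nfsroot/kerrighed/boot/vmlinuz-2.6.20-krg \
#           /var/lib/tftpboot/vmlinuz-2.6.20-krg && echo "copy is intact"
```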



==

Step 32:

Configure the controller's eth0 card.

eth0 will be used by the DHCP server to feed the nodes.



ifconfig eth0 10.11.12.1

/etc/init.d/tftpd-hpa start

/etc/init.d/dhcp3-server start

/etc/init.d/portmap start

/etc/init.d/nfs-kernel-server start



==

Step 33:

Make sure nodes are connected to the router.

From the controller do:



ssh [email protected]

Then from any connected node as "clusteruser":



krgadm nodes



output:

101:online

102:online

Double-check as root from the node:



tail -f /var/log/messages



node1 kernel: Proc initialisation: done

node1 kernel: EPM initialisation: start

node1 kernel: EPM initialisation: done

node1 kernel: Init Kerrighed distributed services: done

node1 kernel: scheduler initialization succeeded!

node1 kernel: Kerrighed... loaded!

These commands are helpful. Run them on the node as the regular user "clusteruser".



krgcapset -d +CAN_MIGRATE

krgcapset -k $$ -d +CAN_MIGRATE

krgcapset -d +USE_REMOTE_MEMORY

krgcapset -k $$ --inheritable-effective +CAN_MIGRATE



To monitor your cluster:



top



(press 1 to toggle the per-CPU view)

Also:



cat /proc/cpuinfo | grep "model name"

cat /proc/meminfo | grep "MemFree"

cat /proc/stat



==

Step 34:

This step is needed so you do not have to enter a password when triggering your MPI programs from the node.

If you do not generate a key, you will have to enter the node[n] password manually in order to migrate the processes.

You may not need to enter a passphrase when generating the key. The assumption is that the controller is secure enough from the outside (no rerouting packets from eth1, the other network card).

Alternatively, if you feel paranoid you may enter a passphrase and then tell ssh-agent to remember it. The passphrase will be remembered for that session only.

After you log on to one of the nodes via ssh:



ssh-keygen -t dsa (don't enter password)

cp /home/clusteruser/.ssh/id_dsa.pub /home/clusteruser/.ssh/authorized_keys



or



ssh-keygen -t dsa (do enter password)

cp /home/clusteruser/.ssh/id_dsa.pub /home/clusteruser/.ssh/authorized_keys

eval `ssh-agent`

ssh-add /home/clusteruser/.ssh/id_dsa (type in password associated with keys)



==

Step 35 TESTING:

A simple 'hello world' program that calls the MPI library.

I will create a config file where MPI can look up which hosts to run jobs on.

I am creating this config file in the home directory of the cluster user "clusteruser", the same account we created earlier. The directory is exported to the nodes, so you can either (a) create the file as your own user from the controller, or (b) log on to any of the nodes you will be triggering your programs from and create the file there using the "clusteruser" account.

In this situation, I opted for Door A

at controller as a regular user --your regular system username:



nano /nfsroot/kerrighed/home/clusteruser/mpi_file.conf



#Contents of mpi_file.conf. I'm listing the nodes of the cluster.

node1

node2
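Open MPI's hostfile format also accepts an optional slots count per line, which caps how many processes it will place on each host. A variant of the same file (the counts are assumptions: one process per single-core node):

```
# mpi_file.conf with explicit slot counts
node1 slots=1
node2 slots=1
```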

--------START CODE---------

/*
hello world

This "hello world" program does not deviate much from any other hello
world program you have seen before. The only difference is that it has
MPI calls.
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char *boxname;
    int rank, processes;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &processes);

    /* processes holds the total number of ranks; not printed here */
    boxname = (char *)calloc(100, sizeof(char));
    gethostname(boxname, 100);

    printf("\nProcess: %i\nMessage: hello cruel world!\nCluster Node: %s\n\n",
           rank, boxname);

    free(boxname);
    MPI_Finalize();
    return 0;
}

--------END CODE---------

On the controller compile your program using the MPI library:



mpicc hello_world.c -o hello_world



Put the MPI program in the user's home directory on one of the nodes.

In this example, I put it in /nfsroot/kerrighed/home/clusteruser:



cp hello_world /nfsroot/kerrighed/home/clusteruser/



Open another shell and ssh into any of the nodes. Here I log on to node1:



ssh [email protected]



mpirun -np 2 --hostfile mpi_file.conf ./hello_world



output:

Process: 1

Message: hello cruel world!

Cluster Node: node2

Process: 0

Message: hello cruel world!

Cluster Node: node1

============

Troubleshooting:

============

"PXE-E32: TFTP open timeout" error. It can be either that your network card is not supported or that you have something blocking the way for the TFTP server to distribute the image.

Try booting your node from CD:



cd /tmp

wget http://kernel.org/pub/software/utils/boot/gpxe/gpxe-1.0.0.tar.bz2

bunzip2 gpxe-1.0.0.tar.bz2

tar xvpf gpxe-1.0.0.tar

cd /tmp/gpxe-1.0.0/src

make bin/gpxe.iso



Then burn gpxe.iso to a CD and boot the client off of it.

If still no joy, try the steps below. It might be that something is blocking the way to the TFTP server.

On the controller:



in.tftpd -l

tail -1 /var/log/syslog



recvfrom: Socket operation on non-socket

cannot bind to local socket: Address already in use

solution: you can use the package rcconf to disable dhcp, portmap, the nfs server, and tftpd-hpa at boot time, then start each server manually when needed.

If the problem persists, try clearing your firewall rules.

(make a backup of the existing rules first: iptables-save > /root/firewall.rules)



iptables -F

iptables -X

iptables -t nat -F

iptables -t nat -X

iptables -t mangle -F

iptables -t mangle -X

iptables -P INPUT ACCEPT

iptables -P FORWARD ACCEPT

iptables -P OUTPUT ACCEPT



[ to restore after you find out what the problem is use iptables-restore < /root/firewall.rules ]

You can also try this:



netstat -anp | grep 69



udp6 0 0 :::69 :::*

note: this output looks suspicious "udp6"?

Connect with any TFTP client from the controller and, in a second shell, run tail -f /var/log/syslog:



tftp 127.0.0.1



tftp> get pxelinux.0

Transfer timed out



tail -f /var/log/syslog



in.tftpd[2881]: received address was not AF_INET, please check your inetd config

inetd[2441]: /usr/sbin/in.tftpd: exit status 0x4c00

note: Check the inetd.conf file and disable IPv6

To disable IPv6 add these lines to /etc/modprobe.d/aliases



alias net-pf-10 off

alias ipv6 off



Also in /etc/hosts put a comment on these lines:



#::1 localhost ip6-localhost ip6-loopback

#fe00::0 ip6-localnet

#ff00::0 ip6-mcastprefix

#ff02::1 ip6-allnodes

#ff02::2 ip6-allrouters

#ff02::3 ip6-allhosts



Reboot and try again from the head node.



tftp 127.0.0.1



tftp> get pxelinux.0

Received 15987 bytes in 0.0 seconds

All ok now, try booting your nodes.

