The term "cluster" is actually not very well defined and can mean different things to different people. According to Webopedia, cluster refers to a group of disk sectors. Most Windows users are probably familiar with lost clusters, something that can be rectified by running the ScanDisk utility. However, at a more advanced level in the computer industry, cluster usually refers to a group of computers connected together so that more computing power, e.g., more MIPS (millions of instructions per second), can be achieved or higher availability (HA) can be obtained.

Beowulf, Super Computer for the "Poor" Approach Most supercomputers in the world are built on the concept of parallel processing--high-speed computing power is achieved by pooling the power of the individual computers. "Deep Blue", the IBM supercomputer that played chess against world champion Garry Kasparov, was a computer cluster consisting of several hundred RS/6000s. In fact, many big-time Hollywood animation companies, such as Pixar and Industrial Light and Magic, use computer clusters extensively for rendering (the process of translating all the information, such as color, movement and physical properties, into a single frame of the picture). In the past, a supercomputer was an expensive deluxe item that only a few universities or research centers could afford. Started at NASA, Beowulf is a project for building clusters from "off-the-shelf" hardware (e.g., Pentium PCs) running Linux at a very low cost. In the last several years, many universities worldwide have set up Beowulf clusters for scientific research or simply to explore the frontier of supercomputer building.

High Availability (HA) Cluster Clusters in this category use various technologies to gain an extra level of reliability for a service. Companies such as Red Hat, TurboLinux and PolyServe have cluster products that allow a group of computers to monitor one another; when a master server (e.g., a web server) goes down, a secondary server takes over its services, much like "disk mirroring" among servers.

Simple Theory Because I do not have access to more than one real (or public) IP address, I set up my two-node cluster in a private network environment with some Linux servers and some Win9x workstations. If you have access to three or more real/public IP addresses, you can certainly set up the Linux cluster with real IP addresses.

In the above network diagram (fig1.gif), the Linux router is the gateway to the Internet, and it has two IP addresses. The real IP, 24.32.114.35, is attached to a network card (eth1) in the Linux router and should be connected to either an ADSL modem or a cable modem for Internet access. The two-node Linux cluster consists of node1 (192.168.1.2) and node2 (192.168.1.3). Depending on your setup, either node1 or node2 can be your primary server, and the other will be your backup server. In this example, I will choose node1 as my primary and node2 as my backup.

Once the cluster is set up, with IP aliasing (read the IP aliasing section of the Linux Mini HOWTO for more detail), the primary server will run with an extra IP address (192.168.1.4). As long as the primary server is up and running, services (e.g., DHCP, DNS, HTTP, FTP, etc.) on node1 can be accessed through either 192.168.1.2 or 192.168.1.4. In fact, IP aliasing is the key concept for setting up this two-node Linux cluster: when node1 (the primary server) goes down, node2 takes over all services from node1 by starting the same IP alias (192.168.1.4) and all subsequent services.

Some services can coexist on node1 and node2 (e.g., FTP, HTTP, Samba, etc.); however, a service such as DHCP can have only one running copy on the same physical segment. Likewise, we can never have two identical IP addresses running on two different nodes in the same network. The underlying principle of a two-node, high-availability cluster is quite simple, and people with some basic shell-programming techniques could probably write a shell script to build one.
We can set up an infinite loop within which the backup server (node2) simply keeps pinging the primary server; if the result is unsuccessful, it starts the floating IP (192.168.1.4) as well as the necessary dæmons (programs running in the background).
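That loop can be sketched in a few lines of shell. This is only an illustration of the idea, not the heartbeat package itself: the monitor function, the three-miss threshold and the POLL_INTERVAL variable are my own inventions for this sketch, and the takeover commands in the comments are examples.

```shell
#!/bin/sh
# A naive failover watchdog -- an illustration of the idea only.
# monitor <check_cmd> <takeover_cmd>: run <check_cmd> repeatedly; after
# three consecutive failures, run <takeover_cmd> once and stop.
monitor() {
    check=$1
    takeover=$2
    misses=0
    while [ "$misses" -lt 3 ]; do
        if $check >/dev/null 2>&1; then
            misses=0                 # primary answered; reset the counter
        else
            misses=$((misses + 1))   # another missed heartbeat
        fi
        sleep "${POLL_INTERVAL:-2}"
    done
    eval "$takeover"
}

# On node2 (as root) you might run something like:
#   monitor "ping -c 1 192.168.1.2" "sh /root/takeover.sh"
# where /root/takeover.sh (a hypothetical script) starts the floating IP
# and the needed daemons:
#   ifconfig eth0:0 192.168.1.4 netmask 255.255.255.0 up
#   /etc/rc.d/init.d/httpd start
```

This is essentially what heartbeat does for you, with proper configuration, logging and authentication on top.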

A Two-Node Linux Cluster HOWTO with "Heartbeat" You need two Pentium-class PCs with a minimum specification of a 100MHz CPU, 32MB RAM, one NIC (network interface card) and a 1GB hard drive. The two PCs need not be identical. In my experiment, I used an AMD K6 350MHz and a Pentium 200 MMX. I chose the AMD as my primary server because it can complete a reboot (you need to do a few reboots for testing) faster than the Pentium 200. With the great support of CFSL (Computers for Schools and Libraries) in Winnipeg, I got some 4GB SCSI hard drives as well as some Adaptec 2940 PCI SCSI controllers. The old and almost obsolete equipment is in good working condition and is perfect for this experiment. node1 AMD K6 350MHz CPU

4GB SCSI hard drive (you can certainly use an IDE hard drive)

128MB RAM

1.44MB floppy drive

24x CD-ROM (not needed after installation)

3COM 905 NIC node2 Pentium 200 MMX

4GB SCSI hard drive

96MB RAM

1.44MB floppy drive

24x CD-ROM

3COM 905 NIC

The Necessary Software Both node1 and node2 must have Linux installed. I chose Red Hat and installed Red Hat 7.2 on node1 and Red Hat 6.2 on node2 (I simply wanted to find out whether we could build a cluster with different versions of Linux on different nodes). Make sure you have installed all the dæmons you want to support. Here are my installation details: Hard disk partitions: 128MB for swap and the rest mounted as / (so you don't need to worry about whether a certain subdirectory has too much or too little space). Installed packages: Apache

FTP

Samba

DNS

dhcpd (server)

Squid

Heartbeat Heartbeat is part of Ultra Monkey (The Linux HA Project), and the RPM can be downloaded from www.UltraMonkey.org. The download is small, and the RPM installation is smooth and simple. However, the documentation for configuration is hard to find and confusing. In fact, that is the reason I decided to write this HOWTO, so that hopefully you can get your cluster set up with fewer problems.

Setting Up the Primary Server (node1) and the Backup Server (node2) It is not the purpose of this article to show you how to install Red Hat; a lot of excellent documentation can be found at either www.linuxdoc.org or www.redhat.com. I will simply include some of the most important configuration files for your reference:

/etc/hosts

127.0.0.1    localhost
192.168.1.1  router
192.168.1.2  node1
192.168.1.3  node2

This file should be the same on both node1 and node2; you may add any other nodes as you see fit. Check the hostname (cat /etc/HOSTNAME) and make sure it returns either node1 or node2. If not, you can use the command uname -n > /etc/HOSTNAME to fix the hostname problem.

ifconfig for node1

eth0      Link encap:Ethernet  HWaddr 00:60:97:9C:52:28
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18617 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14682 errors:0 dropped:0 overruns:0 carrier:0
          collisions:3 txqueuelen:100
          Interrupt:10 Base address:0x6800

eth0:0    Link encap:Ethernet  HWaddr 00:60:97:9C:52:28
          inet addr:192.168.1.4  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:10 Base address:0x6800

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:3924  Metric:1
          RX packets:38 errors:0 dropped:0 overruns:0 frame:0
          TX packets:38 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0

Please notice that eth0:0 shows the IP alias with IP 192.168.1.4.
ifconfig for node2

eth0      Link encap:Ethernet  HWaddr 00:60:08:26:B2:A4
          inet addr:192.168.1.3  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:15673 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17550 errors:0 dropped:0 overruns:0 carrier:0
          collisions:2 txqueuelen:100
          Interrupt:10 Base address:0x6700

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:3924  Metric:1
          RX packets:142 errors:0 dropped:0 overruns:0 frame:0
          TX packets:142 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0

Install the Heartbeat RPM If you are using Internet Explorer on Windows, you might have problems accessing the FTP site (Netscape works much better). I suggest you use either a command-line FTP client or a graphical Windows/X Window System FTP client (e.g., gFTP) to access the Ultra Monkey FTP site (ftp.UltraMonkey.org). Once you log in to the FTP server, go to pub, then UltraMonkey and then the latest version, 1.0.2 (not the beta). The only package is heartbeat-0.4.9-1.um.1.i386.rpm; save it on your Linux box, log in as root and install it with rpm -ivh heartbeat-0.4.9-1.um.1.i386.rpm

Null Modem Cable, Crossover Cable, Second NIC According to the accompanying documentation, you need to install a second NIC in both nodes and connect them with a crossover cable. Besides the second NIC, a null modem cable connecting the serial (COM) ports of the two nodes is mandatory (according to the documentation). I followed the instructions and installed everything. However, as I did more tests on the cluster, I found that the null modem cable, crossover cable and second NIC are optional; they are nice to have but definitely not mandatory.

Configuring Heartbeat is the most important part of the whole installation and must be done correctly to get your cluster working. Moreover, the configuration should be identical on both nodes. There are three configuration files, all stored under /etc/ha.d: ha.cf, haresources and authkeys.

My /etc/ha.d/ha.cf

debugfile /var/log/ha-debug
#
# File to write other messages to
#
logfile /var/log/ha-log
#
# Facility to use for syslog()/logger
#
logfacility local0
#
# keepalive: how many seconds between heartbeats
#
keepalive 2
#
# deadtime: seconds-to-declare-host-dead
#
deadtime 10
udpport 694
#
# What interfaces to heartbeat over?
#
udp eth0
#
node node1
node node2
#
# ------> end of ha.cf

Whatever is not shown above, you can simply leave as it was (all commented out by the #). The last three options are the most important:

udp eth0
node node1
node node2

Unless you have a crossover cable, you should use eth0 (your only NIC) for udp; the two node entries at the end of the file must match exactly what uname -n returns on each node.

My /etc/ha.d/haresources

node1 IPaddr::192.168.1.4 httpd smb dhcpd

This is the only line you need; in the above example, I included httpd, smb and dhcpd.
You may add as many dæmons as you want, provided they have exactly the same spelling as the dæmons under /etc/rc.d/init.d. My /etc/ha.d/authkeys You don't need to add anything to this file, but you do have to issue the command chmod 600 /etc/ha.d/authkeys
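My heartbeat apparently started with authkeys untouched, but some heartbeat builds refuse to start without an authentication directive in this file. If yours complains about authentication in /var/log/ha-log, a minimal /etc/ha.d/authkeys such as the following is the usual fix (crc does no real authentication and is acceptable only on a trusted private network like this one):

```
auth 1
1 crc
```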

Start the Heartbeat Daemon You may start the dæmon with service heartbeat start or /etc/rc.d/init.d/heartbeat start. Once heartbeat is started on both nodes, ifconfig on the primary server will return something like:

ifconfig for node1

eth0      Link encap:Ethernet  HWaddr 00:60:97:9C:52:28
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18617 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14682 errors:0 dropped:0 overruns:0 carrier:0
          collisions:3 txqueuelen:100
          Interrupt:10 Base address:0x6800

eth0:0    Link encap:Ethernet  HWaddr 00:60:97:9C:52:28
          inet addr:192.168.1.4  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:10 Base address:0x6800

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:3924  Metric:1
          RX packets:38 errors:0 dropped:0 overruns:0 frame:0
          TX packets:38 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0

When you see the line eth0:0, heartbeat is working; you can try to access the server at http://192.168.1.4 and check the log file /var/log/ha-log. Also check the log file on node2 (192.168.1.3), and try ps -A | grep dhcpd; you should find no running dhcpd on node2.

Now, the real HA test. Reboot, and then shut down the primary server (node1: 192.168.1.2). Don't just power down the server; make sure you issue reboot or press CTRL-ALT-DEL and wait until everything is shut down properly before you turn off the PC. Within ten seconds, go to node2 and try ifconfig. If you see the IP alias eth0:0, you are in business and have a working HA two-node cluster.
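If you want to check for the floating alias from a script rather than by eye, a one-line grep over the ifconfig output is enough. This helper is my own convenience sketch, not part of heartbeat; it reads the output on stdin so it can be tested without a live interface.

```shell
#!/bin/sh
# has_float: succeed (exit 0) when an eth0:0 alias appears in the
# ifconfig output fed to stdin.
has_float() {
    grep -q '^eth0:0'
}

# Typical use on a node:
#   ifconfig | has_float && echo "floating IP is active here"
```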
eth0      Link encap:Ethernet  HWaddr 00:60:08:26:B2:A4
          inet addr:192.168.1.3  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:15673 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17550 errors:0 dropped:0 overruns:0 carrier:0
          collisions:2 txqueuelen:100
          Interrupt:10 Base address:0x6700

eth0:0    Link encap:Ethernet  HWaddr 00:60:08:26:B2:A4
          inet addr:192.168.1.4  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:15673 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17550 errors:0 dropped:0 overruns:0 carrier:0
          collisions:2 txqueuelen:100
          Interrupt:10 Base address:0x6700

You can try ps -A | grep dhcpd, or you can release and renew the IP information on your Win9x workstation (with winipcfg), and you should see the new address for the DHCP server.