Why Clusters Usually Don’t Work

It is widely believed that you can solve reliability problems by just installing a cluster. The theory seems obvious: if instead of one system of a particular type you have several, with the cluster configured so that broken systems aren’t used, then reliability will increase. A cluster also allows serious routine maintenance on one system (e.g. a reboot for a kernel or BIOS upgrade) without interrupting service, apart from the very brief interruption that may be needed for resource failover. But there are some significant obstacles in the path of getting a good cluster going.

Buying Suitable Hardware

If you only have a single server doing something important and you have some budget for doing things properly, then you really must do everything possible to keep it going. You need RAID storage with hot-swap disks, hot-swap redundant PSUs, and redundant Ethernet connections bonded together. But if you have redundant servers then the requirement to make each individual server reliable is slightly reduced.
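As a hedged illustration of the bonding part, on a Debian-style system with the ifenslave package installed, a redundant pair of interfaces might be configured in /etc/network/interfaces along these lines (the interface names and addresses here are hypothetical):

    # /etc/network/interfaces fragment: bond eth0 and eth1 for redundancy
    # (hypothetical addresses; requires the ifenslave package)
    auto bond0
    iface bond0 inet static
        address 192.168.1.10
        netmask 255.255.255.0
        gateway 192.168.1.1
        bond-slaves eth0 eth1
        bond-mode active-backup
        bond-miimon 100

With active-backup mode only one link carries traffic and the other takes over if the link-status monitor (miimon) sees the first one fail.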

Hardware is getting cheaper all the time. A Dell R300 1RU server configured with redundant hot-plug PSUs, two 250G hot-plug SATA disks in a RAID-1 array, 2G of RAM, and a dual-core Xeon Pro E3113 3.0GHz CPU apparently costs just under $2,800AU (when using Google Chrome I couldn’t add some necessary jumper cables to the list, so I couldn’t determine the exact price). A cluster of two of them would therefore cost about $5,600 just for the servers. But a Dell R200 1RU server with no redundant PSUs, a single 250G SATA disk, 2G of RAM, and a Core 2 Duo E7400 2.8GHz CPU costs only $1,048.99AU. So if a low-end server is adequate, you could buy two R200 servers with no built-in redundancy for about $2,100, less than a single server with hardware RAID and redundant PSUs. The two models have different sets of CPU options and probably other differences in their technical specs, but for many applications either will provide more than adequate performance.

Using a server that doesn’t even have RAID is a bad idea. A minimal RAID configuration is a software RAID-1 array, which only requires an extra disk per server and takes the price of a Dell R200 to $1,203. So two low-end 1RU servers from Dell with minimal redundancy features (about $2,406 for the pair) are still cheaper than a single 1RU server with the full set of features. If you want to serve static content then that’s all you need, and a cluster can save you money on hardware! Of course we can debate whether any cluster node should be missing redundant hot-plug PSUs and disks, but that’s not an issue I want to address in this post.
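For reference, a minimal software RAID-1 array of the kind described can be created with mdadm; the partition names below are hypothetical:

    # Create a RAID-1 array from two partitions (hypothetical device names)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    # Watch the initial synchronisation
    cat /proc/mdstat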

Serving static content is also the simplest form of cluster. If you have a cluster running a database server then you will need a dual-attached RAID array, which makes things start to get expensive, or software for replicating the data over the network, which is difficult to configure and may be expensive. So while a trivial cluster may not cost any extra money, a real-world cluster deployment is likely to add significant expense.
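On Linux, DRBD is one well-known free option for that sort of network replication. A minimal sketch of a two-node resource definition (the hostnames, devices, and addresses are all hypothetical, syntax as in DRBD 8.x) might look something like this:

    # /etc/drbd.d/r0.res - hypothetical two-node DRBD resource
    resource r0 {
        protocol C;             # synchronous replication
        device /dev/drbd0;
        disk /dev/sdb1;
        meta-disk internal;
        on node1 {
            address 10.0.0.1:7788;
        }
        on node2 {
            address 10.0.0.2:7788;
        }
    }

Even this small example hints at the configuration burden: the resource definition, the underlying devices, and the network paths all have to match across nodes before the cluster software can even start managing the storage.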

My observation is that most people who implement clusters tend to have problems getting budget for decent hardware. When you have redundancy via the cluster you can tolerate slightly less expected uptime from the individual servers, and we can debate whether a cluster member needs redundant PSUs and other expensive features, but using a cheap desktop system as a cluster node is clearly a bad idea. Unfortunately some managers think that because a cluster solves the reliability problem you can just use recycled desktop systems as cluster nodes; this doesn’t give a good result.

Even if it is agreed that server-class hardware (with features such as ECC RAM) will be used for every node, you will still have problems if someone decides to use different hardware specs for each of the cluster nodes.

Testing a Cluster

Testing a non-clustered server, or a set of servers behind a load-balancing device, isn’t that difficult in concept. Sure, you have lots of use cases and exception conditions to test, but they are mostly straight-through tests. With a cluster you also need to test node failover at unexpected times. When a node is regarded as being in an inconsistent state (which can mean that one service it runs could not be cleanly shut down when it was due to be migrated) it will need to be forcibly rebooted, which is known as a STONITH (“Shoot The Other Node In The Head”). A STONITH event usually involves something like IPMI cutting the power or a command such as “reboot -nf”; this loses cached data and can cause serious problems for any application that doesn’t call fsync() as often as it should. It seems likely that the vast majority of sysadmins run programs that don’t call fsync() often enough, but the probability of losing data is low, and the probability of losing data in a way that you will notice (i.e. it doesn’t get automatically regenerated) is even lower. The low probability of data loss due to such race conditions, combined with the fact that a server with a UPS and redundant PSUs doesn’t unexpectedly halt very often, means that these problems don’t get found easily. But when a cluster has problems and starts triggering STONITH events, the probability of noticeable data loss increases considerably.
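To make the fsync() point concrete, here is a minimal C sketch of the standard write-to-temporary, fsync(), rename() pattern that lets a file update survive a STONITH at any moment. The file names are hypothetical and error handling is abbreviated:

    /* Minimal sketch of a crash-safe file update: write a temporary
     * file, fsync() it, then rename() it over the old file.
     * File names are hypothetical; error handling is abbreviated. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *data = "important state\n";
        int fd = open("state.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, data, strlen(data)) != (ssize_t)strlen(data)) {
            perror("write"); return 1;
        }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }  /* force data to disk */
        if (close(fd) != 0) { perror("close"); return 1; }
        /* rename() is atomic, so readers see either the old or new file */
        if (rename("state.tmp", "state") != 0) { perror("rename"); return 1; }
        /* fsync() the directory too, so the rename itself is durable */
        int dfd = open(".", O_RDONLY);
        if (dfd >= 0) { fsync(dfd); close(dfd); }
        return 0;
    }

An application that skips the fsync() calls will usually get away with it on a machine that rarely loses power, which is exactly why a cluster that starts forcing reboots exposes such bugs.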

Getting cluster software to work correctly isn’t easy. I filed Debian bug #430958 about dpkg (the Debian package manager) not calling fsync() and thus having the potential to leave systems in an inconsistent or unusable state if a STONITH happened at the wrong time. I was inspired to look for this problem after finding the same problem with RPM on a SUSE system. The result of applying a patch to call fsync() on every file was bug report #578635 about the performance of doing so; the eventual solution was to call sync() after each package is installed. Next time I do any cluster work on Debian I will have to test whether the sync() code works as desired.
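To show the trade-off those two bug reports describe, here is a rough C sketch (hypothetical file list, error handling omitted) contrasting a per-file fsync() with a single sync() for the whole package:

    /* Rough sketch of the two approaches; the file list is
     * hypothetical and error handling is omitted for brevity. */
    #include <fcntl.h>
    #include <unistd.h>

    static const char *files[] = { "a.conf", "b.so", "c.bin" };

    /* Safe but slow: one disk flush per file unpacked. */
    static void install_with_fsync(void)
    {
        for (int i = 0; i < 3; i++) {
            int fd = open(files[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
            /* ... write the file contents here ... */
            fsync(fd);          /* flush this file before the next */
            close(fd);
        }
    }

    /* Roughly what dpkg ended up doing: unpack everything, flush once. */
    static void install_with_sync(void)
    {
        for (int i = 0; i < 3; i++) {
            int fd = open(files[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
            /* ... write the file contents here ... */
            close(fd);
        }
        sync();                 /* one flush for the whole package */
    }

    int main(void)
    {
        install_with_fsync();
        install_with_sync();
        return 0;
    }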

Getting software to work in a cluster requires fixing not only bugs in system software such as dpkg, but also bugs in third-party applications and in-house code. Please, someone, write a comment claiming that their favorite OS has no such bugs and that the commercial and in-house software they use is also bug-free – I could do with a cheap laugh.

For the most expensive cluster I have ever installed (worth about 4,000,000 UK pounds – back when the pound was worth something) I was not allowed to power-cycle the servers. Apparently the servers were too valuable to be rebooted in that way, so if they happened to have any defective hardware or buggy software that would do something undesirable after a power problem, it would become apparent in production rather than being a basic warranty or patching issue before the system went live.
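For reference, the kind of power-cycle test that was disallowed is a single command on any server with an IPMI management controller (the BMC address and credentials here are hypothetical):

    # Forcibly power-cycle a server via its management controller
    # (hypothetical BMC address and credentials)
    ipmitool -I lanplus -H 10.0.0.100 -U admin -P secret chassis power cycle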

I have heard many people argue that if you install a reasonably common OS on a server from a reputable company and run reasonably common server software, then the combination will have been tested before and therefore almost no testing is required. I think that some testing is always required (and I always seem to find some bugs when I do such tests), but I seem to be in a minority on this issue as less testing saves money – unless, of course, something breaks. The need to test systems before going live is much greater for clusters, but most managers don’t allocate budget and other resources for this.

Finally there is the issue of testing custom code and the user experience. What is the correct thing for an interactive application to do when one of the cluster nodes goes down, and how would you implement it at the back-end?

Running a Cluster

Systems don’t just sit there unchanged; there are new versions of the OS and applications, and requirements for configuration changes. This means that the people who run the cluster should ideally have some specialised cluster skills. If you hire sysadmins without regard to cluster skills then you will probably end up not hiring anyone with prior experience of the cluster configuration that you use. Learning to run a cluster is not like learning to run yet another typical Unix daemon; it requires some differences in the way things are done. All changes have to be strictly made to every node in the cluster – having the cluster fail over to a node that wasn’t upgraded and can’t understand the new data is not fun at all!
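As a trivial hedged example, one way to catch a node that missed a change is to compare checksums of the critical configuration files across all nodes; the hostnames and file path below are hypothetical:

    # Check that a critical config file is identical on every node
    # (hypothetical hostnames and file path)
    for host in node1 node2; do ssh $host md5sum /etc/ha.d/haresources; done

Mismatched checksums found by a script like this are a lot cheaper than the same mismatch found by an unplanned failover.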

My observation is that when a team of sysadmins with no prior cluster experience is hired to run a cluster, the result usually involves “learning experiences” for everyone. It’s probably best to assume that every member of the team will break the cluster and cause down-time on at least one occasion! This can be alleviated by having only one or two people ever work on the cluster and having everyone else delegate cluster work to them. Of course if something goes wrong when the cluster experts aren’t available, the result is even more downtime than might otherwise be expected.

Hiring sysadmins who have prior experience running a cluster with the software that you use is going to be very difficult. Any organisation that is planning a cluster deployment should therefore plan a training program for sysadmins: have a set of test machines suitable for running a cluster, and have every new hire install the cluster software and get it all working correctly. It’s expensive to buy extra systems for such testing, but it’s much more expensive to have people who lack the necessary skills try to run your most important servers!

The trend in recent years has been towards sysadmins not being system programmers. This may be a good thing in other areas, but for clustering it is very useful to have a degree of low-level knowledge of the system that you can only gain by doing some system coding in C.

It’s also a good idea to have a test network with machines in an almost identical configuration to the production servers. Being able to deploy patches to test machines before applying them in production is a really good thing.

Conclusion

Running a cluster is something that you should either do properly or not at all. If you do it badly then the result can easily be less uptime than a single well-run system.

I am not suggesting that people avoid running clusters. You can take this post as a list of suggestions for what to avoid doing if you want a successful cluster deployment.