Cluster Optimisation: Hunting Down CPU Differences

By Pedro Pessoa, Operations Engineer at Server Density.

Published on the 3rd December, 2015.

Notice any unusual activity in your cluster?

The first thing to do is look for any subtle differences between the participating servers. The obvious place to start is—you guessed it—software.

Given the sprawling (often anarchic) collection of apps sitting on most servers, manually tracking down deltas in software versions can be onerous. Thankfully, the growing adoption of modern config management tools like Puppet and Chef has made this exercise much easier.
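For a quick manual check, it also helps to diff the package lists directly rather than trusting the manifests alone. Here is a minimal sketch of that idea, assuming passwordless SSH, Debian-style dpkg and placeholder host names (an illustration, not our actual tooling):

```python
#!/usr/bin/env python
"""Minimal sketch: diff installed package versions across cluster nodes.

Assumes passwordless SSH and a Debian-style dpkg; the host names are
placeholders, not a real inventory.
"""
import subprocess

HOSTS = ["queue-web1", "queue-web2", "queue-web3", "queue-web4"]  # placeholders


def package_versions(host):
    """Return {package: version} as reported by dpkg on the remote host."""
    out = subprocess.check_output(
        ["ssh", host, "dpkg-query -W -f='${Package} ${Version}\\n'"],
        universal_newlines=True,
    )
    return dict(line.split(None, 1) for line in out.splitlines() if line.strip())


versions = {host: package_versions(host) for host in HOSTS}
for pkg in sorted(set().union(*versions.values())):
    seen = {versions[h].get(pkg, "<missing>") for h in HOSTS}
    if len(seen) > 1:  # any disagreement between nodes is worth a look
        print(pkg)
        for h in HOSTS:
            print("  {:<12} {}".format(h, versions[h].get(pkg, "<missing>")))
```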

Once software inconsistencies are ruled out, the next step is to look at hardware. It’s a classic case of playing detective, i.e. searching for clues and spotting anything out of the ordinary in your infrastructure.

Here is how we do this at Server Density.

Cluster Optimisation: The Process

We do weekly reviews of several performance indicators across our entire infrastructure.

This proactive exercise helps us spot subtle performance declines over time. We can then investigate any issues, schedule time for codebase optimisations and plan for upgrades.

Since we use Server Density to monitor Server Density, those reviews are easy. It only takes a couple of minutes to perform this audit, using preset time intervals on our performance dashboards.

The Odd Performance Values

It was during one of those audits, exactly this time last year, that we observed a particularly weird load profile. Here is the graph:

This is a 4-server queue processing cluster which runs on Softlayer dedicated hardware (SuperMicro, quad-core Xeon 1270s, 8GB RAM). We’d just finished upgrading those seemingly identical servers.

The entire software stack is built from the same source using Puppet. Our deploy process ensures all cluster nodes run exactly the same versions. So why was one of the servers exhibiting a lower load for the exact same work? We couldn’t justify the difference.

With the software element taken care of (config management), we turned our attention to hardware and got in touch with Softlayer support.

“There are no discernible differences between the servers,” was their response.

The Plot Thickens

Feeling uneasy about running servers that should behave the same but didn’t, we decided to persevere with our investigation. Soon we discovered another, more worrying, issue: packet loss on the 3 servers with the higher load.
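This kind of asymmetry is easy to reproduce outside a monitoring dashboard: a burst of ping probes per node makes it visible. A minimal sketch, assuming a Linux ping summary line and placeholder host names:

```python
#!/usr/bin/env python
"""Minimal sketch: compare packet loss across cluster nodes with ping.

Assumes a Linux ping whose summary line reports "% packet loss";
host names and the probe count are placeholders.
"""
import re
import subprocess

HOSTS = ["queue-web1", "queue-web2", "queue-web3", "queue-web4"]  # placeholders
PROBES = 50


def packet_loss(host):
    """Return the percentage packet loss reported by ping, or None if unparsable."""
    try:
        out = subprocess.check_output(
            ["ping", "-q", "-c", str(PROBES), host], universal_newlines=True
        )
    except subprocess.CalledProcessError as exc:
        out = exc.output or ""  # ping exits non-zero when replies go missing
    match = re.search(r"([\d.]+)% packet loss", out)
    return float(match.group(1)) if match else None


for host in HOSTS:
    loss = packet_loss(host)
    flag = "  <-- investigate" if loss else ""
    print("{:<12} {}% packet loss{}".format(host, loss, flag))
```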

Armed with those screenshots, we went straight back to Softlayer support.

They were quite diligent and “looked at the switch/s for these servers, network speed, connections Established & Waiting, apache/python/tornado process etc…”

Even so, they came back empty-handed. Except... for a subtle difference in the cluster hardware:

“all of the processors are Xeon 1270 Quadcores, -web4 is running V3 and is the newest; -web2 and -web3 is running V2; -web1 is running V1”.

Smoking Gun

When ordering new servers, we get to pick the CPU type, but not the CPU version. As it turns out, the datacenter team provides whatever CPU version they happen to have “in stock”.

We now knew what to look for.

After some further inspection, we spotted several potentially interesting differences in CPU versions throughout our infrastructure. We decided to eliminate all of them and see what happened.
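The inspection itself is simple, since the CPU version shows up in /proc/cpuinfo. Here is a minimal sketch that groups nodes by model string, assuming SSH access and placeholder host names:

```python
#!/usr/bin/env python
"""Minimal sketch: group cluster nodes by CPU model string.

Assumes passwordless SSH; host names are placeholders. On these Xeons the
"model name" line in /proc/cpuinfo carries the version suffix
(e.g. E3-1270 vs E3-1270 v2 vs E3-1270 v3).
"""
import subprocess
from collections import defaultdict

HOSTS = ["queue-web1", "queue-web2", "queue-web3", "queue-web4"]  # placeholders


def cpu_model(host):
    """Return the first 'model name' entry from /proc/cpuinfo on the remote host."""
    out = subprocess.check_output(
        ["ssh", host, "grep -m1 'model name' /proc/cpuinfo"],
        universal_newlines=True,
    )
    return out.split(":", 1)[1].strip()


groups = defaultdict(list)
for host in HOSTS:
    groups[cpu_model(host)].append(host)

for model, hosts in sorted(groups.items()):
    print("{}: {}".format(model, ", ".join(hosts)))
if len(groups) > 1:
    print("Mixed CPU versions detected -- worth raising with the provider.")
```

Any cluster that ends up in more than one group is a candidate for the kind of request we made next.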

Softlayer is good at accommodating such special requests, and we had no difficulty getting this one through.

The following graph shows the replacement of -web1 and then -web2 and -web3. Can you spot the improvement?

Here is a similar plot for cluster packet loss:

It could be that the CPU version was incompatible with the hardware drivers, or that a whole host of other issues lay obscured beneath that CPU version delta. Either way, switching all the servers to a consistent CPU version solved the problem. All packet loss disappeared and performance equalised.

Summary – What We Learned

Consistency within clusters is a good thing to have. Specifically: