Over the past nine or so months, my colleagues and I have been investigating hyper-converged systems running software-defined storage as the next step in our virtualization initiative.

We are very close to reaching 100% x86 virtualization across all our datacenters by decommissioning or P2Ving our remaining legacy physical servers, and I believe hyper-converged infrastructure and software-defined storage is the next logical step towards our goal of a totally software-defined and automated datacenter. The main drivers for this journey were lowering the cost of maintaining and running storage, and eliminating unneeded complexity in our infrastructure. Enterprise storage is going to have a shake-up over the coming years.

During this phase we investigated many, if not all, of the solutions out there in various forms of testing, and I would like to share what you should look for when running your own investigation. I will leave the list vendor-agnostic until the end. These are my own opinions; I do not work for a vendor, and this is based purely on my perspective as a customer.

The list is long, but hopefully it will save you some time and help you during your investigation and POC phase.

Gartner often plays a part in purchasing decisions, so how is the solution represented on Gartner’s new magic quadrant for converged systems? If the vendor is listed as a leader, are they a true hyper-converged solution, or are they what I would consider version 1 of converged systems? E.g. a rack of pre-built and tested legacy components that work well together, but do not prevent forklift upgrades.

Will the vendor allow you to power off a node while testing? How about two or even three nodes simultaneously? How about pulling disks out while doing this? What happens to running VMs while this happens? How hesitant do they look when you suggest these tests? Remember, data integrity is ALWAYS more important than performance. You do not want to be restoring from backups because data was corrupted when an entire rack or datacenter lost power. Even if the restore process is easy, some applications require an RPO of zero, and data should never be at risk of corruption on enterprise storage.

Are customer references available on request at the same scale or in the same industry as yours? Take these calls. Do the customers sound enthusiastic about the product?

How easy is it to install new nodes? Can you easily scale up, or even down, as new projects come on board or systems are decommissioned? Can the entire node or cluster installation process be automated, or is there a set of manual tasks that must be completed after the first cluster is configured?

How easy is it to train your operations team on the product? Does it require a lengthy set of complex standard operating procedures?

How many manual tasks can be eliminated with the product? A business case showing savings in operational hours and tasks should be presented to the stakeholders as part of the decision to show true TCO. Make sure to involve the teams responsible for managing your existing infrastructure, and have them list their daily, weekly and monthly tasks on the current infrastructure.
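To make that business case concrete, a simple calculation like the one below can be dropped into a spreadsheet or script. Every task, duration and rate here is a hypothetical placeholder; substitute the figures your own teams report.

```python
# Illustrative TCO comparison. All task counts, durations, and the hourly
# rate below are invented placeholders -- use your own team's numbers.

HOURLY_RATE = 75  # assumed fully loaded cost per operations hour

# (task, occurrences per month, hours per occurrence, eliminated by the new platform?)
tasks = [
    ("LUN provisioning / zoning",   8, 1.5, True),
    ("SAN firmware checks",         1, 4.0, True),
    ("Storage capacity reporting",  4, 1.0, False),  # still needed on either platform
    ("Hypervisor patching prep",    1, 3.0, False),
]

# Annual hours freed up by tasks the new platform removes entirely
eliminated_hours = sum(n * h for _, n, h, gone in tasks if gone) * 12
annual_saving = eliminated_hours * HOURLY_RATE

print(f"Hours eliminated per year: {eliminated_hours}")
print(f"Estimated annual saving:   ${annual_saving:,.0f}")
```

Even rough numbers like these give stakeholders something to argue about, which is far more productive than a slide that just says "less operational overhead".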

How well is the support team rated? Is it easy to open support tickets and keep track of the nodes you have purchased? Does the support team have experts with experience across the entire converged infrastructure stack, e.g. hypervisor, network, even application tuning?

Are multiple support tiers available? Consider whether you really need a four-hour response time if the solution is robust and has no single points of failure.

Can the solution automatically notify the vendor of any hardware or software issues, like a traditional SAN? E.g. a disk failure that has not been replaced after a set number of days. Does the solution support standard monitoring protocols such as SNMP?

How is logging handled? Can logs be forwarded to a centralised logging solution if this is required for compliance?
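If the product can forward logs via syslog, wiring it into a central collector can be as simple as the sketch below. This uses Python's standard library purely to illustrate the pattern; the collector address is a placeholder, and a real product would do this through its own configuration rather than your code.

```python
import logging
import logging.handlers

# Minimal sketch of forwarding events to a central syslog collector.
# "127.0.0.1" stands in for your collector's address -- an assumption,
# not a real endpoint. Transport here is plain UDP syslog on port 514.
logger = logging.getLogger("hci-platform")
logger.setLevel(logging.INFO)

handler = logging.handlers.SysLogHandler(address=("127.0.0.1", 514))
handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
logger.addHandler(handler)

# This record is shipped over UDP to the collector for retention/compliance.
logger.info("disk replaced in node 3")
```

The point to verify during a POC is simply whether the product exposes a syslog (or similar) forwarding option at all, and whether the format it emits is something your compliance tooling can parse.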

How well does the product work with existing enterprise backup solutions? Are features such as CBT and NBD transport supported?

Ask if you can sign an NDA and look at the company’s product roadmap. Do you see innovation coming from the company in the future? Also, don’t just check the future roadmap; check the roadmaps from previous years to see whether upcoming features were actually delivered on time. If they weren’t, ask why they were delayed. A delay is fine if the features required extra QA, as I believe a product should be stable before release. If features from previous roadmaps have been removed completely from future roadmaps, ask why, especially if they were announced as hot features in product marketing material.

Does the solution support multiple hypervisors? How easy is it to switch from one hypervisor to another? This may not be an important feature depending on the customer.

Are updates non-disruptive and non-destructive? Is the product dependent on underlying hardware that could exclude you from future updates? How automated is the update process? Do updates require hunting down firmware and upgrade packages from various websites?

Can storage pools be grown and shrunk independently of compute clusters? This is important for infrastructure workloads as storage capacity, compute, and memory requirements will grow differently depending on the application.

Can different node types be mixed and matched in clusters? E.g. storage-heavy nodes mixed with compute-heavy nodes.

Does the solution support dedupe and compression? Can these features be disabled or tuned per VM or per datastore? Remember, some applications will benefit more from data locality than from distributed deduplication.

How does the product communicate with the underlying hypervisor? Does it depend on underlying infrastructure components such as vCenter or SCVMM? This can cause issues when planning hypervisor updates.

How easily can the product be managed and monitored? Is the GUI intuitive enough to use without separate documentation? How easy is it to troubleshoot any performance-related issues?

What is the impact if a node fails? How long will you be unprotected while the cluster rebalances itself? Are there any performance implications that must be designed for?
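You can sanity-check a vendor's answer to the re-protection question with some back-of-the-envelope arithmetic. Every figure below is an assumption for illustration; real rebuild behaviour varies by product (many rebuild many-to-many across all peers, which is exactly what this model favours), so confirm the actual mechanism during the POC.

```python
# Rough estimate of the unprotected window after a node failure.
# All inputs are assumed example values, not vendor specifications.

node_data_tb = 20            # usable data resident on the failed node
nodes_remaining = 7          # peer nodes participating in the rebuild
per_node_rebuild_gbps = 2.0  # throughput each peer can dedicate (assumed)

total_gbits = node_data_tb * 8 * 1000  # TB -> gigabits (decimal units)
rebuild_seconds = total_gbits / (nodes_remaining * per_node_rebuild_gbps)

print(f"Estimated re-protection window: {rebuild_seconds / 3600:.1f} hours")
```

The useful property this exposes: if rebuilds are distributed, the window *shrinks* as the cluster grows, whereas a design that rebuilds through a single hot-spare node gets slower as nodes hold more data.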

Can the solution easily be sized as a common platform for ALL infrastructure workloads? What about monster VMs with extremely large storage requirements that exceed the aggregate capacity of a single node?

Does the solution have API support so you can tie it into your existing automation initiatives?
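A quick way to evaluate this during a POC is to script one real task end to end against the product's REST API. The sketch below only *builds* an authenticated request; the base URL, port, endpoint path and payload fields are all invented for illustration and will differ per vendor, so check the product's API documentation for the real ones.

```python
import json
import urllib.request

# Hypothetical endpoint -- host, port and path are placeholders, not a
# real vendor API. Substitute values from your product's API reference.
API_BASE = "https://cluster.example.com:9440/api/v1"

def create_vm_request(name: str, vcpus: int, memory_mb: int,
                      token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated VM-creation request."""
    body = json.dumps({"name": name, "vcpus": vcpus,
                       "memory_mb": memory_mb}).encode()
    return urllib.request.Request(
        f"{API_BASE}/vms",
        data=body,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = create_vm_request("web-01", vcpus=4, memory_mb=8192, token="***")
# In practice you would dispatch it with urllib.request.urlopen(req).
print(req.get_method(), req.full_url)
```

If a task like this takes more than an afternoon to script, or the API lags behind what the GUI can do, that tells you a lot about how well the product will slot into an existing automation initiative.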

So, in case you didn’t guess, we decided to go with Nutanix for our new hyper-converged platform. I had had my eye on them for a while after a lot of top VCDXs began working for them, and I wondered what they were all about. After some research I discovered the blog of Steven Poitras (Solutions Architect at Nutanix) and began reading; it is here if you want a technical deep dive into how Nutanix works:

http://stevenpoitras.com/the-nutanix-bible/

This got me thinking: if a company is willing to show me, as a technical guy, openly how the product actually works and how data integrity and availability are maintained, while risking other companies following their formula, they must be on to something good and confident in the reliability of their product. This is something I have never seen in the industry before.

So our reasons for choosing Nutanix were simple: it was the only solution so far that ticked all the boxes in our requirements. So far, I have been impressed with the level of support we have had from Nutanix, and we have some big upcoming infrastructure projects running purely on Nutanix that I hope to blog about. If you want to try them out, I recommend giving your local SE a call and asking for a POC. They will be happy to help.