
Introduction

If you have been in IT for a while, you are probably aware of how important power is for your datacenter.

I’ve lost count of how many times I’ve read or been told that power is the number one cost in a modern datacenter; it is a frequent refrain. Virtualization has helped us over the years to consolidate workloads and reduce power costs, but power is still often described as the fastest-growing operational cost of a high-scale service.

Redundancy is critical for every organization. For example, using two power supplies in a server, or a RAID system for storage, will generally provide enough time for a failed component to be replaced. This is known as the “N+1” approach. However, for systems where failure is simply not acceptable, an “N+M” approach (having more than one extra component in place) may be used.
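To make the N+1 and N+M idea concrete, here is a minimal sketch in Python. It assumes that N components are required to carry the load and M spares are installed on top of that; the function name and parameters are hypothetical and only illustrate the arithmetic.

```python
def survives(required: int, spares: int, failed: int) -> bool:
    """Return True while enough working components remain to carry the load.

    required: N, the number of components needed to run the system
    spares:   M, the number of extra components installed (1 for "N+1")
    failed:   how many components have failed so far
    """
    installed = required + spares
    remaining = installed - failed
    return remaining >= required


# N+1: a server that needs one power supply and has two installed
print(survives(required=1, spares=1, failed=1))  # one failure is tolerated
print(survives(required=1, spares=1, failed=2))  # a second failure is not

# N+M with M=2: two failures are tolerated
print(survives(required=4, spares=2, failed=2))
```

In other words, the system tolerates as many failures as it has spares, which is why N+M costs more but buys more time.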

Within the datacenter itself, more modular uninterruptible power supplies (UPSs), power filters, generators and air-conditioning with built-in N+1 redundant power supplies, batteries and so on can also be used to increase redundancy and protect your servers.

What about room failure? Protecting against that would require building two datacenters, either within the same building or in a different city, with the facility services mirrored across both (N+1 power distribution networks, UPSs, cooling systems and so on). This is, by its very nature, far too expensive.

And the list goes on and on…

Fault Tolerance

We recently deployed a four-node Storage Spaces Direct cluster using the hyper-converged model.

For more information about Storage Spaces Direct (S2D) in Windows Server 2016, please check the overview here.

This technology is really awesome in terms of simplicity, performance, fault tolerance, efficiency, manageability and much more.

We are very happy with the results, and the performance we get out of four nodes is fantastic.

With four servers we can tolerate up to two faults. Here are six examples of circumstances in which the system stays online.

1. One drive lost (includes cache drives).

2. One server lost.

3. One server and one drive lost.

4. Two servers lost.

5. Two drives lost in different servers.

6. More than two drives lost, provided that at most two servers are affected. In other words, if two drives are lost on “Server 1” and two other drives are lost on “Server 3”, the system stays online.

In each of the six scenarios above, all volumes will stay online, provided that your cluster maintains quorum!

So as you can see, with four servers we have fairly good fault tolerance.
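The six scenarios share one rule: no more than two servers may be affected, whether a server is lost entirely or only some of its drives are, and the cluster must keep quorum. The sketch below expresses that rule in Python. It is only a model of the list above, not of how Storage Spaces Direct works internally; the class, field and function names are hypothetical, and quorum is reduced to a simple flag.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Assumption taken from the list above: at most two servers may be affected.
MAX_AFFECTED_SERVERS = 2


@dataclass
class ClusterState:
    servers_down: Set[str] = field(default_factory=set)       # servers lost entirely
    drives_lost: Dict[str, int] = field(default_factory=dict)  # server -> lost drives
    has_quorum: bool = True                                     # simplified quorum flag


def affected_servers(state: ClusterState) -> Set[str]:
    """Servers that are down or have lost at least one drive."""
    with_lost_drives = {name for name, count in state.drives_lost.items() if count > 0}
    return state.servers_down | with_lost_drives


def volumes_stay_online(state: ClusterState) -> bool:
    """Apply the rule from the list: quorum held and at most two servers affected."""
    return state.has_quorum and len(affected_servers(state)) <= MAX_AFFECTED_SERVERS


# Scenario 3: one server and one drive lost
print(volumes_stay_online(ClusterState(servers_down={"Server 1"},
                                       drives_lost={"Server 2": 1})))

# Scenario 6: two drives lost on "Server 1" and two on "Server 3"
print(volumes_stay_online(ClusterState(drives_lost={"Server 1": 2, "Server 3": 2})))

# Drives lost on three different servers exceeds the limit
print(volumes_stay_online(ClusterState(drives_lost={"Server 1": 1,
                                                    "Server 2": 1,
                                                    "Server 3": 1})))
```

Counting affected servers rather than individual drives is what makes scenario 6 work: four lost drives still touch only two servers.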

Expect The Unexpected

The million-dollar question is: what if all servers go down?

Is there really such a thing as a 3 a.m. wake-up call to fix a system?

Well, I received that call from one of my colleagues, telling me that the Storage Spaces Direct cluster was not turning on.

Long story short, a big electrical spark at one of our datacenters tripped everything. Half of the servers’ power supplies were connected to the main power and half to the UPS. The spark burned out the PDUs in the rack and the power supplies of all servers.

Yes, I know it’s a bad situation to be in… and redundant power supplies won’t even help in this scenario!

We waited until the next day to receive the new power supplies and replaced them.

After replacing the power supplies, we brought all the nodes up at the same time and guess what?

Storage Spaces Direct survived this failure and we were able to recover. The system came back to a normal state, and the resync took around 25 minutes to complete.

Zero Data Loss!!!

I would like to add an additional example to the list mentioned earlier.

7. If all servers go down at once, as if someone had pulled the power cables, Storage Spaces Direct will recover from the complete power loss. (This is my own experience; it’s not supported by Microsoft.)

Kudos to Storage Spaces Direct Team!!!

Please note that the Resilient File System (ReFS), Microsoft’s newest file system, is recommended for use with Storage Spaces Direct. ReFS is designed to maximize data availability, scale efficiently to large data sets across diverse workloads, and provide data integrity by means of resiliency to corruption.

Lessons Learned

Is your disaster recovery plan updated and maintained? What about your backups? Are you using Storage Spaces Direct? I strongly recommend that you start evaluating this awesome technology if you’re not doing so already.

Businesses today demand greater availability from their infrastructure. To achieve high uptime, you must protect against even highly unlikely occurrences such as power failures, rack outages, or natural disasters.

For example, to be rack fault tolerant, your servers and your data must be distributed across multiple racks. Look at fault domain awareness in Storage Spaces Direct, which uses fault domains to maximize data safety.

For more information about fault domains in Windows Server 2016, please check the overview here.

Storage Spaces Direct (S2D) and Storage Replica (SR) are better together. Look at how you can achieve maximum protection by combining these technologies. More information about Storage Replica in Windows Server 2016 here.

A big thank you to all my fellow MVPs and the Microsoft product group who offered their support during this outage.

I hope my real-world experience will help someone out there.

Thanks for reading!
