*Updated to include a corruption detection script and a better KB on endurance and size requirements for boot devices. Also updated for vSphere 7 guidance.*

I get a lot of questions about embedded installations of VMware vSAN.

Cormac has written some great advice on this already.

This KB explains how to increase the crash dump partition size for customers with over 512GB of RAM.

vSAN trace file placement is discussed by Cormac here.

Because vSAN does not support running VMFS on the same RAID controller used for pass-through, customers often look at embedded ESXi installs. Today a lot of deployments use embedded SD cards because they support a basic RAID 1 mirror.

The issue

While this is not directly a vSAN issue, it can impact vSAN customers. We have also identified it on non-vSAN hosts.

GSS has seen lower quality SD cards exhibit significantly higher failure rates, as bad batches in the supply chain have caused cascading failures in clusters. VMware has researched the issue and found that an amplification of reads makes the substandard parts fail more quickly. Note that the devices will not fail outright, but the problem can be detected by hashing the first 20MB repeatedly and getting different results. The issue is commonly discovered on a reboot. As a result, in 6.0U3 we have a method of redirecting VMTools to a RAMDisk, as this was found to be the largest source of reads to the embedded install. The process for setting this is as follows.

Prevention

Log in to each host over SSH and set the ToolsRamdisk option to “1”:

1. esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1

2. Reboot the ESXi host

3. Repeat for remaining hosts in the cluster.
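
For clusters with more than a handful of hosts, the repeated step can be scripted. The following is a minimal sketch, not a supported tool: it only builds and prints one ssh command line per host, using the esxcli option from step 1. The host names, the `root` user, and the helper name `build_ssh_commands` are assumptions made for illustration. It does not reboot anything, so the reboot in step 2 still has to be done for each host in turn.

```python
import shlex

# The esxcli option and value are taken from step 1 above.
SET_CMD = ["esxcli", "system", "settings", "advanced", "set",
           "-o", "/UserVars/ToolsRamdisk", "-i", "1"]


def build_ssh_commands(hosts, user="root"):
    """Return one ssh argument list per host that sets ToolsRamdisk to 1."""
    # Quote each part so the remote shell sees the command unchanged.
    remote = " ".join(shlex.quote(part) for part in SET_CMD)
    return [["ssh", f"{user}@{host}", remote] for host in hosts]


if __name__ == "__main__":
    # Hypothetical host names; replace with the hosts in your cluster.
    for cmd in build_ssh_commands(["esx01.example.com", "esx02.example.com"]):
        # Print only; review the output before running anything.
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere and leaves the decision of when to touch each host with the operator.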

Thanks to GSS/Engineering for hunting this issue down and getting this workaround out. More information can be found on the KB here. As a proactive measure, I would recommend that all embedded SD card and USB device deployments use this flag, as well as any environment that wants faster VMTools performance.

Detection

What if you do not know if you are impacted by this issue? William Lam has written this great script that will check the MD5 hash of the first 20MB in 3 passes, to detect if you are impacted by this issue. (Thanks to Dan Barr for testing).
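
To make the idea concrete, here is a minimal sketch of the same check. It is not William's script, just an illustration of the technique described above: read the first 20MB several times, hash each pass with MD5, and flag the device if the digests differ. The function names and the device path in the usage note are hypothetical, and operating system caching may serve repeated reads from memory on some platforms, which this sketch does not try to defeat.

```python
import hashlib

CHUNK = 1024 * 1024      # read 1 MiB at a time
REGION = 20 * CHUNK      # first 20MB, as described above


def hash_region(path, length=REGION):
    """Return the MD5 hex digest of the first `length` bytes of `path`."""
    md5 = hashlib.md5()
    remaining = length
    # buffering=0 disables Python-level buffering; OS caching may still apply.
    with open(path, "rb", buffering=0) as dev:
        while remaining > 0:
            block = dev.read(min(CHUNK, remaining))
            if not block:    # device or file is shorter than the region
                break
            md5.update(block)
            remaining -= len(block)
    return md5.hexdigest()


def check_boot_device(path, passes=3, length=REGION):
    """Hash the region `passes` times.

    Returns (consistent, digests), where `consistent` is True only if
    every pass produced the same digest.
    """
    digests = [hash_region(path, length) for _ in range(passes)]
    return len(set(digests)) == 1, digests
```

A call such as `ok, digests = check_boot_device("/path/to/boot/device")` (placeholder path) returns `ok` as `False` when the passes disagree, which is the symptom described in this post.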

Going forward I expect to see more deployments with high-endurance SATADOM devices, and in future server designs I expect embedded M.2 slots for boot devices to become more common and SD cards to be retired as the default option. While these devices may lack redundancy, I would expect a higher MTBF from one of them than from a pair of low quality/cost SD cards. The lack of end to end nexus checking on embedded devices vs a full drive also contributes to this. Host profiles and configuration backups can mitigate a lot of the challenges of rebuilding a host in the event of a failure.

Mitigation

Check out this KB for how to back up your ESXi configuration (somewhere other than the local device).

Evacuate the host, swap in the new device with a fresh install, and restore the configuration.

Looking for a new Boot Device?

Although a 1GB USB or SD device suffices for a minimal installation, you should use a 4GB or larger device. The extra space will be used for an expanded coredump partition on the USB/SD device. Use a high quality USB flash drive of 16GB or larger so that the extra flash cells can prolong the life of the boot media, but high quality drives of 4GB or larger are sufficient to hold the extended coredump partition. See Knowledge Base article http://kb.vmware.com/kb/2004784.

Read the new vSphere 7 boot device guidance. Embedded SD/USB installs should be viewed as a legacy option, and larger devices with better performance and endurance should be considered.

Looking for guidance on the endurance and size you need for an embedded boot device (as well as vSAN advice)? Check out KB2145210, which breaks out what different use cases need.