I had a host fail due to a bad stick of RAM over the weekend, and it drew my attention to something I hadn’t noticed before. There are two VM Storage Policies for Horizon View Linked Clone desktop VMs, and one of them (OS_DISK_FLOATING_<hex UUID>) is set to tolerate 0 failures in the cluster. At first I thought it was just a fluke, but I confirmed this on another View on VSAN cluster. It appears to be the default policy setting.

First, let me cover what “failures to tolerate” means. It’s a fairly straightforward concept: the number of failures (host, disk group, or networking) an object can sustain and still function. At its default value of 1, VSAN makes a copy of each VM’s objects and stores it elsewhere on the cluster, so that if a host goes down or a disk group fails, the VM data is still accessible. You can change this value. If you change it to 0 and a host fails, you lose all the data that was on the disk group(s) that failed, and those VMs become orphaned.
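To put some numbers on it, here’s a quick back-of-the-napkin sketch (plain Python, nothing VMware-specific) of how FTT affects an object’s footprint under VSAN’s default RAID-1 mirroring: FTT of n means n + 1 full replicas, and the cluster needs at least 2n + 1 hosts for quorum.

```python
# Back-of-the-napkin math for VSAN RAID-1 mirroring.
# With failures-to-tolerate (FTT) = n, VSAN keeps n + 1 full replicas
# of each object (plus small witness components for quorum), and the
# cluster needs at least 2n + 1 hosts.

def vsan_mirror_footprint(vm_disk_gb: float, ftt: int) -> dict:
    """Rough raw-capacity footprint for a single object at a given FTT."""
    replicas = ftt + 1              # full data copies
    min_hosts = 2 * ftt + 1         # replicas + witnesses need a majority
    raw_gb = vm_disk_gb * replicas  # ignores witness overhead (it's tiny)
    return {"replicas": replicas, "min_hosts": min_hosts, "raw_gb": raw_gb}

for ftt in (0, 1, 2):
    print(ftt, vsan_mirror_footprint(50, ftt))
# FTT=0 -> 1 copy,   50 GB raw: lose the disk group, lose the desktop.
# FTT=1 -> 2 copies, 100 GB raw: a host can die and the object survives.
```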

I get the sentiment here: View desktops, especially linked clone desktops, should be as stateless as possible. The problem I see is that every time I put a host into maintenance mode, I’m going to have to wait for a full data migration even if I check “Ensure accessibility,” because there is only a single copy of the data. This adds unnecessary time to maintenance jobs. It also guarantees that if a failure occurs, you’re going to lose the desktops located on that host. With FTT set to 1, the process would be: the desktop drops, the user gets kicked from their session, and the desktop reboots and gets re-provisioned. The user then reconnects and gets a new desktop. I imagine the logic here is that it doesn’t matter if the desktop comes back, and from an end user’s perspective it really doesn’t; the desktops all look the same to them.
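If you script your maintenance windows, the same trade-off shows up in the API. Below is a minimal pyVmomi sketch (the vCenter address, credentials, and host name are all placeholders) that requests a full VSAN data evacuation when entering maintenance mode; with FTT 0 that’s effectively the only safe option, since ensuring accessibility can’t protect an object that has no second copy.

```python
# Minimal pyVmomi sketch: put a host into maintenance mode with a full
# VSAN data evacuation. Hostnames and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; validate certs in prod
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)

# Returns None if the host isn't found, so check before proceeding.
host = si.content.searchIndex.FindByDnsName(dnsName="esxi01.example.com",
                                            vmSearch=False)

# With FTT=0 objects, 'ensureObjectAccessibility' can't help -- there is
# no second copy -- so a safe evacuation means moving all the data.
spec = vim.host.MaintenanceSpec(
    vsanMode=vim.vsan.host.DecommissionMode(objectAction="evacuateAllData"))

task = host.EnterMaintenanceMode_Task(timeout=0, evacuatePoweredOffVms=True,
                                      maintenanceSpec=spec)
print("Evacuation task started:", task.info.key)
Disconnect(si)
```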

The bigger issue here is that the various View databases go crazy when desktops drop out of vCenter. Before the ViewDbChk Fling from VMware, there wasn’t really an easy way to handle inconsistencies between the LDAP, vCenter, and View Composer databases; you basically had to go through each entry one by one and check whether it still existed. I usually worked around this by creating a naming convention like ViewDesktop0-{n:fixed=3}, and when the databases got out of sync I would change the naming convention to ViewDesktop1-{n:fixed=3}, and so on (see the sketch below). Now it isn’t so bad to actually clean up the database, though I haven’t used the script yet to confirm how well it works. I’ve heard it’s good, though. This will break composing, however. Composer also doesn’t like it when it can’t reach a host in the cluster it is using. I’m not sure if this is on purpose or not (I have a support request pending with VMware), though maybe someone can answer.
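To illustrate the workaround, here’s plain Python mimicking the pattern expansion (this is not a View API): View expands {n:fixed=3} into a zero-padded three-digit sequence number, so bumping the generation digit in the prefix effectively retires the stale names instead of colliding with them.

```python
# Illustration of the naming-convention workaround (plain Python, not a
# View API). View expands {n:fixed=3} into a zero-padded sequence number;
# bumping the generation digit sidesteps stale entries in the databases.

def pool_names(generation: int, count: int, prefix: str = "ViewDesktop"):
    """Names as View would generate them from e.g. ViewDesktop1-{n:fixed=3}."""
    return [f"{prefix}{generation}-{n:03d}" for n in range(1, count + 1)]

print(pool_names(0, 3))  # ['ViewDesktop0-001', 'ViewDesktop0-002', 'ViewDesktop0-003']
# After a desync, change the pool's pattern to the next generation:
print(pool_names(1, 3))  # ['ViewDesktop1-001', 'ViewDesktop1-002', 'ViewDesktop1-003']
```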

Unfortunately I don’t have a lot of answers here, just speculation as to what the purpose of these default configurations is. I’m trying to track down information and will update the post as I learn more. Hopefully this was helpful/informative. I guess I just got lucky and had two failures occur relatively close to each other.

EDIT: After a long, painful, and ultimately embarrassing (for me; always check vmkpings before calling GSS) support case on an unrelated VSAN cluster, I threw out the question to the VSAN support engineer I was working with. He basically confirmed my assumptions about their logic of keeping desktops as stateless as possible. We discussed it briefly; I’d like to get more in depth on it. Take this as an FYI if you weren’t aware: View linked clone desktops are set by default to tolerate zero failures.

Also, in regard to the View pool failing, I forgot that I had the pool set to “Stop provisioning on error,” which basically has the pool throw its hands up at the first sign of trouble. That’s why it stopped provisioning instead of just ignoring the host that had suddenly disappeared.