Today the Virtual SAN team at VMware released a content pack for Log Insight. This expands on the fairly basic single dashboard in the vSphere content pack, which combines VSAN and VVOLs and is really just a glorified “is it running” dashboard. The new content pack contains useful information and a way to easily visualize common problems in a VSAN deployment. Since I made a feature request around this on the VMware Log Insight forum, I figured I would do a quick run through of what is provided in the content pack and my opinions on it.

Host State Information

The first dashboard covers the state of the hosts themselves. It covers good-to-know information, such as hosts that are currently in maintenance mode or hosts performing destroy jobs or role initializations. These metrics are really helpful when there is a problem, but there aren’t any widgets that display metrics for healthy hosts. I believe in having at least some sort of complete representation of the data being ingested, partly so that you know it is working and partly because I like seeing data and the aesthetics of Log Insight are pleasing to the eye. It is also nice to get some reassurance that nothing is going wrong. Since this is the first dashboard in the pack, it is a little disconcerting to see no data show up.

Diskgroup Failures

The Diskgroup Failures dashboard shows failures specifically related to diskgroups, such as disks being offline or in permanent failure, which suggests a disk failure. The last two widgets at the bottom of the page show failures to create components caused by two common issues: lack of capacity and maximum components reached. This should help you diagnose and alert on under-provisioned clusters quickly and easily. Be sure to reference the sizing guide for a VSAN cluster so that you don’t have to see any data in these widgets.

Networking

Networking is a fairly basic dashboard that shows network creation and network connectivity failures. Network connectivity failures are an obvious metric: if your VSAN hosts can’t talk to each other, there is a problem. This would have been helpful for me a few weeks ago. Network creation will give you information about hosts that are coming online, which can help with maintaining and auditing the security of your VSAN network. If you see any data in this widget and you aren’t in the middle of performing a deployment or cluster upgrade, you probably have a problem.

Congestion

The Congestion dashboard has some great information for diagnosing performance problems. These widgets will tell you if you have any SSD congestion, which I assume in an All-Flash configuration would still just be the caching tier drive. This will tell you if you have under-sized your deployment with regard to your flash to magnetic disk ratio, or if you are having a large spike of load on your VSAN cluster. The final widget shows the number of device latency events. Note that this widget will show information for all the disks in a VSAN cluster, not just the disks contributing to the VSAN datastore. This could be useful for locating potentially failing disks that are having trouble delivering the IOPS they are rated for. It could also be used to identify disks that are not rated well enough to handle the workload on VSAN, and to demonstrate the impact of cheaper drives on a VSAN cluster. There is a lot of good information for troubleshooting in this dashboard.
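As a rough way to sanity check the flash to magnetic disk ratio mentioned above, here is a minimal sketch. It assumes the commonly cited rule of thumb that the flash cache should be about 10% of the anticipated consumed capacity; that target, the function names, and the sample numbers are my own illustration, so check the sizing guide for your VSAN version before relying on it.

```python
def flash_ratio(flash_cache_gb: float, consumed_capacity_gb: float) -> float:
    """Return cache-tier flash as a fraction of anticipated consumed capacity."""
    if consumed_capacity_gb <= 0:
        raise ValueError("consumed capacity must be positive")
    return flash_cache_gb / consumed_capacity_gb


def is_flash_undersized(flash_cache_gb: float,
                        consumed_capacity_gb: float,
                        target_ratio: float = 0.10) -> bool:
    """True when the flash ratio is below the target.

    target_ratio defaults to 10%, which is an assumed rule of thumb,
    not a value taken from the content pack.
    """
    return flash_ratio(flash_cache_gb, consumed_capacity_gb) < target_ratio


# Hypothetical cluster: 400 GB of cache flash for 8 TB of consumed capacity.
print(flash_ratio(400, 8000))          # 0.05
print(is_flash_undersized(400, 8000))  # True
```

If the ratio comes out low and the SSD congestion widgets are busy, under-sized flash is a reasonable first suspect before blaming a load spike.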

Object Configuration

Next is my favorite dashboard, Object Configuration. This has a lot of really good data that can assist with diagnosing and troubleshooting various issues that can occur in a VSAN cluster. The widgets cover various actions on the objects of the file system, including creation, repair, rebalance, and votes rebalance. For those unfamiliar with VSAN I’ll go into some more detail on these different actions. Repairs can occur when a host fails, leading to a rebuild process, and a large spike in repair operations can be an indicator of this. Rebalance actions occur when the disks of a VSAN cluster become full. You can trigger a proactive rebalance from the Ruby vSphere Console, but outside of that you should not be observing too many rebalance operations. If you are observing constant rebalance actions, your VSAN disk capacity is not large enough to support your environment, and it is time to invest in more disks. This widget helped me visualize the impact that constant rebalancing is having on a current customer. Paired with the rebalance widget is the cleanup widget, which displays cleanup operations performed after a rebalance.

Decommissioning

The Decommissioning dashboard is hopefully something you will not have to use very often. As the name implies, it displays events related to decommissioning in a VSAN cluster. This includes hosts entering maintenance mode and failing to do so, as well as the decommissioning of disks and failures of that process. Hopefully you never see this one populated when you don’t expect it to be.

Configuration Failures

The final dashboard is Configuration Failures. These widgets are meant to display issues with the configuration, again as the name implies. These configurations have to do with the VM policies and whether there is enough physical hardware to satisfy the requirements of the policy created. Insufficient fault domains, insufficient space, generation failure, disk assignment, and RAID tree depth are all related to whether or not there are enough hosts or enough disks to handle the VM policy. Fault domains are directly related to the number of hosts in the cluster. To resolve any events found in this widget you will need to add more hosts to increase the number of fault domains. Insufficient space is what it sounds like, and requires more disks, or more hosts with more disks. The same is true of generation failure and disk assignment.
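To make the fault domain point a bit more concrete, here is a small sketch of the check a policy has to pass. It assumes mirrored objects and the default layout where every host is its own fault domain, in which case tolerating n failures needs 2n + 1 fault domains. The function names are hypothetical and only for illustration.

```python
def required_fault_domains(failures_to_tolerate: int) -> int:
    """Fault domains needed for a mirrored object: 2n + 1 for n failures."""
    if failures_to_tolerate < 0:
        raise ValueError("failures to tolerate cannot be negative")
    return 2 * failures_to_tolerate + 1


def can_satisfy_policy(host_count: int, failures_to_tolerate: int) -> bool:
    """Assumes one fault domain per host (no custom fault domains defined)."""
    return host_count >= required_fault_domains(failures_to_tolerate)


print(required_fault_domains(1))  # 3
print(can_satisfy_policy(3, 1))   # True
print(can_satisfy_policy(3, 2))   # False, this policy would need 5 hosts
```

A policy that fails this check is the kind of thing that would show up as an insufficient fault domains event.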

RAID tree depth is kind of a non-obvious term that I’ll explain quickly. It is related to the stripe width of an object, which is set in the VM storage policy and dictates how many disks each of the VM’s objects should be striped across. For example, if you have a VM and specify that the stripe width should be two, then VSAN will stripe the vmdk over two disks.
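The stripe width example above can be sketched in a few lines. This is only an illustration: the even split and the function names are my own simplification, and the disk count assumes mirroring where each replica is striped on its own set of disks (witness components are ignored).

```python
def stripe_components(vmdk_gb: float, stripe_width: int) -> list[float]:
    """Split one copy of a vmdk evenly into stripe_width stripe components."""
    if stripe_width < 1:
        raise ValueError("stripe width must be at least 1")
    return [vmdk_gb / stripe_width] * stripe_width


def min_capacity_disks(stripe_width: int, failures_to_tolerate: int = 0) -> int:
    """Disks needed for the data components: one set of stripes per replica."""
    replicas = failures_to_tolerate + 1
    return stripe_width * replicas


# A 100 GB vmdk with a stripe width of two lands on two disks, 50 GB each.
print(stripe_components(100, 2))  # [50.0, 50.0]
# Add one failure to tolerate and the mirrored copy needs two more disks.
print(min_capacity_disks(2, 1))   # 4
```

This is why a wide stripe combined with a high failures to tolerate setting can run out of disks quickly on a small cluster.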

The final widget is the “Cannot connect to owner” widget, which is related to networking issues. You can see events here when a switch dies, or during initial cluster configuration if the VLAN settings are incorrect or the vmkernel interface for VSAN traffic is not assigned properly.

Overall I think the VSAN content pack is excellent. It provides a great way to monitor and alert on VSAN related issues, as well as visualize issues that are impacting your environment. If you want to get even deeper into VSAN monitoring with performance metrics, you will need to grab a vRealize Operations Manager deployment and deploy the VSAN content pack for that. You can also package the two products together with vRealize Operations Insight to monitor VSAN performance in full depth and breadth, top to bottom.