vSAN is a wonderful shared storage option in a vSphere cluster, but it requires an administrator with deep product knowledge and overall awareness to be able to manage it with an understanding of its quirks and gotchas. I’ve worked with several vSAN clusters composed of many nodes for a few years now but sometimes it still surprises me. I’ve recently spent a couple of hours troubleshooting a “General vSAN Error” to figure out why I couldn’t put a host in Maintenance Mode. Finally I found out that it was done on purpose. I decided to describe my experience to help others to resolve their vSAN issues.
Usually, if I want to check some scenario as quickly as possible, I use one of the VMware Hands On Labs environments, which I reconfigure just as I need it. This time I used “HOL-2008-01-HCI – vSAN – Getting Started”. It is based on the 6.7 version. I know it’s not a current vSAN version, but it is mature enough to use it for testing. I wanted to check how a three-node cluster would behave if I put one of the nodes in Maintenance Mode choosing “Full data migration” as a data evacuation option. A VM which was run in the cluster used “vSAN Default Storage Policy”. The task quickly failed after it started, with an error message “General vSAN error”. I immediately checked if there was enough storage space left on disks of the remaining nodes and there was. A “CORE-A” VM was consuming just 492.1 MB from almost 60 GB of vSAN datastore. Even if I put one host in Maintenance Mode, it would be enough storage space from the remaining two nodes. I decided to confirm this conclusion, so I opened a SSH session to vCenter Server Appliance (vCSA). I ran these commands:
It showed me which percent of storage space was used per node and how these numbers would change after a simulated failure of 1 node. It didn’t look suspicious.
Next, I checked “Task Console” in vSphere Client to find any clues. A description added to the error message confused me:
“Evacuation precheck failed – Retry operation after adding 1 nodes with each node having 1 GB worth of capacity.” and I ignored it without thinking. I dived into kb.vmware.com to find any clues there.
I quickly found this article: “out of resources” error when entering maintenance mode on vSAN hosts with large vSAN objects (2149615).
This got my attention to vSAN’s clomd service. I decided to check /var/log/clomd.log. I opened a SSH session to an ESXi host and found in last four consecutive lines that decommission operation was started and it changed its state as shown below:
Also, I decided to find if there were any known problems with decommissioning nodes from vSAN clusters. I quickly found another article: “vSAN Host Maintenance Mode is in sync with vSAN Node Decommission State (51464)” and I used this recommended command to check if there were any problems in vSAN database with node decommissioning:
The results showed that values for decomState key were equal to zero. It indicated that there weren’t any problems with background decommission operation which froze.
Then, I decided to find any traces in VMware’s community resources. I easily found that my issue was well known and there were some solutions.
In the post titled “A general system error occurred: Operation failed due to a VSAN error. Another host in the cluster is already entering maintenance mode” I found out that I should try to break any Maintenance Mode entering operations using this command:
In order to put a host into Maintenance Mode I should use this command:
I found it useful, but putting a host into Maintenance Mode without data evacuation wasn’t what I was looking for.
Finally, desperately I decided to search the product documentation to find the answers. And my life got easier from the first hit. In vSAN documentation in the article titled “Place a Member of vSAN Cluster in Maintenance Mode” I found this definition of the available data evacuation options:
Ensure accessibility – “This is the default option. When you power off or remove the host from the cluster, vSAN ensures that all accessible virtual machines on this host remain accessible. Select this option if you want to take the host out of the cluster temporarily, for example, to install upgrades, and plan to have the host back in the cluster. This option is not appropriate if you want to remove the host from the cluster permanently.
Typically, only partial data evacuation is required. However, the virtual machine might no longer be fully compliant to a VM storage policy during evacuation. That means, it might not have access to all its replicas. If a failure occurs while the host is in maintenance mode and the Primary level of failures to tolerate is set to 1, you might experience data loss in the cluster.”
And finally the most important note was this one:
“This is the only evacuation mode available if you are working with a three-host cluster or a vSAN cluster configured with three fault domains.”
The rest of the definitions you can read there, but what I read was the explanation I was looking for.
If you use a three-node vSAN cluster and want to put a host in Maintenance Mode to be able to do any service activities, you don’t have an option to fully protect hosted VMs. It can be done by using at least 4 nodes in the cluster.
Remember folks, the old rule “RTFM” still counts!