
General vSAN Error

vSAN is a wonderful shared storage option for a vSphere cluster, but it requires an administrator with deep product knowledge and an awareness of its quirks and gotchas. I've worked with several multi-node vSAN clusters for a few years now, but sometimes it still surprises me. I recently spent a couple of hours troubleshooting a "General vSAN Error" to figure out why I couldn't put a host into Maintenance Mode, only to find out in the end that this behaviour is by design. I decided to describe my experience to help others resolve their vSAN issues.

Usually, when I want to check a scenario as quickly as possible, I use one of the VMware Hands On Labs environments, which I reconfigure to my needs. This time I used "HOL-2008-01-HCI – vSAN – Getting Started", which is based on vSAN 6.7. I know it's not the current vSAN version, but it is mature enough for testing. I wanted to check how a three-node cluster would behave if I put one of the nodes into Maintenance Mode with "Full data migration" selected as the data evacuation option. The VM running in the cluster used the "vSAN Default Storage Policy". The task failed shortly after it started with the error message "General vSAN error". I immediately checked whether there was enough storage space left on the disks of the remaining nodes, and there was. The "CORE-A" VM was consuming just 492.1 MB of an almost 60 GB vSAN datastore, so even with one host in Maintenance Mode there would be enough storage space on the remaining two nodes. I decided to confirm this conclusion, so I opened an SSH session to the vCenter Server Appliance (vCSA) and ran these commands:

rvc administrator@corp.local@vcsa-01a.corp.local
vsan.whatif_host_failures -s 1/RegionA01/computers/RegionA01-COMP01/

It showed me what percentage of storage space was used per node and how these numbers would change after a simulated failure of one node. Nothing looked suspicious.
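While connected to RVC, it is also worth checking whether any per-host component or disk utilization limits are close to being hit, since that can block a full evacuation even when raw capacity looks fine. A minimal sketch, reusing the same lab inventory path as above:

vsan.check_limits 1/RegionA01/computers/RegionA01-COMP01/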

Next, I checked the "Task Console" in the vSphere Client for clues. The description added to the error message confused me: "Evacuation precheck failed – Retry operation after adding 1 nodes with each node having 1 GB worth of capacity." I ignored it without much thought and dived into kb.vmware.com to look for answers instead.
I quickly found this article: “out of resources” error when entering maintenance mode on vSAN hosts with large vSAN objects (2149615).
This drew my attention to vSAN's clomd service, so I decided to check /var/log/clomd.log. I opened an SSH session to the ESXi host and found, in the last few consecutive lines, that a decommission operation had started and changed its state as shown below:

DECOM_STATE_NONE 
DECOM_STATE_ACTIVE
DECOM_STATE_AUDIT
DECOM_STATE_FAILED
DECOM_STATE_NONE
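If you don't want to scroll through the whole log, a plain grep is enough to pull out these transitions (nothing vSAN-specific here, just filtering the log file mentioned above):

grep -i "DECOM_STATE" /var/log/clomd.log | tail -n 20   # show the most recent decommission state changes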

I also decided to check whether there were any known problems with decommissioning nodes from vSAN clusters. I quickly found another article, "vSAN Host Maintenance Mode is in sync with vSAN Node Decommission State (51464)", and used the command it recommends to check whether the vSAN database reported any node decommissioning problems:

cmmds-tool find -t NODE_DECOM_STATE -f json | grep 'uuid\|decomState'

The results showed that the values of the decomState key were all equal to zero, which indicated there was no stuck background decommission operation.

Then I searched VMware's community resources for any traces of this problem. I quickly found that my issue was well known and that there were some proposed solutions.
In the post titled "A general system error occurred: Operation failed due to a VSAN error. Another host in the cluster is already entering maintenance mode" I found out that I should first cancel any in-progress Maintenance Mode operation using this command:

localcli vsan maintenancemode cancel

Then, to put the host into Maintenance Mode, I should use this command:

localcli system maintenanceMode set -e true -m noAction

I found it useful, but putting a host into Maintenance Mode without data evacuation wasn’t what I was looking for.
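Before retrying, it is also worth confirming what the host itself thinks about its vSAN membership. Running this on the ESXi host prints the local node's cluster membership and state (newer builds also report a maintenance mode state field, if I remember correctly):

esxcli vsan cluster get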

Finally, in desperation, I decided to search the product documentation for answers, and my life got easier with the very first hit. In the vSAN documentation, in the article titled "Place a Member of vSAN Cluster in Maintenance Mode", I found this definition of the available data evacuation options:

Ensure accessibility – “This is the default option. When you power off or remove the host from the cluster, vSAN ensures that all accessible virtual machines on this host remain accessible. Select this option if you want to take the host out of the cluster temporarily, for example, to install upgrades, and plan to have the host back in the cluster. This option is not appropriate if you want to remove the host from the cluster permanently.
Typically, only partial data evacuation is required. However, the virtual machine might no longer be fully compliant to a VM storage policy during evacuation. That means, it might not have access to all its replicas. If a failure occurs while the host is in maintenance mode and the Primary level of failures to tolerate is set to 1, you might experience data loss in the cluster.”

And finally the most important note was this one:

“This is the only evacuation mode available if you are working with a three-host cluster or a vSAN cluster configured with three fault domains.”

You can read the rest of the definitions there, but this was the explanation I had been looking for.

If you use a three-node vSAN cluster and want to put a host into Maintenance Mode for any service activities, you simply don't have an option that keeps the hosted VMs fully protected during the maintenance window. That requires at least four nodes in the cluster.

Remember folks, the old rule “RTFM” still counts!

VSAN real capacity utilization

There are a few caveats that make VSAN capacity calculation and planning tough, and it gets even harder when you try to map the numbers to the real consumption reported at the VSAN datastore level.

  1. VSAN disk objects are thin provisioned by default.
  2. Configuring a full reservation of storage space through the Object Space Reservation rule in a Storage Policy does not mean the disk object blocks will be inflated on the datastore. It only means the space will be reserved and shown as used in the VSAN Datastore Capacity pane, which makes it even harder to figure out why the size of the "files" on this datastore does not match the other capacity-related information.
  3. In order to plan capacity you need to include the overhead of Storage Policies. Policies in the plural, as I haven't met an environment that uses only one policy for all kinds of workloads. This means planning should start with dividing workloads into groups that might require different levels of protection.
  4. Apart from disk objects there are other objects, especially the VM swap object, which are not displayed in the GUI and can easily be forgotten. Depending on the size of the environment, they might consume a considerable amount of storage space.
  5. The VM swap object does not adhere to the Storage Policy assigned to the VM. What does that mean? Even if you configure your VM's disks with PFTT=0, the swap object will always use PFTT=1, unless you configure the advanced option SwapThickProvisionDisabled to disable that behaviour (see the sketch right after this list).
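As a side note, that advanced option is set per host. A minimal sketch with esxcli (the /VSAN/SwapThickProvisionDisabled path is the advanced setting name as I know it; double-check it in your build before applying):

# make vSAN swap objects thin provisioned instead of fully reserved
esxcli system settings advanced set -o /VSAN/SwapThickProvisionDisabled -i 1
# verify the current value
esxcli system settings advanced list -o /VSAN/SwapThickProvisionDisabled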

I made a test to check how much space an empty VM would consume (empty meaning here without even an operating system installed).

To check this, a VM called Prod-01 was created with 1 GB of memory, a 2 GB hard disk and the default storage policy assigned (PFTT=1).

Based on the Edit Settings window, the VM disk size on the datastore is 4 GB (the maximum, based on the disk size and the policy). However, the used storage space is 8 MB, which means there are two replicas of 4 MB each; that is fine, as there is no OS installed at all.

[Screenshot: Edit Settings of the powered-off VM]

However, when you open the datastore file browser and look at the Virtual Disk object, you will notice that its size is 36,864 KB, which gives us 36 MB. So it is neither the 4 GB nor the 8 MB displayed in Edit Settings.

[Screenshot: vSAN datastore files]

Meanwhile, the datastore provisioned space is listed as 5.07 GB.

[Screenshot: VM with a 2 GB disk, default policy and 1 GB RAM – powered off]
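One way to reconcile these numbers is to look at the object layout itself. RVC's vsan.vm_object_info lists every object of a VM together with its components and their physical usage; the connection string and inventory path below are only placeholders, adjust them to your own environment:

rvc administrator@vsphere.local@vcenter.lab.local
vsan.vm_object_info 1/Datacenter/vms/Prod-01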


So let’s power on that VM.

Now the disk size remains intact, but other files appear: a swap object has been created, as well as log and other temporary files.

[Screenshot: datastore files of the powered-on VM]


Looking at the datastore provisioned space now, it shows 5.9 GB. This is again confusing. Even setting aside the previous findings, powering on the VM triggers swap creation, which according to the theory should be protected with PFTT=1 and be thick provisioned. If that were the case, the provisioned storage consumption should have increased by 2 GB, not by 0.83 GB (of which some space is consumed by logs and other small files in the VM Home namespace object).


[Screenshot: VM with a 2 GB disk, default policy and 1 GB RAM – powered on]
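To check what the swap object really looks like, the vSAN objects can also be dumped directly from an ESXi host. esxcli vsan debug object list (available since 6.2) prints every object with its type, policy and physical usage, so the swap object and its reservation show up there too. The output is long, so filtering it is a good idea; the grep pattern below is just my guess at a useful filter:

esxcli vsan debug object list | grep -i -A 5 "vmswap"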

Moreover, during these observations I noticed that while the VM is booting, the provisioned space peaks at 7.11 GB for a very short period of time.

After a few seconds this value decreases back to 5.07 GB, and even after a few reboots these values stay consistent.

[Screenshot: VM with a 2 GB disk, default policy and 1 GB RAM – during boot]

The question is why this information is not consistent, and what happens during the VM boot process that causes the short-lived peak in provisioned space.

That's the next quest to figure out 🙂


VMware Virtual SAN 6.6 what's new


vSAN 6.6 is the 6th generation of the product, and there are more than 20 new features and enhancements in this release, such as:

  • Native encryption for data-at-rest
  • Compliance certifications
  • Resilient management independent of vCenter
  • Degraded Disk Handling v2.0 (DDHv2)
  • Smart repairs and enhanced rebalancing
  • Intelligent rebuilds using partial repairs
  • Certified file service & data protection solutions
  • Stretched clusters with local failure protection
  • Site affinity for stretched clusters
  • 1-click witness change for Stretched Cluster
  • vSAN Management Pack for vRealize
  • Enhanced vSAN SDK and PowerCLI
  • Simple networking with Unicast
  • vSAN Cloud Analytics with real-time support notification and recommendations
  • vSAN Config Assist with 1-click hardware lifecycle management
  • Extended vSAN Health Services
  • vSAN Easy Install with 1-click fixes
  • Up to 50% greater IOPS for all-flash with optimized checksum and dedupe
  • Support for new next-gen workloads
  • vSAN for Photon in Photon Platform 1.1
  • Day 0 support for latest flash technologies
  • Expanded caching tier choice
  • Docker Volume Driver 1.1


… OK, now let's review the main enhancements:

vSAN 6.6 introduces the industry's first native HCI security solution. vSAN now offers data-at-rest encryption that is completely hardware-agnostic. No more concern about someone walking off with a drive, or breaking into a less secure edge IT location and stealing hardware. Encryption is applied at the cluster level, and any data written to a vSAN storage device, at both the cache layer and the persistence layer, can now be fully encrypted. vSAN 6.6 also supports 2-factor authentication, including SecurID and CAC.


Certified file services and data protection solutions are available from 3rd party partners in the VMware Ready for vSAN Program to enable customers to extend and complement their vSAN environment with proven, industry-leading solutions. These solutions provide customers with detailed guidance on how to complement vSAN. (EMC NetWorker is available today, with new solutions coming soon.)


vSAN stretched cluster was released in Q3’15 to provide an Active-Active solution. vSAN 6.6 adds a major new capability that will deliver a highly-available stretched cluster that addresses the highest resiliency requirements of data centers. vSAN 6.6 adds support for local failure protection that can provide resiliency against both site failures and local component failures.


PowerCLI Updates: Full featured vSAN PowerCLI cmdlets enable full automation that includes all the latest features. SDK/API updates also enable enterprise-class automation that brings cloud management flexibility to storage by supporting REST APIs.

The VMware vRealize Operations Management Pack for vSAN, released recently, provides customers with native integration for simplified management and monitoring. The vSAN management pack is specifically designed to accelerate time to production with vSAN, optimize application performance for workloads running on vSAN and provide unified management for the Software-Defined Data Center (SDDC). It provides additional options for monitoring, managing and troubleshooting vSAN along with the end-to-end infrastructure solutions.


Finally, vSAN 6.6 is well suited for next-generation applications. Performance improvements, especially when combined with new flash technologies for write-intensive applications, enable vSAN to address more emerging applications like Big Data. The vSAN team has also tested and released numerous reference architectures for these types of solutions, including Big Data, Splunk and InterSystems Cache.

RESOURCES:

  • Splunk Reference Architecture: http://www.emc.com/collateral/service-overviews/h15699-splunk-vxrail-sg.pdf
  • Citrix XenDestkop/XenApp Blog: https://blogs.vmware.com/virtualblocks/2017/02/27/citrix-xenapp-xendesktop-7-12-vmware-vsan-6-5-flash/
  • vSAN, VxRail and Pivotal Cloud Foundry RA: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/products/vsan/vmware-pcf-vxrail-reference-architeture.pdf
  • vSAN and InterSystems Blog: https://community.intersystems.com/post/intersystems-data-platforms-and-performance-%E2%80%93-part-8-hyper-converged-infrastructure-capacity
  • Intel, vSAN and Big Data Hadoop: https://builders.intel.com/docs/storagebuilders/Hyper-Converged_big_data_using_Hadoop_with_All-Flash_VMware_vSAN.pdf


Virtual SAN Storage performance tests in action

Virtual SAN provides Storage Performance Proactive Tests, which let you check the parameters of your environment in an easy way. You just need a few clicks to run a test.

We can see a lot of test results from nested labs on the Internet; however, there are not that many real-world examples.

I decided to share some results from such a real environment, which consists of 5 ESXi hosts. Each is equipped with 1x 1.92 TB SSD and 5x 4 TB HDDs.

That gives almost 10% of capacity reserved for cache on the SSD. VMware states that the minimum is 10%, so it shouldn't be too bad.

Since Virtual SAN 6.2, released with vSphere 6.0 Update 2, VMware has made performance testing much easier. That's possible thanks to the Storage Performance Proactive Tests, which are available from the Web Client; you just need a few clicks to run a test. Perhaps they aren't the most sophisticated tests, but they are really easy to use.

To start a test, simply go to the Monitor > Virtual SAN > Proactive Tests tab at the VSAN cluster level and click the run test button (the green triangle).

As you will quickly realise, there are a few kinds of tests:

  • Stress Test
  • Low Stress test
  • Basic Sanity Test, focus on Flash cache layer
  • Performance characterization – 100% Read, optimal RC usage
  • Performance characterization – 100% Write, optimal WB usage
  • Performance characterization – 100% Read, optimal RC usage after warm-up
  • Performance characterization – 70/30 read/write mix, realistic, optimal flash cache usage
  • Performance characterization – 70/30 read/write mix, realistic, High I/O Size, optimal flash cache usage
  • Performance characterization – 100% read, Low RC hit rate / All-Flash Demo
  • Performance characterization – 100% streaming reads
  • Performance characterization – 100% streaming writes


Let's start with the multicast performance test of our network. If the received bandwidth is below 75 MB/s, the test is considered failed.

[Screenshot: multicast performance test results]
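If this test fails, a reasonable first check is the vSAN network configuration on each host. This command lists the VMkernel interfaces used by vSAN together with, in multicast mode, the multicast group addresses and ports:

esxcli vsan network list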

Caution! VMware doesn't recommend running these tests on production environments, especially during business hours.

Test number 1 – Low stress test, the duration set to 5 minutes.

[Screenshot: low stress test results]

As we can see, the IOPS count is around 40K for my cluster of five hosts. The average throughput of around 30 MB/s per host gives ca. 155 MB/s in total.

Test number 2 – Stress test, duration set to 5 minutes.

[Screenshot: stress test results]

Here we can see that my VSAN reached about 37K IOPS and almost 260 MB/s of throughput.
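For a more detailed, live view than the proactive test charts, RVC also ships vsan.observer, which can capture and visualise vSAN performance statistics while a test is running. A rough sketch from memory (the connection string and cluster path are placeholders, and it's worth checking the command's help for the exact flags in your RVC version):

rvc administrator@vsphere.local@vcenter.lab.local
vsan.observer 1/Datacenter/computers/Cluster/ --run-webserver --force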

Test number 3 – Basic Sanity Test, focus on Flash cache layer, duration set to 5 minutes.

[Screenshot: basic sanity test results, focus on flash cache layer]

Test number 4 – 100% read, optimal RC usage, duration set to 5 minutes.

[Screenshot: 100% read, optimal RC usage results]

Here we can see how the SSD performs while it serves most of the reads.

Test number 5 – 100% read, optimal RC usage after warm-up.

[Screenshot: 100% read, optimal RC usage after warm-up results]

Test number 6 – 100% write, optimal WB usage.

[Screenshot: 100% write, optimal WB usage results]


If you have any real-world results from your own VSAN, I'd be glad to see them and compare different configurations.