Author: Radek

Interested in virtualization architecture and technology. VCP since 2006. VMware Certified Instructor. Former NetApp Certified Instructor. I teach people how vSphere works, how it has been designed, what its features are, and how to use them to achieve what they want.

General vSAN Error

vSAN is a wonderful shared storage option in a vSphere cluster, but it requires an administrator with deep product knowledge and overall awareness to manage it with an understanding of its quirks and gotchas. I’ve worked with several vSAN clusters composed of many nodes for a few years now, but it still surprises me sometimes. I recently spent a couple of hours troubleshooting a “General vSAN error” to figure out why I couldn’t put a host into Maintenance Mode. Eventually I found out that the behavior is by design. I decided to describe my experience to help others resolve their vSAN issues.

Usually, if I want to check a scenario as quickly as possible, I use one of the VMware Hands On Labs environments, which I reconfigure just as I need it. This time I used “HOL-2008-01-HCI – vSAN – Getting Started”, which is based on version 6.7. I know it’s not the current vSAN version, but it is mature enough for testing. I wanted to check how a three-node cluster would behave if I put one of the nodes into Maintenance Mode, choosing “Full data migration” as the data evacuation option. The VM running in the cluster used the “vSAN Default Storage Policy”. The task failed quickly after it started, with the error message “General vSAN error”. I immediately checked whether there was enough storage space left on the disks of the remaining nodes, and there was. The “CORE-A” VM was consuming just 492.1 MB out of almost 60 GB of vSAN datastore. Even with one host in Maintenance Mode, the remaining two nodes would have enough storage space. I decided to confirm this conclusion, so I opened an SSH session to the vCenter Server Appliance (vCSA) and ran these commands:

rvc administrator@corp.local@vcsa-01a.corp.local
vsan.whatif_host_failures -s 1/RegionA01/computers/RegionA01-COMP01/

It showed me what percentage of storage space was used per node and how those numbers would change after a simulated failure of one node. Nothing looked suspicious.

Next, I checked the Task Console in the vSphere Client for any clues. The description added to the error message – “Evacuation precheck failed – Retry operation after adding 1 nodes with each node having 1 GB worth of capacity.” – confused me, and I ignored it without thinking. Instead, I dove into kb.vmware.com to look for clues there.
I quickly found this article: “out of resources” error when entering maintenance mode on vSAN hosts with large vSAN objects (2149615).
It drew my attention to vSAN’s clomd service, so I decided to check /var/log/clomd.log. I opened an SSH session to an ESXi host and found in the last consecutive entries that a decommission operation had started and changed its state as shown below:

DECOM_STATE_NONE 
DECOM_STATE_ACTIVE
DECOM_STATE_AUDIT
DECOM_STATE_FAILED
DECOM_STATE_NONE

I also decided to check whether there were any known problems with decommissioning nodes from vSAN clusters. I quickly found another article, “vSAN Host Maintenance Mode is in sync with vSAN Node Decommission State (51464)”, and used the command it recommends to check whether there were any node decommissioning problems recorded in the vSAN database:

cmmds-tool find -t NODE_DECOM_STATE -f json | grep 'uuid\|decomState'

The results showed that the values for the decomState key were equal to zero, which indicated that no background decommission operation had frozen.

Then I decided to look for any traces in VMware’s community resources. I easily found that my issue was well known and that there were some solutions.
In the post titled “A general system error occurred: Operation failed due to a VSAN error. Another host in the cluster is already entering maintenance mode” I found that I should try to cancel any in-progress Maintenance Mode operations using this command:

localcli vsan maintenancemode cancel

In order to put a host into Maintenance Mode I should use this command:

localcli system maintenanceMode set -e true -m noAction

I found it useful, but putting a host into Maintenance Mode without data evacuation wasn’t what I was looking for.

Finally, in desperation, I decided to search the product documentation for answers. And my life got easier with the very first hit. In the vSAN documentation, in the article titled “Place a Member of vSAN Cluster in Maintenance Mode”, I found this definition of the available data evacuation options:

Ensure accessibility – “This is the default option. When you power off or remove the host from the cluster, vSAN ensures that all accessible virtual machines on this host remain accessible. Select this option if you want to take the host out of the cluster temporarily, for example, to install upgrades, and plan to have the host back in the cluster. This option is not appropriate if you want to remove the host from the cluster permanently.
Typically, only partial data evacuation is required. However, the virtual machine might no longer be fully compliant to a VM storage policy during evacuation. That means, it might not have access to all its replicas. If a failure occurs while the host is in maintenance mode and the Primary level of failures to tolerate is set to 1, you might experience data loss in the cluster.”

And finally the most important note was this one:

“This is the only evacuation mode available if you are working with a three-host cluster or a vSAN cluster configured with three fault domains.”

You can read the rest of the definitions there, but this was the explanation I was looking for.

If you use a three-node vSAN cluster and want to put a host into Maintenance Mode to carry out any service activities, you have no option that fully protects the hosted VMs during the maintenance window. Full protection requires at least four nodes in the cluster.
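
For completeness, the CLI equivalent on a three-node cluster uses the only evacuation mode that can actually complete there. This is just a sketch, reusing the syntax of the command shown earlier with a different vSAN mode:

localcli system maintenanceMode set -e true -m ensureObjectAccessibility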

Remember folks, the old rule “RTFM” still counts!

vRA 7.x Snapshotting using PowerCLI Script

Recently, I was looking for a scripted method for cold snapshotting of vRA in an enterprise deployment. First, I wanted to confirm the shutdown order. VMware’s product documentation is quite limited on this topic, but it does describe the right shutdown order. Looking further, I found a better explanation here, where the documentation talks about the vRA backup order. Of course, I wanted to check the proper startup procedure as well. I found it quickly here.

Well equipped with all the information needed, I decided to write a script. I usually use PowerCLI for tasks related to VMware software. I was sure that someone had had the same idea before me, so in order not to reinvent the wheel I checked what Google would find.
I found a few elegant approaches, but one was the most interesting and inspiring:
Distributed vRealize Automation 7.x Orchestrated Shutdown, Snapshot and Startup using PowerCLI
Razz wrote it in a way I could use in my environment. I adjusted the script to my needs and it worked quite well. Of course, there is a lot to polish, but it works for me.

I decided to share Razz’s script adjusted by me. Maybe it will help someone with their administration tasks related to vRA.


$vCSA = ""
$snapName = ""
$snapDescription = ""
$log = ""

# vRA components
$proxy = @() 
$worker = @() 
$activeMgr = ""
$passiveMgr = @()
$primaryWeb = ""
$secondaryWeb = @()
$masterVRA = ""
$replicaVRA = @()
$dbServers = @()
$vRB = ""
$vRO = @()
$allVMs = @()

# Log file
$log = "coldvRASnapshots.log"

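# Initiates a guest OS shutdown for every VM in $vms, then waits until all of them report PoweredOff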
function shutdownVMandWait($vms,$log) {
    foreach ($vmName in $vms) {
        try {
            $vm = Get-VM -Name $vmName -ErrorAction Stop
            foreach ($o in $vm) {
                    if (($o.PowerState) -eq "PoweredOn") {
                        $v = Shutdown-VMGuest -VM $o -Confirm:$false
                        Write-Host "Shutdown VM: '$($v.VM)' was issued"
                        Add-Content -Path $log -Value "$($v)"
                    } else {
                        Write-Host "VM '$($vmName)' is not powered on!"
                    }
            }   
        } catch {
            Write-Host "VM '$($vmName)' not found!"
        }
    }
    foreach ($vmName in $vms) {
        try {
            $vm = Get-VM -Name $vmName -ErrorAction Stop
            while($vm.PowerState -eq 'PoweredOn') { 
                sleep 5
                Write-Host "VM '$($vmName)' is still on..."
                $vm = Get-VM -Name $vmName
            }
            Write-Host "VM '$($vmName)' is off!"
        } catch {
            Write-Host "VM '$($vmName)' not found!"
        }
    }
}

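# Takes a snapshot named $snapName with description $snapDescription on every VM in $vms; failures are written to $log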
function snapshotVM($vms,$snapName,$snapDescription,$log) {
    foreach ($vmName in $vms) {
        try {
            $vm = Get-VM -Name $vmName -ErrorAction Stop
        } catch {
            Write-Host "VM '$($vmName)' not found!"
            Add-Content -Path $log -Value "VM '$($vmName)' not found!"
            # skip the snapshot attempt for a VM that was not found
            continue
        }
        try {
            foreach ($o in $vm) {
                    New-Snapshot -VM $o -Name $snapName -Description $snapDescription -ErrorAction Stop   
            }
        } catch {
            Write-Host "Could not snapshot '$($vmName)' !"
            Add-Content -Path $log -Value "Could not snapshot '$($vmName)' !"
    
        }
    }
}

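# Powers on (asynchronously) every VM in $vms that is currently powered off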
function startupVM($vms,$log) {
    foreach ($vmName in $vms) {
        try {
            $vm = Get-VM -Name $vmName -ErrorAction Stop
            foreach ($o in $vm) {
                if (($o.PowerState) -eq "PoweredOff") {
                    Start-VM -VM $o -Confirm:$false -RunAsync
                } else {
                    Write-Host "VM '$($vmName)' is not powered off!"
                }
            }
        } catch {
            Write-Host "VM '$($vmName)' not found!"
        }
    } 
}

# vCenter Server FQDN
$vCSA = Read-Host -Prompt "Enter vCenter's FQDN"

# Connect vCenter Server
$creds = Get-Credential

Connect-VIServer $vCSA -Credential $creds -ErrorAction Stop

# Get VM names
$proxy = @(((Read-host -Prompt "Enter comma separated names of Proxy Agent VMs").Split(",")).Trim()) 
$worker = @(((Read-host -Prompt "Enter comma separated names of DEM worker VMs").Split(",")).Trim()) 
$activeMgr = Read-host -Prompt "First, check which VM is a Primary Manager and then enter its name"
$passiveMgr = @(((Read-host -Prompt "Enter comma separated names of Secondary Manager VMs").Split(",")).Trim())
$primaryWeb = Read-host -Prompt "First, check which VM is a Primary Web Server and then enter its name"
$secondaryWeb = @(((Read-host -Prompt "Enter comma separated names of Secondary Web VMs").Split(",")).Trim())
$masterVRA = Read-host -Prompt "Enter a name of Master vRA Node VM"
$replicaVRA = @(((Read-host -Prompt "Enter comma separated names of Replica vRA Node VMs").Split(",")).Trim())
$dbServers = @(((Read-host -Prompt "Shutdown MSSQL AlwaysOn Cluster first, then enter comma separated names of DB Cluster Node VMs").Split(",")).Trim())

<# ### Uncomment all commented blocks of code if you have vRB or external vRO instances in your environment
$vRB = Read-host -Prompt "Enter a name of vRB VM"
$vRO = @(((Read-host -Prompt "Enter comma separated names of external vRO VMs").Split(",")).Trim())
#>

$allVMs = @($proxy, $worker, $passiveMgr, $activeMgr, $secondaryWeb, $primaryWeb, $replicaVRA, $masterVRA, $dbServers)

# Snapshot definition
$snapName = Read-Host -Prompt "Enter Snapshot Name"
$snapDescription = Read-Host -Prompt "Enter Snapshot Description"

# Shutting down vRA VMs
foreach ($vmName in $allVMs) {
    foreach ($vm in $vmName) {
        if ($vm) {
            Write-Host "### Shutting down " + $vm
            shutdownVMandWait -vms $vm -log $log
        } else {
            Write-Host "VM '$($vm)' doesn't exist!"
        }
    }   
}

# Snapshotting vRA VMs
foreach ($vmName in $allVMs) {
    foreach ($vm in $vmName) {
        if ($vm) {
            Write-Host "### Taking snapshot of " + $vm
            snapshotVM -vms $vm -snapName $snapName -snapDescription $snapDescription -log $log
        } else {
            Write-Host "VM '$($vm)' doesn't exist!"
        }
    }   
}

# Starting vRA VMs
<#
Write-Host "### Starting vROs"
startupVM -vms $vRO -log $log

Write-Host "### Starting vRB"
startupVM -vms $vRB -log $log
#>

Write-Host "### Starting DB Servers"
startupVM -vms $dbServers -log $log
Write-Host  " Sleeping 5 minutes until db is up"
Start-Sleep -s 300

Write-Host "### Starting primary VRA"
startupVM -vms $masterVRA -log $log
Write-Host  " Sleeping 5 minutes until Licensing service is registered"
Start-Sleep -s 300

Write-Host "### Starting secondary VRA"
startupVM -vms $replicaVRA -log $log
Write-Host  " Sleeping 15 minutes until ALL services are registered"
Start-Sleep -s 900

Write-Host "### Starting Web"
startupVM -vms $primaryWeb -log $log
startupVM -vms $secondaryWeb -log $log
Write-Host  " Sleeping 5 minutes until services are up"
Start-Sleep -s 300

Write-Host "### Starting Primary manager"
startupVM -vms $activeMgr -log $log
Write-Host  " Sleeping 3 minutes until manager is up"
Start-Sleep -s 180

Write-Host "### Starting Secondary manager"
startupVM -vms $passiveMgr -log $log
Write-Host  " Sleeping 3 minutes until manager is up"
Start-Sleep -s 180

Write-Host "### Starting DEM workers"
startupVM -vms $worker -log $log

Write-Host "### Starting Proxy Agents"
startupVM -vms $proxy -log $log

Write-Host "### All components have been started"

# Disconnect vCenter 
Disconnect-VIServer -Server $vCSA -Confirm:$false 

Getting Support Log Bundle using REST API

Starting with vSphere 7 U1, we have a new tool to generate and download a support log bundle from vCenter Server: a REST API designed for these tasks. It broadens the already wide range of methods for gathering a vm-support log bundle and has several interesting features.

First, it works even if the vCenter Server services are offline, as long as the vCSA management interface is up and running.

Second, once started, it generates a support bundle and stores it on the vCSA disk. The bundle is deleted after 30 minutes. As you know, even in a small vSphere environment downloading a log bundle can be time consuming; therefore, if a download of the generated bundle is in progress, the deletion is postponed for another 30 minutes.

To be able to use this new method, a user authenticated in SSO must be a member of a new SSO group, SystemConfiguration.SupportUsers. Users belonging to this group are entitled only to call the support bundle REST API; they don’t have any other privileges in the environment. The SSO Administrator is a member of this group by default.

Now, let’s play with the API calls.

To enumerate every component that can be gathered by the support bundle REST API, you can use the following GET call:

GET https://vcsa_fqdn:5480/rest/appliance/support-bundle/components
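
If you prefer a quick test from a shell, the same call can be made with curl. This is only a sketch: it assumes the appliance accepts HTTP basic authentication for this endpoint, and -k is used because of a lab self-signed certificate.

curl -k -u 'administrator@vsphere.local' "https://vcsa_fqdn:5480/rest/appliance/support-bundle/components"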

To generate a bundle you can use this POST statement:

POST https://vcsa_fqdn:5480/rest/appliance/support-bundle?action=create&vmw-task=True

This statement takes a few parameters (a curl sketch putting them together follows this list):

  • Description – A text description of the started task
  • Components (optional) – You can provide a list of the previously listed components. If you leave this parameter empty, logs from all components will be gathered.
  • Partition (optional) – You can define the place where the generated log bundle will be stored, for example /storage/core. If you leave it blank, the default storage location (/storage/log) will be used.
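
A minimal curl sketch of the create call could look like the one below. Note the assumptions: basic authentication as above, and JSON property names simply derived from the parameters listed here – I have not verified them against the API schema, so treat them as placeholders and check the API Explorer on your vCSA.

# the property names in the body are assumptions, not verified against the API schema
curl -k -u 'administrator@vsphere.local' -X POST -H "Content-Type: application/json" -d '{"description": "Bundle before upgrade", "partition": "/storage/core"}' "https://vcsa_fqdn:5480/rest/appliance/support-bundle?action=create&vmw-task=True"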

This statement returns a task ID, which you can use to check the status of the task.

To return the status of a task you have started, issue this command providing a task ID at the end:

GET https://vcsa_fqdn:5480/rest/cis/tasks/<task-id>

If you need more detailed information, use this command to return the complete information about the generation task:

GET https://vcsa_fqdn:5480/rest/appliance/support-bundle

You will get all the information (description, status, generation time, expiration time, bundle size) including a URL, which you can use to download the generated support bundle.
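
Downloading the bundle is then just another authenticated GET. A minimal curl sketch, where <download-url> stands for the URL returned by the previous call (same authentication assumption as above):

curl -k -u 'administrator@vsphere.local' -o vcsa-support-bundle.tgz "<download-url>"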

If the support bundle REST API fails, you will be informed with an error message providing information about a failure reason.

I recommend using Postman to issue these calls, because you can prepare an Environment and a Collection, named for example “vCSA Support Bundle”, where you can store the commands mentioned above. You can use curl as well, if it’s your favourite multi-tool.

Part 1 – PVRDMA and how to test it in home lab.

One of the members of the VMware User Community (VMTN) inspired me to build a configuration where two VMs use PVRDMA network adapters to communicate. The goal I wanted to achieve was to establish communication between the VMs without using Host Channel Adapter (HCA) cards installed in the hosts. It’s possible to configure this as stated here, in the VMware vSphere documentation:

For virtual machines on the same ESXi hosts or virtual machines using the TCP-based fallback, the HCA is not required.

For this task, I prepared one ESXi host (6.7 U1) managed by a vCSA (6.7 U1). One of the requirements for PVRDMA is a vSphere Distributed Switch (vDS), so first I configured a dedicated vDS for RDMA communication. I simply set up a basic vDS configuration (DSwitch-DVUplinks-34) with a default portgroup (DPortGroup) and then equipped it with just one uplink.

In vSphere, a virtual machine can use a PVRDMA network adapter to communicate with other virtual machines that have PVRDMA devices. The virtual machines must be connected to the same vSphere Distributed Switch.

Second, I created a VMkernel port (vmk1) dedicated to RDMA traffic in this DPortGroup, without assigning an IP address to it (No IPv4 settings).

Third, I set the advanced system setting Net.PVRDMAVmknic on the ESXi host and gave it a value pointing to the VMkernel port (vmk1), as described in “Tag a VMkernel Adapter for PVRDMA”.

Then, I enabled the “pvrdma” firewall rule on the host in the Edit Security Profile window, as described in “Enable the Firewall Rule for PVRDMA”.
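
For reference, both host-side settings can also be applied from the ESXi shell. A sketch of the equivalent commands, with vmk1 being the VMkernel port created above:

esxcli system settings advanced set -o /Net/PVRDMAVmknic -s vmk1
esxcli network firewall ruleset set -e true -r pvrdma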

The next steps are related to the configuration of the VMs. First, I created a new virtual machine. Then, I added another network adapter to it and connected it to the DPortGroup on the vDS. For the Adapter Type of this network adapter I chose PVRDMA, with Device Protocol RoCE v2, as described in “Assign a PVRDMA Adapter to a Virtual Machine”.

Then, I installed Fedora 29 on the first VM. I chose it because there are many tools for easily testing communication over RDMA. After the OS installation, another network interface showed up in the VM, and I gave it an address in a different IP subnet. I used two network interfaces in each VM: the first for SSH access and the second to test RDMA communication.

Then I set “Reserve all guest memory (All locked)” in the VM’s Edit Settings window.

The VMs were now configured sufficiently – at the infrastructure layer – to communicate using RDMA.

To test it, I had to install the appropriate tools. I found them on GitHub, here, and installed them using the procedure described on that page. First, the build dependencies:

dnf install cmake gcc libnl3-devel libudev-devel pkgconfig valgrind-devel ninja-build python3-devel python3-Cython	

Next, I installed the git client using the following command.

yum install git

Then I cloned the git project to a local directory.

mkdir /home/rdma
git clone https://github.com/linux-rdma/rdma-core.git /home/rdma

I built it.

cd /home/rdma
bash build.sh

Afterwards, I cloned the VM to have a communication partner for the first one. After cloning, I reconfigured the IP address in the cloned VM appropriately.

Finally I could test the communication using RDMA.

On the VM that functioned as a server, I ran a listener service on the interface mapped to the PVRDMA virtual adapter:

cd /home/rdma/build/bin
./rping -s -a 192.168.0.200 -P

I ran this command, which allowed me to connect the client VM to the server VM:

./rping -c -I 192.168.0.100 -a 192.168.0.200 -v

It was working beautifully!


dcli and how to shutdown vCSA

Sometimes you want to shut down a vCSA or PSC gracefully, but you don’t have access to the GUI through the vSphere Client or VAMI.

How do you do it from the CLI? I’m going to show you right now using dcli, because I’m exploring the potential of this tool and I can’t get enough.

  1. Open an SSH session to vCSA and log in as root user.
  2. Run dcli command in an interactive mode.

    dcli +i
  3. Use the shutdown API call to power off the appliance, giving a delay value (0 means now) and a description of this task.

    com vmware appliance shutdown poweroff --delay 0 --reason 'Shutdown now'
  4. Enter an appropriate administrator user name, e.g. administrator@vsphere.local, and a password.
  5. Decide whether you want to save the credentials in the credstore. You can enter ‘y’ for yes.

Wait until the appliance goes down. 🙂

I know that it’s not the quickest way, but the point is to have fun.

VCSA Tools – Part 1 – journalctl. Better way for vCSA log revision.

There are plenty of great CLI tools in the VCSA that a modern vSphere administrator should know, so I decided to share my knowledge and describe them in a series of articles.

The first one is journalctl, a tool that simplifies and speeds up the VCSA troubleshooting process.

Below I present how I use it to filter log records.

Log in to the VCSA shell and run the commands below, depending on the result you want to achieve.

The logs from the current boot:

journalctl -b


The boots that journald is aware of:

journalctl --list-boots

It will show results like these, with one line representing each known boot:

-3 26c7f0e356eb4d0bbd8d9fad1b457808 Wed 2019-01-09 15:18:12 UTC—Wed 2019-01-09 16:58:18 UTC
-2 124d73cf46d441908db7609813e0c49a Wed 2019-01-09 17:01:08 UTC—Fri 2019-01-11 10:22:39 UTC
-1 ebf1ba1936404fb086727b61ac825d47 Fri 2019-01-11 11:22:59 UTC—Fri 2019-01-11 15:48:37 UTC
 0 bd70a6d99a7245dc9b7859c5f2a7ef6f Mon 2019-01-14 10:19:47 UTC—Tue 2019-01-15 12:38:16 UTC

Now you can use the offset to display the results of a chosen boot:

journalctl -b -1


To limit the results to the given timeframe, use dates or keywords:

journalctl --since "2019-01-10" --until "2019-01-14 03:00"

or:

journalctl --since yesterday

or:

journalctl --since 09:00 --until "2 hour ago"


Limiting the logs by service name:

Get the service names first:

systemctl list-unit-files | grep -i vmware

To show only records for the vpxd service in the current boot:

journalctl -b -u vmware-vpxd.service

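To follow a service’s log in real time (similar to tail -f), add the -f switch:

journalctl -u vmware-vpxd.service -f
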

Filtering by Id (Process, User or Group):

Get the user Id, e.g.:

cat /etc/passwd | grep -i updatemgr

then:

journalctl _UID=1017 --since today


Filtering by the binary path:

journalctl -b --since "20 minute ago" /usr/sbin/vpxd


Displaying kernel messages:

journalctl -k


Limiting messages by their priorities:

journalctl -p err -b

where the priority codes are:

0: emerg
1: alert
2: crit
3: err
4: warning
5: notice
6: info
7: debug
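
These filters can be combined. For example, to show only warnings and more severe messages from the vpxd service in the current boot:

journalctl -b -p warning -u vmware-vpxd.service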

Of course, these examples don’t exhaust all the possibilities this tool offers, but consider them a starting point. Feel free to add new, useful examples in the comments below.

That’s all for today!

dcli and orphaned VMs in vCenter Server inventory

Orphaned VMs in the vCenter inventory are an unusual sight in an experienced administrator’s Web/vSphere Client window. But in large environments, where many people manage hosts and VMs, it happens from time to time.

You probably know how to get rid of them using the traditional methods described in VMware KB articles and by other well-known bloggers, but there’s a quite elegant new method using dcli.
This handy tool is available in the vCLI package, in the 6.5/6.7 vCSA shell, and at the vCenter Server on Windows command prompt. dcli uses APIs to give an administrator an interface for calling methods to complete or automate some tasks.

How to use it to remove orphaned VMs from vCenter inventory?

  1. Open an SSH session to vCSA and log in as root user.
  2. Run dcli command in an interactive mode.

    dcli +i
  3. Get a list of VMs registered in vCenter’s inventory. Log in as administrator user in your SSO domain. You can save credentials in the credstore for future use.

    com vmware vcenter vm list
  4. From the displayed list, get the MoID (Managed Object Id) of the affected VM, e.g. vm-103.
  5. Run this command to delete the record of the affected VM using its MoID from vCenter’s database.

    com vmware vcenter vm delete --vm vm-103
  6. Using the Web/vSphere Client, check the vCenter inventory to verify that the affected VM has been deleted.

It’s working!

Part 2 – How to list vSwitch “MAC Address table” on ESXi host?

Another way to list the MAC addresses of open ports on vSwitches on an ESXi host is based on the net-stats tool.

Use this one-liner.

for VSWITCH in $(vsish -e ls /net/portsets/ | cut -c 1-8); do net-stats -S $VSWITCH | grep \{\"name | sed 's/[{,"]//g' | awk '{$9=$10=$11=$12=""; print $0}'; done
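
For readability, here is the same one-liner broken across lines, with comments describing each stage:

# list the portsets (vSwitches) known to the VMkernel, keeping the first 8 characters of each name
for VSWITCH in $(vsish -e ls /net/portsets/ | cut -c 1-8); do
    # dump per-port stats for the switch, keep only the lines containing the port name,
    # strip the JSON punctuation and blank out a few trailing columns
    net-stats -S $VSWITCH | grep \{\"name | sed 's/[{,"]//g' | awk '{$9=$10=$11=$12=""; print $0}'
done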

This is not the final word. 🙂