A wrap-up of my own VCDX Journey #291

This is a post I meant to write two months ago as a summary of 2020, because it's important to wrap up the long time I spent preparing for the VCDX and also to say thank you. However, due to other tasks, engagements, etc., I only managed to complete it now!

So first things first – after a very long and extremely informative journey, on 15 December 2020 I finally received probably the best e-mail ever.

Hi Pawel,
Congratulations! You passed! It gives me great pleasure to welcome you to the VMware Certified Design Expert community.
Your VCDX number is 291.

Initially I was going to write a long story here about my VCDX journey, but then I decided to just make a few points that can be treated as advice and that hopefully might help someone.

  1.  Don’t blindly trust everything you read on the internet about the VCDX process -> it’s like an assumption, and if you have already started the process then you know what’s wrong with assumptions 😉
    -> E.g. for me it started back in 2017, after passing both VCAPs, when I decided to set the next goal to be the VCDX.
    Along with my colleague we had just completed a very good project which at that time seemed to be a perfect fit for the VCDX. I had also read some stories about VCDX journeys taking 150-200 hours in total to prepare for the exam. That seemed like almost nothing after the dozens of overtime hours spent on the project we had just completed to deliver it on time. So the decision was made – 200 hours is not much after all. We took the docs we had prepared for the customer, translated them into English, submitted them, and sure enough we were invited to the defense, during which we immediately and brutally realized we were not even close to making the cut.
  2.  Don’t try to be a hidden hero – brute force doesn’t work here.
    -> At the beginning I completely ignored the community aspect. I wanted to make it on my own and then just shine with the number. I didn’t participate in even a single mock session, didn’t have a mentor, nothing. I had read lots of blog posts stressing the importance of it, but I just ignored them. Everything I knew about the defense came from blogs, without any real experience from mocks. I worked with customers at that time, so I believed that was enough. It was wrong, though. If you plan to go that way – just don’t. The VCDX community is amazing, and you don’t need to be an expert in each area. The community can help you – same with peers at work. If there is someone who can explain some networking, storage, backup or other aspect to you – just ask for help. There is no need to re-invent the wheel! And this is still an exam with a formula that you must be aware of.
  3.  Be realistic and honest with yourself about the time and schedule.
    -> The truth is that my peer and I went to the first defense completely unprepared. It was back in 2017, and our defense was scheduled just 2 days after my own wedding. It was definitely wrong. Consequently, we started working on our slide deck for the defense in the hotel room the evening before, right after landing in the UK. It was just an unnecessary waste of money, of our time as well as the panelists’ time, and not good for life balance either!

It was a long journey and I could find many more examples like those, but in my opinion these were the key mistakes I made.
However, there is always a positive side. In reality, thanks to those failures (I know it’s a cliché 🙂) it was a huge learning experience for me, which helped me not only to eventually pass the exam but, most importantly, to learn the methodology, how to be methodical, soft skills, and how to be comfortable in uncomfortable situations. That’s something I can use right now on a daily basis while working with customers.
Kudos

I also owe a public kudos to everyone I met who directly or indirectly helped me during this long journey! (The following list is in alphabetical order by first name :)).

Abdullah Abdullah
Asaf Blubshtein
Bilal Ahmed
Chris Noon
Chris Porter
Daniel Zuthof
Fouad El Akkad
Gerard Murphy
Gregg Robertson
Igor Zecevic
Inder S
Jason Grierson
Faisal Rahman
Miroslaw Karwasz
Paul Cradduck
Paul McSharry (before becoming a panelist)
Paul Meehan
Pawel Omiotek
Phoebe Kim
Rene van den Bedem
Shady ElMalatawey
Szymon Ziółkowski
Wesley Geelhoed

Everyone who attended my mocks to provide feedback and whom I forgot to mention above!

+ All panelists of course 🙂

Hope I didn’t miss anyone! If I did – sorry, it wasn’t intentional. I do appreciate all the help and hints that helped me expand my knowledge and develop.

General vSAN Error

vSAN is a wonderful shared storage option for a vSphere cluster, but it requires an administrator with deep product knowledge and overall awareness to manage it with an understanding of its quirks and gotchas. I’ve worked with several vSAN clusters composed of many nodes for a few years now, but sometimes it still surprises me. I recently spent a couple of hours troubleshooting a “General vSAN Error” to figure out why I couldn’t put a host into Maintenance Mode. Finally I found out that it works that way by design. I decided to describe my experience to help others resolve their vSAN issues.

Usually, if I want to check some scenario as quickly as possible, I use one of the VMware Hands On Labs environments, which I reconfigure just as I need. This time I used “HOL-2008-01-HCI – vSAN – Getting Started”, which is based on version 6.7. I know it’s not the current vSAN version, but it is mature enough for testing. I wanted to check how a three-node cluster would behave if I put one of the nodes into Maintenance Mode, choosing “Full data migration” as the data evacuation option. A VM running in the cluster used the “vSAN Default Storage Policy”. The task failed quickly after it started, with the error message “General vSAN error”. I immediately checked whether there was enough storage space left on the disks of the remaining nodes, and there was. The “CORE-A” VM was consuming just 492.1 MB of an almost 60 GB vSAN datastore. Even with one host in Maintenance Mode, there would be enough storage space on the remaining two nodes. I decided to confirm this conclusion, so I opened an SSH session to the vCenter Server Appliance (vCSA) and ran these commands:

rvc administrator@corp.local@vcsa-01a.corp.local
vsan.whatif_host_failures -s 1/RegionA01/computers/RegionA01-COMP01/

It showed me what percentage of storage space was used per node and how these numbers would change after a simulated failure of one node. Nothing looked suspicious.

Next, I checked the “Task Console” in the vSphere Client to find any clues. The description added to the error message confused me – “Evacuation precheck failed – Retry operation after adding 1 nodes with each node having 1 GB worth of capacity.” – and I ignored it without thinking. I dived into kb.vmware.com to search for clues there.
I quickly found this article: “out of resources” error when entering maintenance mode on vSAN hosts with large vSAN objects (2149615).
This got my attention to vSAN’s clomd service, so I decided to check /var/log/clomd.log. I opened an SSH session to an ESXi host and found in the last few consecutive entries that a decommission operation had started and changed its state as shown below:

DECOM_STATE_NONE 
DECOM_STATE_ACTIVE
DECOM_STATE_AUDIT
DECOM_STATE_FAILED
DECOM_STATE_NONE

Also, I decided to check whether there were any known problems with decommissioning nodes from vSAN clusters. I quickly found another article, “vSAN Host Maintenance Mode is in sync with vSAN Node Decommission State (51464)”, and used the command it recommends to check whether there were any node-decommissioning problems in the vSAN database:

cmmds-tool find -t NODE_DECOM_STATE -f json | grep 'uuid\|decomState'

The results showed that the values for the decomState key were equal to zero, which indicated there was no background decommission operation stuck in progress.

Then I decided to look for traces in VMware’s community resources, and I quickly found that my issue was well known and had some solutions.
In the post titled “A general system error occurred: Operation failed due to a VSAN error. Another host in the cluster is already entering maintenance mode” I found out that I should try to cancel any pending Maintenance Mode operations using this command:

localcli vsan maintenancemode cancel

And in order to put a host into Maintenance Mode, it suggested using this command:

localcli system maintenanceMode set -e true -m noAction

I found it useful, but putting a host into Maintenance Mode without data evacuation wasn’t what I was looking for.

Finally, in desperation, I decided to search the product documentation for the answers. And my life got easier with the first hit. In the vSAN documentation, in the article titled “Place a Member of vSAN Cluster in Maintenance Mode”, I found this definition of the available data evacuation options:

Ensure accessibility – “This is the default option. When you power off or remove the host from the cluster, vSAN ensures that all accessible virtual machines on this host remain accessible. Select this option if you want to take the host out of the cluster temporarily, for example, to install upgrades, and plan to have the host back in the cluster. This option is not appropriate if you want to remove the host from the cluster permanently.
Typically, only partial data evacuation is required. However, the virtual machine might no longer be fully compliant to a VM storage policy during evacuation. That means, it might not have access to all its replicas. If a failure occurs while the host is in maintenance mode and the Primary level of failures to tolerate is set to 1, you might experience data loss in the cluster.”

And finally the most important note was this one:

“This is the only evacuation mode available if you are working with a three-host cluster or a vSAN cluster configured with three fault domains.”

You can read the rest of the definitions there, but what I quoted was the explanation I was looking for.

If you use a three-node vSAN cluster and want to put a host into Maintenance Mode to do any service activities, you don’t have an option to fully protect the hosted VMs. That requires at least 4 nodes in the cluster.
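
For a three-node cluster, the practical consequence is to use “Ensure accessibility” instead. A minimal sketch of doing the same from the host’s CLI, assuming an ESXi 6.7 host (the vSAN decommission modes here are ensureObjectAccessibility, evacuateAllData and noAction):

# Enter Maintenance Mode while keeping vSAN objects accessible (partial evacuation only)
esxcli system maintenanceMode set -e true -m ensureObjectAccessibility

# Exit Maintenance Mode once the service activities are done
esxcli system maintenanceMode set -e false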

Remember folks, the old rule “RTFM” still counts!

Part 1 – PVRDMA and how to test it in a home lab.

One of the members of the VMware User Community (VMTN) inspired me to build a configuration in which two VMs use PVRDMA network adapters to communicate. The goal I wanted to achieve was to establish communication between the VMs without using Host Channel Adapter (HCA) cards installed in the hosts. It’s possible to configure this, as stated here in the VMware vSphere documentation:

For virtual machines on the same ESXi hosts or virtual machines using the TCP-based fallback, the HCA is not required.

For this task, I prepared one ESXi host (6.7U1) managed by a vCSA (6.7U1). One of the requirements for RDMA is a vSphere Distributed Switch (vDS). First, I configured a dedicated vDS for the RDMA communication. I simply set up a basic vDS configuration (DSwitch-DVUplinks-34) with a default portgroup (DPortGroup) and equipped it with just one uplink.

In vSphere, a virtual machine can use a PVRDMA network adapter to communicate with other virtual machines that have PVRDMA devices. The virtual machines must be connected to the same vSphere Distributed Switch.

Second, I created a VMkernel port (vmk1) dedicated to RDMA traffic in this DPortGroup, without assigning an IP address to the port (No IPv4 settings).

Third, I set the Advanced System Setting Net.PVRDMAVmknic on the ESXi host and gave it a value pointing to the VMkernel port (vmk1) – see “Tag a VMkernel Adapter for PVRDMA” in the documentation.
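
The same tagging can presumably be done with esxcli instead of the vSphere Client (a sketch, assuming vmk1 is the adapter to tag):

# Point PVRDMA traffic at the dedicated VMkernel port
esxcli system settings advanced set -o /Net/PVRDMAVmknic -s vmk1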

Then, I enabled the “pvrdma” firewall rule on the host in the Edit Security Profile window – see “Enable the Firewall Rule for PVRDMA”.
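
Here the CLI equivalent should be (again a sketch):

# Allow PVRDMA traffic through the ESXi firewall
esxcli network firewall ruleset set -e true -r pvrdma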

The next steps are related to the configuration of the VMs. First, I created a new virtual machine. Then, I added another network adapter to it and connected it to the DPortGroup on the vDS. For the Adapter Type of this network adapter I chose PVRDMA, with Device Protocol RoCE v2 – see “Assign a PVRDMA Adapter to a Virtual Machine”.

Then, I installed Fedora 29 on the first VM. I chose it because there are many tools available to easily test RDMA communication. After the OS installation, another network interface showed up in the VM, and I addressed it in a different IP subnet. I used two network interfaces in the VMs: the first one for SSH access and the second one to test the RDMA communication.

Then I set “Reserve all guest memory (All locked)” in the VM’s Edit Settings window.

At this point the VM was configured sufficiently – at the infrastructure layer – to communicate using RDMA.

To test it, I had to install the appropriate tools, which I found on GitHub, here.
I installed them using the procedure described on the previously mentioned page.

dnf install cmake gcc libnl3-devel libudev-devel pkgconfig valgrind-devel ninja-build python3-devel python3-Cython

Next, I installed the git client using the following command.

yum install git

Then I cloned the git project to a local directory.

mkdir /home/rdma
git clone https://github.com/linux-rdma/rdma-core.git /home/rdma

I built it.

cd /home/rdma
bash build.sh
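
Before going further, it may be worth confirming that the guest actually sees the paravirtual RDMA device. The rdma-core build drops the usual verbs utilities into build/bin, so a quick check could look like this (the exact device name reported for the PVRDMA guest driver is an assumption on my part):

cd /home/rdma/build/bin
# List the RDMA devices visible in the guest - the PVRDMA adapter should appear here
./ibv_devices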

Afterwards, I cloned the VM to have a communication partner for the first one. After cloning, I reconfigured the IP addresses in the cloned VM appropriately.

Finally I could test the communication using RDMA.

On the VM that functioned as the server, I ran a listener service on the interface mapped to the PVRDMA virtual adapter:

cd /home/rdma/build/bin
./rping -s -a 192.168.0.200 -P

Then, on the client VM, I ran this command to connect to the server:

./rping -c -I 192.168.0.100 -a 192.168.0.200 -v

It was working beautifully!


dcli and how to shut down vCSA

Sometimes you want to shut down a vCSA or PSC gracefully, but you don’t have access to the GUI through the vSphere Client or the VAMI.

How to do it from the CLI? I’m going to show you right now using dcli, because I’m exploring the potential of this tool and I can’t get enough of it.

  1. Open an SSH session to the vCSA and log in as the root user.
  2. Run dcli in interactive mode.

    dcli +i
  3. Use the shutdown API call to shut down the appliance, giving a delay value (0 means now) and a description of the task.

    com vmware appliance shutdown poweroff --delay 0 --reason 'Shutdown now'
  4. Enter an appropriate administrator user name, e.g. administrator@vsphere.local, and a password.
  5. Decide if you want to save the credentials in the credstore. You can enter ‘y’ for yes.

Wait until the appliance goes down. 🙂
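
As a side note, dcli can also run a single call non-interactively, straight from the shell, which could be handy in scripts (a sketch – I’d expect the same namespace to work outside interactive mode):

dcli com vmware appliance shutdown poweroff --delay 0 --reason 'Shutdown now'

And if you change your mind during the delay window, the same namespace exposes a cancel call:

com vmware appliance shutdown cancel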

I know that it’s not the quickest way, but the point is to have fun.

VCSA Tools – Part 1 – journalctl. A better way to review vCSA logs.

There are plenty of great CLI tools in the vCSA that a modern vSphere administrator should know, so I decided to share my knowledge and describe them in a series of articles.

The first one is journalctl – a tool that simplifies and speeds up the vCSA troubleshooting process.

Below I present how I use it to filter log records.

Log in to the VCSA shell and run the commands below, depending on the result you want to achieve.

The logs from the current boot:

journalctl -b


The boots that journald is aware of:

journalctl --list-boots

It will show results like these, one line per known boot:

-3 26c7f0e356eb4d0bbd8d9fad1b457808 Wed 2019-01-09 15:18:12 UTC—Wed 2019-01-09 16:58:18 UTC
-2 124d73cf46d441908db7609813e0c49a Wed 2019-01-09 17:01:08 UTC—Fri 2019-01-11 10:22:39 UTC
-1 ebf1ba1936404fb086727b61ac825d47 Fri 2019-01-11 11:22:59 UTC—Fri 2019-01-11 15:48:37 UTC
 0 bd70a6d99a7245dc9b7859c5f2a7ef6f Mon 2019-01-14 10:19:47 UTC—Tue 2019-01-15 12:38:16 UTC

Now you can use the relative boot number to display the entries of the chosen boot:

journalctl -b -1


To limit the results to a given timeframe, use dates or keywords:

journalctl --since "2019-01-10" --until "2019-01-14 03:00"

or:

journalctl --since yesterday

or:

journalctl --since 09:00 --until "2 hours ago"


Limiting the logs by service name:

Get the service names first:

systemctl list-unit-files | grep -i vmware

To show only the records for the vpxd service in the current boot:

journalctl -b -u vmware-vpxd.service
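
To follow a service’s log live, the way tail -f would, add the -f switch:

journalctl -f -u vmware-vpxd.service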


Filtering by ID (process, user or group):

Get the user ID, e.g.:

cat /etc/passwd | grep -i updatemgr

then:

journalctl _UID=1017 --since today


Filtering by the binary path:

journalctl -b --since "20 minutes ago" /usr/sbin/vpxd


Displaying kernel messages:

journalctl -k


Limiting messages by their priorities:

journalctl -p err -b

where the priority codes are:

0: emerg
1: alert
2: crit
3: err
4: warning
5: notice
6: info
7: debug
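
These filters can be combined with output options. For example, to dump the errors from the current boot in a machine-readable form (both -o json-pretty and --no-pager are standard journalctl switches):

journalctl -p err -b -o json-pretty --no-pager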

Of course these examples don’t exhaust all the possibilities this tool offers, but consider them a starting point. Feel free to add new, useful examples in the comments below.

That’s all for today!

dcli and orphaned VMs in vCenter Server inventory

Orphaned VMs in the vCenter inventory are an unusual sight in an experienced administrator’s Web/vSphere Client window. But in large environments, where many people manage hosts and VMs, it will happen sometimes.

You probably know how to get rid of them using the traditional methods described in VMware KB articles and by other well-known bloggers, but there’s quite an elegant new method using dcli.
This handy tool is available in the vCLI package, in the 6.5/6.7 vCSA shell, and in the vCenter Server on Windows command prompt. dcli uses the vSphere Automation APIs to give an administrator an interface for calling methods to perform or automate tasks.

How to use it to remove orphaned VMs from vCenter inventory?

  1. Open an SSH session to the vCSA and log in as the root user.
  2. Run dcli in interactive mode.

    dcli +i
  3. Get a list of the VMs registered in vCenter’s inventory. Log in as an administrator user of your SSO domain. You can save the credentials in the credstore for future use.

    com vmware vcenter vm list
  4. From the displayed list, get the MoID (Managed Object ID) of the affected VM, e.g. vm-103.
  5. Run this command to delete the affected VM’s record from vCenter’s database, using its MoID.

    com vmware vcenter vm delete --vm vm-103
  6. Using the Web/vSphere Client, check the vCenter inventory to confirm the affected VM has been deleted.

It’s working!
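
By the way, both calls can also be issued non-interactively, which makes the cleanup easy to script (a sketch – the command path is simply passed to dcli from the shell):

dcli com vmware vcenter vm list
dcli com vmware vcenter vm delete --vm vm-103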

VSAN real capacity utilization

There are a few caveats that make the calculation and planning of VSAN capacity tough, and it gets even harder when you try to map it to the real consumption at the VSAN datastore level.

  1. VSAN disk objects are thin provisioned by default.
  2. Configuring a full reservation of storage space through the Object Space Reservation rule in a Storage Policy does not mean the disk object’s blocks will be inflated on the datastore. It only means the space will be reserved and shown as used in the VSAN Datastore Capacity pane.

This makes it even harder to figure out why the size of the “files” on this datastore doesn’t match other capacity-related information.

  3. In order to plan capacity you need to include the overhead of Storage Policies – policies in the plural, as I haven’t met an environment that would use only one policy for all kinds of workloads. This means that planning should start with dividing the workloads into groups which might require different levels of protection.
  4. Apart from disk objects there are other objects, especially SWAP, which are not displayed in the GUI and can easily be forgotten. Depending on the size of the environment, they might consume a considerable amount of storage space.
  5. The VM SWAP object does not adhere to the Storage Policy assigned to the VM. What does that mean? Even if you configure your VM’s disks with PFTT=0, SWAP will always use PFTT=1 – unless you disable that behaviour with an advanced host option (SwapThickProvisionDisabled), as sketched below.
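
A minimal sketch of toggling that option on a host with esxcli, assuming the standard advanced-option path /VSAN/SwapThickProvisionDisabled:

# 1 disables thick-provisioned swap, so SWAP objects stop reserving full PFTT=1 capacity
esxcli system settings advanced set -o /VSAN/SwapThickProvisionDisabled -i 1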

I made a test to check how much space my empty VM would consume (“empty” here means without even an operating system).

For this test, a VM called Prod-01 was created with 1 GB of memory, a 2 GB hard disk, and the default storage policy assigned (PFTT=1).

Based on the Edit Settings window, the VM’s disk size on the datastore is 4 GB (the maximum, based on the disk size and the policy). However, the used storage space is 8 MB, which means there are 2 replicas of 4 MB each – which is fine, as there is no OS installed at all.
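
If you want to see the object layout behind these numbers (replicas, witnesses and their sizes), RVC can show it per VM – a sketch, where the login and the inventory path are placeholders to adjust to your environment:

rvc administrator@corp.local@vcsa-01a.corp.local
vsan.vm_object_info 1/RegionA01/vms/Prod-01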

[Screenshot: the VM powered off]

However, when you open the datastore file browser you will notice that the size of the Virtual Disk object is 36,864 KB, which gives us 36 MB. So it’s neither the 4 GB nor the 8 MB displayed in Edit Settings.

[Screenshot: vSAN datastore files]

Meanwhile, the datastore’s provisioned space for the VM is listed as 5.07 GB.

[Screenshot: VM with a 2 GB disk, default policy and 1 GB RAM – powered off]

So let’s power on that VM.

Now the disk size remains intact, but other files appear – for instance, the SWAP object has been created, as well as logs and other temporary files.

[Screenshot: the VM powered on]

Looking at the datastore’s provisioned space now, it shows 5.9 GB, which again is confusing. Even if we forget about the previous findings, powering on the VM triggers SWAP creation which, in theory, should be protected with PFTT=1 and be thick provisioned. But if that were the case, the provisioned storage consumption should increase by 2 GB, not by 0.83 GB (where some of that space is consumed by logs and other small files included in the VM Home namespace object).

[Screenshot: VM with a 2 GB disk, default policy and 1 GB RAM – powered on]

Moreover, during those observations I noticed that during the VM boot process the provisioned space peaks at 7.11 GB for a very short period of time, and after a few seconds the value decreases back to 5.07 GB. Even after a few reboots those values stay consistent.

[Screenshot: VM with a 2 GB disk, default policy and 1 GB RAM – during boot]

The question is why this information is not consistent, and what happens during the boot of the VM that causes the peak in provisioned space?

That’s the quest for now – to figure it out 🙂

Part 2 – How to list vSwitch “MAC Address table” on ESXi host?

Another way to list the MAC addresses of open ports on vSwitches on an ESXi host is based on the net-stats tool.

Use this one-liner:

# Strip the trailing '/' from each portset name, then list the open ports on each vSwitch
for VSWITCH in $(vsish -e ls /net/portsets/ | tr -d '/'); do net-stats -S $VSWITCH | grep '{"name' | sed 's/[{,"]//g' | awk '{$9=$10=$11=$12=""; print $0}'; done

This is not the final word. 🙂