
dcli and orphaned VMs in vCenter Server inventory

Orphaned VMs in the vCenter inventory are an unusual sight in an experienced administrator's Web/vSphere Client window. But in large environments, where many people manage hosts and VMs, it does happen from time to time.

You probably know how to get rid of them using the traditional methods described in VMware KB articles and by other well-known bloggers, but there is a quite elegant new method using dcli.
This handy tool is available in the vCLI package, in the vCSA 6.5/6.7 shell, and at the vCenter Server on Windows command prompt. Dcli uses the vSphere Automation (REST) APIs to give an administrator a command-line interface for calling API methods, either to get things done ad hoc or to automate tasks.

How do you use it to remove orphaned VMs from the vCenter inventory?

  1. Open an SSH session to the vCSA and log in as the root user.
  2. Run dcli in interactive mode.

    dcli +i
  3. Get a list of VMs registered in the vCenter inventory. When prompted, log in as an administrator user of your SSO domain; you can save the credentials in the credstore for future use.

    com vmware vcenter vm list
  4. From the displayed list, note the MoID (Managed Object ID) of the affected VM, e.g. vm-103.
  5. Run this command to delete the record of the affected VM from the vCenter database, using its MoID.

    com vmware vcenter vm delete --vm vm-103
  6. Using the Web/vSphere Client, check the vCenter inventory to confirm that the affected VM has been removed.
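You do not have to use interactive mode at all; dcli also accepts the namespace path directly from the vCSA shell, so the whole clean-up can be done in two calls. A minimal sketch, reusing the example MoID vm-103 from above (you will be prompted for SSO credentials unless they are already stored in the credstore):

    dcli com vmware vcenter vm list
    dcli com vmware vcenter vm delete --vm vm-103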

It’s working!

Perennial reservations: weird behaviour when not configured correctly

While using RDM disks in your environment you might notice long (even extremely long) boot times on your ESXi hosts. That is because ESXi uses a dedicated technique to determine whether Raw Device Mapped (RDM) LUNs are used as MSCS cluster devices: a configuration flag marks each device participating in an MSCS cluster as perennially reserved. During the start of an ESXi host, the storage mid-layer attempts to discover all devices presented to the host during the device-claiming phase. MSCS LUNs that hold a permanent SCSI reservation lengthen the start process, because the ESXi host cannot interrogate the LUN due to the persistent SCSI reservation placed on the device by an active MSCS node hosted on another ESXi host.

Configuring the device to be perennially reserved is local to each ESXi host, and must be performed on every ESXi host that has visibility to each device participating in an MSCS cluster. This improves the start time for all ESXi hosts that have visibility to the devices.

The process is described in this KB and requires issuing the following command on each ESXi host:

esxcli storage core device setconfig -d naa.id --perennially-reserved=true

You can check the status using the following command:

esxcli storage core device list -d naa.id

In the output of the esxcli command, search for the entry Is Perennially Reserved: true. This shows that the device is marked as perennially reserved.
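If a host sees more than a couple of MSCS RDM LUNs, a small loop in the ESXi shell saves some typing. This is only a minimal sketch, not taken from the KB, and the naa IDs below are placeholders for your real device identifiers:

for id in naa.600000000000000000000000000000a1 naa.600000000000000000000000000000a2; do
    esxcli storage core device setconfig -d $id --perennially-reserved=true
    esxcli storage core device list -d $id | grep -i "Is Perennially Reserved"
done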

However, I recently came across a problem with snapshot consolidation; even Storage vMotion was not possible for a particular VM.

While checking the VM settings, one of the disks was locked and indicated that it was running on a delta disk, which means there is a snapshot. However, Snapshot Manager didn't show any snapshot at all. Moreover, creating a new snapshot and deleting all snapshots, which in most cases solves the consolidation problem, didn't help either.


In vmkernel.log, while trying to consolidate the VM, lots of perennial reservation entries were present. I initially ignored them, because there were RDMs which had intentionally been configured as perennially reserved to prevent long ESXi boots.


However, after digging deeper and checking a few things, I returned to the perennial reservations and decided to check which LUN was generating these warnings and why it produced these entries, especially during consolidation or Storage vMotion of the VM.

To my surprise, I realised that the datastore on which the VM's disks reside was configured as perennially reserved! It was due to a mistake when the PowerCLI script was prepared: someone accidentally configured all available LUNs as perennially reserved. Changing the value back to false happily solved the problem.
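For the record, reverting such a mistake is just the same esxcli command with the flag set back to false, run on every host where the datastore LUN was mis-flagged:

esxcli storage core device setconfig -d naa.id --perennially-reserved=false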

The moral of the story is simple: logs are not there to be ignored 🙂

Mystery of the broken VM

Today my colleague (a VMware administrator) asked me a small favour: to help perform an RCA (root cause analysis) for one of the production VMs that had recently had a problem. The VM was for some reason migrated (my colleague maintains it happened without administrative intervention) to another ESXi host that did not have the proper network configuration for this VM, and this caused an outage of the whole system. I asked about the time of the issue and we went deeper into the logs to find out what exactly happened.
We started with the VM's own log file, vmware.log.

vmx| I120: VigorTransportProcessClientPayload: opID=52A52E10-000000CE-7b-15-97b0 seq=26853: Receiving PowerState.InitiatePowerOff request.
vmx| I120: Vix: [36833 vmxCommands.c:556]: VMAutomation_InitiatePowerOff. Trying hard powerOff
| vmx| I120: VigorTransport_ServerSendResponse opID=52A52E10-000000CE-7b-15-97b0 seq=26853: Completed PowerState request.
vmx| I120: Stopping VCPU threads…
vmx| I120: VMX exit (0).
| vmx| I120: AIOMGR-S : stat o=29 r=50 w=25 i=63 br=795381 bw=237700 
vmx| I120: OBJLIB-LIB: ObjLib cleanup done.
 -> vmx| W110: VMX has left the building: 0. 

According to VMware KB 2097159, "VMX has left the building: 0" is an informational message caused by powering off a virtual machine or by a vMotion of the virtual machine to a new host. It can be safely ignored, so we had our first clue pointing to a VM migration.
Next we moved to fdm.log (using the timestamps from vmware.log) and, surprise surprise, the VM had for some reason been powered off an hour before the issue occurred:

vmx local-host: local power state=powered off; assuming user power off; global power state=powered off
verbose fdm[FF9DE790] [Originator@6876 sub=Invt opID=SWI-95f1a3f] [VmStateChange::SaveToInventory] vm /vmfs/volumes/vmx from __localhost__ changed inventory cleanPwrOff=1

Now we needed to find out why the VM was powered off, so we went through the hostd logs, grepping for the VM name:
info hostd[3DE40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/.vmx] State Transition (VM_STATE_POWERING_OFF -> VM_STATE_OFF)
info hostd[3E9C1B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/.vmx opID=6197d370-76-97ee user=vpxuser:VSPHERE.LOCAL\Administrator] State Transition (VM_STATE_OFF -> VM_STATE_CREATE_SNAPSHOT)
info hostd[3E9C1B70] [Originator@6876 sub=Libs opID=6197d370-76-97ee user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotConfigInfoReadEx: Creating new snapshot dictionary, '/vmfs/volumes/.vmsd.usd'.
info hostd[3E9C1B70] [Originator@6876 sub=Libs opID=6197d370-76-97ee user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotCombineDisks: Consolidating from '/vmfs/volumes/-000001.vmdk' to '/vmfs/volumes/.vmdk'.

-> info hostd[3F180B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 485016 : Virtual machine disks consolidated successfully on in cluster in ha-datacenter. 

So we confirmed that shortly before the issue occurred the VM had been powered off for snapshot consolidation. At this point we assumed this might be related to the general issue, but we decided to verify the vobd log (less verbose than vmkernel, and it gives a good view of ESXi health when it comes to storage and networking):

[uplink.transition.down] Uplink: vmnic0 is down. Affected portgroup: VM Network VLAN130. 2 uplinks up. Failed criteria: 128
[uplink.transition.down] Uplink: vmnic0 is down. Affected portgroup: VM Network VLAN8 ESOK2. 2 uplinks up. Failed criteria: 128
Lost uplink redundancy on virtual switch "vSwitch0". Physical NIC vmnic0 is down. Affected port groups: "ISCSI", "Management Network"

This was it, a direct hit on the cause of the main issue: a network adapter problem. With the new timestamps, just for confirmation, we went back to fdm.log:

fdm[FFC67B70] [Originator@6876 sub=Notifications opID=SWI-69d87e37] [Notification::AddListener] Adding listener of type Csi::Notifications::VmStateChange: Csi::Policies::GlobalPolicy (listeners = 4)
--> Protected vms (0):
--> Unprotect request vms (0):
--> Locked datastores (0):
--> Events (4):
--> EventEx=com.vmware.vc.HA.ConnectedToMaster vm= host=host-24 tag=host-24:1225982388:0
--> EventEx=com.vmware.vc.HA.AllHostAddrsPingable vm= host=host-24 tag=host-24:1225982388:1
--> EventEx=com.vmware.vc.HA.AllIsoAddrsPingable vm= host=host-24 tag=host-24:1225982388:2
--> EventEx=com.vmware.vc.ha.VmRestartedByHAEvent

With VMware KB 2036555 (Determining which virtual machines are restarted during a vSphere HA failover) we confirmed that HA reacted to the network outage and powered the VM on on a different, unaffected ESXi host.
Conclusion: always correlate the problematic VM's log with the ESXi logs (hostd, vobd, vmkernel) to get the full picture of the issue.
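In practice that correlation is mostly grep work on the host. A minimal sketch of the kind of cross-check described above, run in the ESXi shell (the VM name and timestamp below are placeholders, not taken from this case):

grep -i "MyProblematicVM" /var/log/hostd.log /var/log/fdm.log
grep -i "2018-01-01T12:" /var/log/vobd.log /var/log/vmkernel.log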

 

VM Consolidation – Survival Guide


A survival guide for VM snapshot consolidation problems, all in one place:

Note! Make sure any backup software is turned off or that all jobs are stopped. A reboot of the backup server is required to clear any potential residual locks.

  1. Restart the vCenter Server service – https://kb.vmware.com/kb/1003895
  2. Restart the management agents on the ESXi hosts in the cluster where the problematic VMs are running:

#services.sh restart – https://kb.vmware.com/kb/1003490, or manually verify to determine "who" is holding the lock
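For step 1, if vCenter runs on a vCSA 6.x appliance, the vpxd service can be restarted from the appliance shell. A minimal sketch, assuming the default service name (check the names with service-control --list first):

service-control --stop vmware-vpxd
service-control --start vmware-vpxd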

3. Use the vmkfstools -D command against the VM's snapshot files:

/vmfs/volumes/<datastore># vmkfstools -D <file name>
You see an output similar to:

[root@test-esx1 testvm]# vmkfstools -D test-000008-delta.vmdk
Lock [type 10c00001 offset 45842432 v 33232, hb offset 4116480
gen 2397, mode 2, owner 00000000-00000000-0000-000000000000 mtime 5436998] <-- MAC address of lock owner
RO Owner[0] HB offset 3293184 xxxxxxxx-xxxxxxxx-xxx-xxxxxxxxxxxx <-- MAC address of read-only lock owner
Addr <4, 80, 160>, gen 33179, links 1, type reg, flags 0, uid 0, gid 0, mode 100600
len 738242560, nb 353 tbz 0, cow 0, zla 3, bs 2097152

//more information in kb: https://kb.vmware.com/kb/10051

If an ESXi host is holding the lock, you can restart the management agents as per the advice above, migrate all VMs and reboot the host, or determine which process is holding the lock by running one of these commands:

# lsof file

# lsof | grep -i file

For example:

# lsof | grep test02-flat.vmdk

You should see an output similar to:

COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME

71fd60b6- 3661 root 4r REG 0,9 10737418240 23533 Test02-flat.vmdk

Check the process with the PID returned above; in our example:

# ps -ef | grep 3661

To kill the process, run the command:

# kill <PID>

All in all, once we have solved the lock problems we can continue with the VM consolidation process:

  4. Connect directly to the ESXi host where the problematic VM resides.
  5. Power off the problematic VM.
  6. Disable CBT for the virtual machine (very often the ctk files are corrupt, for example when a backup job runs against a VM with an active snapshot, which is an unsupported configuration); see the parameter sketch right after this group of steps and http://kb.vmware.com/kb/1031873.
  7. Remove any files ending with the *-ctk.vmdk extension from the virtual machine directory.
  8. Enable CBT for the virtual machine again, see: http://kb.vmware.com/kb/1031873
  9. Remove the VM from the inventory and add it back (just to verify VM configuration integrity; in case of any .vmx problems you will get an error message and will need to correct the VM config), more information in KB: https://kb.vmware.com/kb/1003743
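As mentioned in step 6, disabling CBT boils down to a couple of virtual machine configuration parameters. A minimal sketch of what the change described in KB 1031873 looks like in the .vmx file (scsi0:0 is just an example disk node; the values are set back to "TRUE" when you re-enable CBT in step 8):

ctkEnabled = "FALSE"
scsi0:0.ctkEnabled = "FALSE"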
  10. Create a snapshot:

Right-click the virtual machine.

Click Snapshot.

Click Take Snapshot.

  11. Perform a Delete All operation:

Right-click the virtual machine.

Click Snapshot.

Click Snapshot Manager.

Click Delete All.

TIP: To verify that the snapshots are being consolidated, run these commands:

#watch "ls -lhut --time-style=full-iso *-delta.vmdk"

#watch "ls -lh --full-time *-delta.vmdk *-flat.vmdk"

//more info in kb: https://kb.vmware.com/kb/1007566

  12. Power on the VM and verify the fix.

 

However, if the above does not solve the problem, we have two alternative options:

  a) Clone or Storage vMotion the problematic VMs to a different datastore.
  b) Use VMware Converter and perform a V2V operation.

That's it: my survival guide for VM snapshot consolidation problems. I wonder if you have any additions or a different approach?

Adding a sound card to ESXi hosted VM

A sound card in a vSphere virtual machine is an unsupported configuration; this feature is dedicated to virtual machines created in VMware Workstation. However, you can still add an HD Audio device to a vSphere virtual machine by manually editing the .vmx file. I have tested it in our lab environment and it works just fine.

Below is the procedure for doing this:

1. Verify the storage where the VM with no sound card resides.


2. Log in as root to the ESXi host where the VM resides, using SSH.
3. Navigate to /vmfs/volumes/<VM LUN>/<VM folder>.
   In my example it was:
   ~# cd /vmfs/volumes/Local_03esx-mgmt_b/V11_GSS_DO
4. Shut down the problematic VM.
5. Edit the .vmx file using the vi editor.

IMPORTANT:
Make a backup copy of the .vmx file. If your edits break the virtual machine, you can roll back to the original version of the file.
For more information about editing files on an ESXi host, refer to KB article https://kb.vmware.com/kb/1020302.
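A quick way to make that backup copy from the ESXi shell, right in the VM folder. This is only a minimal sketch; the file name V11_GSS_DO.vmx is an assumption based on the example folder name used above:

# the .vmx name below is assumed from the example folder name
cp V11_GSS_DO.vmx V11_GSS_DO.vmx.bak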

6. Once you have opened the .vmx file for editing, navigate to the bottom of the file and add the following lines to the .vmx configuration:
   sound.present = "true"
   sound.allowGuestConnectionControl = "false"
   sound.virtualDev = "hdaudio"
   sound.fileName = "-1"
   sound.autodetect = "true"
7. Save the file and power on the virtual machine.
8. Once it has booted, and you have enabled the Windows Audio service, the sound will work fine.

If you go to "Edit Settings" for the VM, you will see information that the device is unsupported. Please be aware that after adding a sound card to your virtual machine you may experience some kind of unexpected behaviour (tip: in our lab environment this configuration works without issues).