
Mystery of the broken VM

Today my colleague (a VMware administrator) asked me for a small favour – to help perform an RCA (root cause analysis) for a production VM that recently had a problem. For some reason the VM was migrated (my colleague insists it happened without administrative intervention) to another ESXi host that did not have the proper network configuration for this VM, which caused an outage of the whole system. I asked for the time of the issue and we dug into the logs to find out what exactly had happened.
We started with the VM's own log, vmware.log.

vmx| I120: VigorTransportProcessClientPayload: opID=52A52E10-000000CE-7b-15-97b0 seq=26853: Receiving PowerState.InitiatePowerOff request.
vmx| I120: Vix: [36833 vmxCommands.c:556]: VMAutomation_InitiatePowerOff. Trying hard powerOff
| vmx| I120: VigorTransport_ServerSendResponse opID=52A52E10-000000CE-7b-15-97b0 seq=26853: Completed PowerState request.
vmx| I120: Stopping VCPU threads…
vmx| I120: VMX exit (0).
| vmx| I120: AIOMGR-S : stat o=29 r=50 w=25 i=63 br=795381 bw=237700 
vmx| I120: OBJLIB-LIB: ObjLib cleanup done.
 -> vmx| W110: VMX has left the building: 0. 

According to VMware KB 2097159, "VMX has left the building: 0" is an informational message caused by powering off a virtual machine or by a vMotion of the virtual machine to a new host. It can be safely ignored, so we had our first clue pointing to a VM migration.
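
To find those entries in the first place, a simple grep of the VM's vmware.log is enough. The sketch below is mine, not from the original investigation – the datastore and VM directory names are placeholders for the paths that are redacted in the excerpts above:

# placeholders for the redacted datastore / VM directory
VMDIR=/vmfs/volumes/<datastore>/<vm_name>
grep -E "InitiatePowerOff|left the building" "$VMDIR/vmware.log"
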
Next we moved to fdm.log (using the time stamps from vmware.log) and – surprise, surprise – the VM had for some reason been powered off an hour before the issue occurred:

vmx local-host: local power state=powered off; assuming user power off; global power state=powered off
verbose fdm[FF9DE790] [Originator@6876 sub=Invt opID=SWI-95f1a3f] [VmStateChange::SaveToInventory] vm /vmfs/volumes/vmx from __localhost__ changed inventory cleanPwrOff=1
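
For completeness, entries like these can be pulled out of the HA agent log with a quick grep – a sketch only, since the VM name is a placeholder and on HA-enabled ESXi 6.x hosts fdm.log normally sits under /var/run/log:

grep -i "power state" /var/run/log/fdm.log | grep -i "<vm_name>"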

Now we needed to find out why the VM had been powered off, so we went through the hostd logs, grepping for the VM name:
info hostd[3DE40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes.vmx] State Transition (VM_STATE_POWERING_OFF -> VM_STATE_OFF)
info hostd[3DE40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/.vmx] State Transition (VM_STATE_POWERING_OFF -> VM_STATE_OFF)
info hostd[3E9C1B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/.vmx opID=6197d370-76-97ee user=vpxuser:VSPHERE.LOCAL\Administrator] State Transition (VM_STATE_OFF -> VM_STATE_CREATE_SNAPSHOT)
info hostd[3E9C1B70] [Originator@6876 sub=Libs opID=6197d370-76-97ee user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotConfigInfoReadEx: Creating new snapshot dictionary, '/vmfs/volumes/.vmsd.usd'.
info hostd[3E9C1B70] [Originator@6876 sub=Libs opID=6197d370-76-97ee user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotCombineDisks: Consolidating from '/vmfs/volumes/-000001.vmdk' to '/vmfs/volumes/.vmdk'.

-> info hostd[3F180B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 485016 : Virtual machine disks consolidated successfully on in cluster in ha-datacenter. 
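
For reference, a search like the one sketched below (run in the ESXi shell) surfaces these entries; the VM name is a placeholder and /var/log/hostd.log is the usual location on ESXi 6.x:

grep "<vm_name>" /var/log/hostd.log | grep -E "State Transition|SNAPSHOT"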

So we confirmed that shortly before the issue occurred the VM had been powered off for snapshot consolidation. At this point we suspected it might be related to the general issue, but we decided to verify the vobd log (less verbose than vmkernel, and it gives a good view of ESXi health when it comes to storage and networking):

uplink.transition.down] Uplink: vmnic0 is down. Affected portgroup: VM Network VLAN130. 2 uplinks up. Failed criteria: 128
Uplink: vmnic0 is down. Affected portgroup: VM Network VLAN8 ESOK2. 2 uplinks up. Failed criteria: 128
Lost uplink redundancy on virtual switch "vSwitch0". Physical NIC vmnic0 is down. Affected port groups: "ISCSI", "Management Network"
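
A hedged sketch of how to spot such entries: grepping vobd.log for the NIC name or for the word "uplink" is usually enough to catch a flapping adapter:

grep -iE "uplink|vmnic0" /var/log/vobd.log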

This was it – a direct hit on the cause of the main issue: a network adapter problem. With the new time stamps, just for confirmation, we went back to fdm.log:

fdm[FFC67B70] [Originator@6876 sub=Notifications opID=SWI-69d87e37] [Notification::AddListener] Adding listener of type Csi::Notifications::VmStateChange: Csi::Policies::GlobalPolicy (listeners = 4)
--> Protected vms (0):
--> Unprotect request vms (0):
--> Locked datastores (0):
--> Events (4):
--> EventEx=com.vmware.vc.HA.ConnectedToMaster vm= host=host-24 tag=host-24:1225982388:0
--> EventEx=com.vmware.vc.HA.AllHostAddrsPingable vm= host=host-24 tag=host-24:1225982388:1
--> EventEx=com.vmware.vc.HA.AllIsoAddrsPingable vm= host=host-24 tag=host-24:1225982388:2
--> EventEx=com.vmware.vc.ha.VmRestartedByHAEvent

With VMware KB 2036555 (Determining which virtual machines are restarted during a vSphere HA failover) we confirmed that HA had reacted to the network outage and restarted the VM on a different, unaffected ESXi host.
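
If you want to check the same thing on your own hosts: an HA restart leaves EventEx entries like the ones above, so a grep of fdm.log on the host that took the VM over is enough (a sketch; /var/run/log/fdm.log is the usual location on HA-enabled hosts):

grep -i "VmRestartedByHAEvent" /var/run/log/fdm.log
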
Conclusion – always correlate the problematic VM's log with the ESXi logs (hostd, vobd, vmkernel) to get the full picture of an issue.
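
A rough way to do that correlation from the ESXi shell is sketched below. It is my own helper, not VMware tooling: the time stamp prefix is hypothetical (take it from vmware.log), and fdm.log is only present on HA-enabled hosts.

TS="2016-03-08T14:0"   # hypothetical time stamp prefix copied from vmware.log
for log in /var/log/hostd.log /var/log/vobd.log /var/log/vmkernel.log /var/run/log/fdm.log; do
    echo "=== $log ==="
    grep "$TS" "$log" 2>/dev/null
done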

 

ESXi host connection lost due to CDP/LLDP protocol

You can observe random, intermittent loss of connection to ESXi 6.0 hosts running on Dell servers (both rack and blade). It is caused by a bug in the Cisco Discovery Protocol / Link Layer Discovery Protocol handling. It can also be seen while generating a VMware support log bundle, because during that process these protocols are used to collect information about the network.

 

What are these protocols for? Both of them perform a similar role in the local area network: they are used by network devices to advertise their identity, capabilities and neighbours. The main difference is that CDP is a Cisco proprietary protocol while LLDP is vendor-neutral. There are also other niche protocols such as the Nortel Discovery Protocol, the Foundry Discovery Protocol or Link Layer Topology Discovery.

CDP and LLDP are also supported by VMware virtual switches, which can thereby gather and display information about the attached physical switches. CDP is available for both standard and distributed switches, whilst LLDP is available only for distributed virtual switches (since vSphere 5.0).

[Image: Cisco Discovery Protocol information displayed at the vSwitch level.]
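
The same neighbour information can also be pulled from the ESXi shell with vim-cmd's network hint query – a sketch below; vmnic0 is just an example uplink and the exact option spelling may vary between builds:

# vim-cmd hostsvc/net/query_networkhint --pnic-name=vmnic0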

 
There is currently no resolution for this bug, but thanks to VMware Technical Support the workaround described below is available.

 

Turn off CDP for each vSwitch:

# esxcfg-vswitch -B down vSwitchX

You can also verify the current CDP status using the following command:

# esxcfg-vswitch -b vSwitchX
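
Since there is usually more than one vSwitch per host, a small loop saves some typing. This is only a sketch: it assumes the default vSwitchN naming reported by esxcli network vswitch standard list, so adjust the grep pattern if your switches are named differently.

for sw in $(esxcli network vswitch standard list | grep "^vSwitch" | awk '{print $1}'); do
    echo "$sw CDP status: $(esxcfg-vswitch -b "$sw")"
    esxcfg-vswitch -B down "$sw"
done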

This simple task will resolve the problem of random connection loss on ESXi hosts. However, it will not solve the loss of connection during log bundle generation.

To confirm that the problem still exists, you can simply run the following command:

# vm-support -w /vmfs/volumes/datastore_name

Even though we turned CDP off, during the log generation process ESXi still uses it to gather information about the network topology.

To fix it, you have to download the script called disablelldp2.py and perform the steps below:

  1. Copy the script to a datastore which is shared with all hosts.
  2. Open an SSH session to an ESXi host:
    1. Change to the directory where you copied the script,
    2. Grant execute permission: # chmod 555 disablelldp2.py
  3. Run the script: ./disablelldp2.py
  4. After the script has executed, go to /etc/rc.local.d and edit the local.sh file. It should look like this:

#!/bin/sh

# local configuration options

# Note: modify at your own risk!  If you do/use anything in this
# script that is not part of a stable API (relying on files to be in
# specific places, specific tools, specific output, etc) there is a
# possibility you will end up with a broken system after patching or
# upgrading.  Changes are not supported unless under direction of
# VMware support.

# Replace the lldpnetmap helper with a harmless stub so that vm-support
# no longer triggers the CDP/LLDP bug; keep the original binary as a backup.
ORIGINAL_FILE=/sbin/lldpnetmap
MODIFIED_FILE=/sbin/lldpnetmap.original

if test -e "$MODIFIED_FILE"
then
    echo "$MODIFIED_FILE already exists."
else
    mv "$ORIGINAL_FILE" "$MODIFIED_FILE"
    echo "Omitting LLDP Script." > "$ORIGINAL_FILE"
    chmod 555 "$ORIGINAL_FILE"
fi
exit 0

  5. Restart the ESXi server and run the vm-support command again to confirm that the problem is solved (a quick sanity check is sketched below).
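
A quick sanity check after the reboot (my own addition, not part of the original instructions): the stub should now sit in place of the original helper, and generating a log bundle should no longer drop the host connection.

ls -l /sbin/lldpnetmap /sbin/lldpnetmap.original
cat /sbin/lldpnetmap                        # should print "Omitting LLDP Script."
vm-support -w /vmfs/volumes/datastore_name  # should now complete without the host disconnecting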