Browsed by
Tag: logs

Mystery of the broken VM

Mystery of the broken VM

Today my colleague (vmware administrator) asked me a small favour – help to perform RCA (root cause analyze) – related to one of production VM that had recently a problem. VM for some reason was migrated (collegue stands that it happened without administrative intervention) to other ESXi host that do not have proper network config for this vm – this caused outage for whole system. I asked about issue time and we went deeper into logs in order to find out what exacly happened.
We started from VM logs called vmware.log.

vmx| I120: VigorTransportProcessClientPayload: opID=52A52E10-000000CE-7b-15-97b0 seq=26853: Receiving PowerState.InitiatePowerOff request.
vmx| I120: Vix: [36833 vmxCommands.c:556]: VMAutomation_InitiatePowerOff. Trying hard powerOff
| vmx| I120: VigorTransport_ServerSendResponse opID=52A52E10-000000CE-7b-15-97b0 seq=26853: Completed PowerState request.
vmx| I120: Stopping VCPU threads…
vmx| I120: VMX exit (0).
| vmx| I120: AIOMGR-S : stat o=29 r=50 w=25 i=63 br=795381 bw=237700 
vmx| I120: OBJLIB-LIB: ObjLib cleanup done.
 -> vmx| W110: VMX has left the building: 0. 

According to VMware KB: 2097159 „VMX has left the building: 0” – is an informational message and is caused by powering off a Virtual Machine or performing a vMotion of the Virtual Machine to a new host. It can be safely ignored so we had a first clue related to vm migration.
Next we moved to fdm.log (using time stamps form vmware.log) and surprise surprise – VM for some reason was powered off 1h before issue occur:

vmx local-host: local power state=powered off; assuming user power off; global power state=powered off
verbose fdm[FF9DE790] [Originator@6876 sub=Invt opID=SWI-95f1a3f] [VmStateChange::SaveToInventory] vm /vmfs/volumes/vmx from __localhost__ changed inventory cleanPwrOff=1

Now we need to find out why VM was powered off – so we went through hostd logs using  grep with VM name:
info hostd[3DE40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes.vmx] State Transition (VM_STATE_POWERING_OFF -> VM_STATE_OFF)
info hostd[3DE40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/.vmx] State Transition (VM_STATE_POWERING_OFF -> VM_STATE_OFF)
info hostd[3E9C1B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/.vmx opID=6197d370-76-97ee user=vpxuser:VSPHERE.LOCAL\Administrator] State Transition (VM_STATE_OFF -> VM_STATE_CREATE_SNAPSHOT)
info hostd[3E9C1B70] [Originator@6876 sub=Libs opID=6197d370-76-97ee user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotConfigInfoReadEx: Creating new snapshot dictionary, ‘/vmfs/volumes/.vmsd.usd’.
info hostd[3E9C1B70] [Originator@6876 sub=Libs opID=6197d370-76-97ee
user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotCombineDisks: Consolidating from ‘/vmfs/volumes/ -000001.vmdk’ to ‘/ user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotCombineDisks: Consolidating from ‘/vmfs/volumes/000001.vmdk’ to ‘/vmfs/volumes/.vmdk’. 

-> info hostd[3F180B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 485016 : Virtual machine disks consolidated successfully on in cluster in ha-datacenter. 

So we confirmed that shortly before issue occured VM was powered off for snapshot consolidation – at this point we assumed that this might be related to general issue but we decided to verify vobd log (less verbose than vmkernel and give good view about esxi healt in case of storage and networking) :

uplink.transition.down] Uplink: vmnic0 is down. Affected portgroup: VM Network VLAN130. 2 uplinks up. Failed criteria: 128 Uplink: vmnic0 is down. Affected portgroup: VM Network VLAN8 ESOK2. 2 uplinks up. Failed criteria:128
Lost uplink redundancy on virtual switch “vSwitch0”. Physical NIC vmnic0 is down. Affected port groups: “ISCSI”, “Management Network”

This is it, direct hit to the reason of  main issue – network adapter problem, with new time stamps just for confirmation we went back to fdm.log :

fdm[FFC67B70] [Originator@6876 sub=Notifications opID=SWI-69d87e37] [Notification::AddListener] Adding listener of type Csi::Notifications::VmStateChange: Csi::Policies::GlobalPolicy (listeners = 4)
–> Protected vms (0):
–> Unprotect request vms (0):
–> Locked datastores (0):
–> Events (4):
–> EventEx=
com.vmware.vc.HA.ConnectedToMaster vm= host=host-24 tag=host-24:1225982388:0
–> EventEx=com.vmware.vc.HA.AllHostAddrsPingable vm= host=host-24 tag=host-24:1225982388:1
–> EventEx=com.vmware.vc.HA.AllIsoAddrsPingable vm= host=host-24 tag=host-24:1225982388:2
–> EventEx=com.vmware.vc.ha.VmRestartedByHAEvent

With VMware KB: Determining which virtual machines are restarted during a vSphere HA failover (2036555) – confirm that ha react to network outage and power-on vm on diffrenet not afected esxi.
Conclusion – always correlate problematic vm log with esxi (host,vobd,vmkernel) logs to have full issue picture.

 

vSphere 6.5 – enhanced logging

vSphere 6.5 – enhanced logging

 

vSphere 6.5 introduces audit logging, before vSphere 6.5  logs were more focused on finding root causes of a problem – not releate deep  to IT operations or security use cases. For example, if a virtual machine was reconfigured from one storage adapter to another in logs we would find only “Virtual Machine <name> reconfigured”.

 But now logs which are coming from vCenter via Syslog will contain data from vCenter Events. These logs will clearly show “Before” and “After” setting changes.  This enhances the ability of IT and Security administrators to troubleshoot issues by providing information what was exactly changed in the vSphere environment.

 security1

Enhanced logging summary:

  • Improved vCenter/ESXi event logs quality
    • Informative auditing without having to enable verbose mode
  • Structured vCenter Events SysLog Stream
    • Minimal VC overhead
    • Simplified deployment
    • Enables upper level intelligence
  • Customer auditing examples:
    • VM was moved to a wrong network
    • VM disk was deleted by accident
    • VM was under/over provisioned

Now let’s see how to enable streaming VC events to remote syslog server :

security2

security3

NOTE!!! This feature is not available on Windows VC

1. Enable event syslog:

security4

2. Configure connection parameters:

security5

And finally let’s look at some examples of vCenter events audit quality:

security6