vCenter Appliance 6.0 U3 email notifications are not sent when multiple email addresses are defined in an alarm action

Recently I tried to configure email notifications on my lab vCenter Server Appliance (6.0 U3), but ran into the following issue:

 “Diagnostic-Code: SMTP;550 5.7.60 SMTP; Client does not have permissions to send as this sender”

I tried the solution from KB https://kb.vmware.com/kb/2075153, but apparently it does not work with the latest 6.0.x appliance.

After some research and deeper digging (header analysis), it seems that the root cause was an invalid return path in the email header. To resolve this, you need to edit two system files:

1. SSH to the VCSA and enable the shell:

Command> shell.set --enabled True

Command> shell

2. Go to the directory /etc/sysconfig:

mail1

3. Edit the “mail” file using vi and make the change shown in the screenshot below:

# vi mail

mail2

  • Then verify the change using cat:

mail3

4. In the same directory, edit the “sendmail” file and add a domain name to “SENDMAIL_GENERICS_DOMAIN=”:

mail4

5. Next, go to the /etc/mail directory and add a user to mask root in “genericstable”:

mail5

6. Regenerate the table:

# makemap -r hash /etc/mail/genericstable.db < /etc/mail/genericstable

7. Create the file sendmail.mc:

# /sbin/conf.d/SuSEconfig.sendmail -m4 > /sendmail.mc

Note: Do not edit the “sendmail” file as in the above procedure.

8. Check whether a “sendmail.cf” file exists in /etc; if it does, rename it:

# mv /etc/sendmail.cf /etc/sendmail.cf.orig

9. Create a new config file:

# m4 /sendmail.mc > /etc/sendmail.cf

10. Open the config file “sendmail.cf” in vi and add the IP address of the SMTP/Exchange server in your environment (DS[xxx.xxx.xxx.xxx]):

mail6

11. Restart the sendmail service:

# /etc/init.d/sendmail restart

 

Now it should work fine!
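Before restarting sendmail it can be handy to confirm that the regenerated sendmail.cf really carries the smart host from step 10. The sketch below is only an illustration: the function name find_smart_host and the sample address 192.0.2.25 are assumptions, and it relies solely on the fact that a line starting with DS defines the smart relay host in sendmail.cf.

```python
import re


def find_smart_host(cf_text):
    """Return the value of the DS (smart host) macro, or None if it is unset."""
    for line in cf_text.splitlines():
        # A sendmail.cf line of the form "DS<value>" defines the smart relay host.
        match = re.match(r"^DS(\S*)\s*$", line)
        if match:
            return match.group(1) or None
    return None


# Hypothetical excerpt; on the appliance you would read /etc/sendmail.cf instead.
sample = '# "Smart" relay host (may be null)\nDS[192.0.2.25]\n'
print(find_smart_host(sample))
```

If the function returns None, the DS line is either missing or empty, and step 10 should be repeated.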

Esxi Net.ReversePathFwdCheckPromisc Advanced setting

During the deployment of a Cisco proxy appliance, we discovered a problem. According to Cisco, to resolve it “Net.ReversePathFwdCheckPromisc” should be set to “1” on the ESXi hosts.

The question is: do you know of any negative effects such a change could cause? We believe there must be a reason why this option is set to 0 by default. That’s why I decided to figure out what it is used for.

After some research I was able to find the answer:

Setting Net.ReversePathFwdCheckPromisc = 1 is meant for the case where you expect the reverse filters to filter the mirrored packets, to prevent multicast packets from being duplicated.

Note: If the value of the Net.ReversePathFwdCheckPromisc configuration option is changed when the ESXi instance is running, you need to enable or re-enable the promiscuous mode for the change in the configuration to take effect.
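For reference, the option can be changed from the ESXi shell with esxcli. The snippet below is a minimal sketch that only builds the command line; the helper name advanced_set_command is an assumption made for illustration, and nothing is executed against a host.

```python
import shlex


def advanced_set_command(option, value):
    """Build the esxcli argument list that sets an integer advanced option."""
    return ["esxcli", "system", "settings", "advanced", "set",
            "-o", option, "-i", str(int(value))]


cmd = advanced_set_command("/Net/ReversePathFwdCheckPromisc", 1)
# Print the command so it can be reviewed before it is run on a host.
print(shlex.join(cmd))
```

On a host the list could be passed to subprocess.run, but remember the note above: promiscuous mode has to be enabled or re-enabled afterwards for the change to take effect.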

The reason you would use promiscuous mode depends on the requirement and configuration. Please check the below KB Article:

http://kb.vmware.com/kb/1004099

  • This option is not enabled by default because we are not aware of the vSwitch configuration and can’t predict what it could be as it has configurable options.

VMware does not advise enabling this option unless you have a use case with teamed uplinks and, ideally, monitoring software running on the VMs. When promiscuous mode is enabled at the port group level, objects defined within that port group have the option of receiving all incoming traffic on the vSwitch. Interfaces and virtual machines within the port group will be able to see all traffic passing on the vSwitch, which can impact VM performance.

Should the ESX server be rebooted for this change to take effect? The answer is yes, and yes, you can enable this option with the VMs running on the existing port group.

Do you have any interesting virtualization-related questions?

 

ESXi and Likewise – troubleshooting guide – part 2

In the previous part of this small series, we discussed the theoretical background of the components and technology involved in adding an ESX host to a Windows AD environment. Now it is time to describe the troubleshooting options and some real-life problems with their solutions.

Let’s start by dividing all ESXi/Likewise issues into categories:

  1. Domain Join Failures

Here are the most common reasons why an attempt to join a domain fails:

  • The user name or password of the account used to join the domain is incorrect.
  • The name of the domain is mistyped.
  • The name of the OU is mistyped.
  • The local hostname is invalid.
  • The domain controller is unreachable from the client because of a firewall or because the NTP service is not running on the domain controller.
  • Verify that the Name Server Can Find the Domain

# nslookup <AD Domain>

  • Make Sure the Client Can Reach the Domain Controller

Verify that the ESX host can reach the domain controller by pinging it.

  • Verify that Outbound Ports Are Open
  • Port 88 – Kerberos authentication
  • Port 123 – NTP
  • Port 135 – RPC
  • Port 137 – NetBIOS Name Service
  • Port 139 – NetBIOS Session Service (SMB)
  • Port 389 – LDAP
  • Port 445 – Microsoft-DS Active Directory, Windows shares (SMB over TCP)
  • Port 464 – Kerberos password change
  • Port 3268 – Global Catalog search
  • Check DNS Connectivity

Make sure the nameserver entry in /etc/resolv.conf contains the IP address of a DNS server that can resolve the name of the domain you are trying to join.

  • Make Sure nsswitch.conf Is Configured to Check DNS for Host Names

The /etc/nsswitch.conf file must contain the following line:

hosts: files dns

  • Ensure that DNS Queries Are Not Using the Wrong Network Interface Card

If the ESX host is multi-homed, the DNS queries might be going out the wrong network interface card. Temporarily disable all the NICs except for the card on the same subnet as your domain controller or DNS server and then test DNS lookups to the AD domain. If this works, re-enable all the NICs and edit the local or network routing tables so that the AD domain controllers are accessible from the host.

  • Determine Whether the DNS Server Is Configured to Return SRV Records

Your DNS server must be set to return SRV records so the domain controller can be located. It is common for non-Windows (bind) DNS servers to not be configured to return SRV records.

Diagnose by executing the following command:

nslookup -q=srv _ldap._tcp.ADdomainToJoin.com

  • Make Sure that the Global Catalog Is Accessible

The global catalog for Active Directory must be accessible. Diagnose by executing the following command:

nslookup -q=srv _ldap._tcp.gc._msdcs.ADrootDomain.com

From the list of IP addresses in the results, choose one or more addresses and test whether they are accessible on Port 3268 by using telnet.
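The two SRV lookups above differ only in the record name, so they are easy to generate for any domain. This is a small sketch, not part of the original procedure; the function name srv_query_names is an assumption, and the domain names are the same placeholders used in the commands above.

```python
def srv_query_names(ad_domain, root_domain=None):
    """Return the SRV record names used to locate DCs and the global catalog."""
    domain = ad_domain.strip(".")
    # Fall back to the joined domain when no separate forest root is given.
    root = (root_domain or ad_domain).strip(".")
    return {
        "domain_controllers": f"_ldap._tcp.{domain}",
        "global_catalog": f"_ldap._tcp.gc._msdcs.{root}",
    }


names = srv_query_names("ADdomainToJoin.com", "ADrootDomain.com")
for name in names.values():
    # Print ready-to-paste nslookup commands.
    print(f"nslookup -q=srv {name}")
```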

  • Verify that the Client Can Connect to the Domain on Port 123

Windows time service must be running on the domain controller.

On a Linux computer, run the following command as root:

ntpdate -d -u DC_hostname
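The outbound port list earlier in this section can also be checked in one pass from any machine that has Python. The sketch below is an assumption-laden helper, not a VMware or Likewise tool: check_tcp_ports is a made-up name, and it only attempts plain TCP connects, so the UDP-based services (123 NTP, 137 NetBIOS Name Service) are left out.

```python
import socket

# TCP ports taken from the list above; 123 and 137 are UDP services and
# cannot be verified with a plain TCP connect, so they are not included.
AD_TCP_PORTS = [88, 135, 139, 389, 445, 464, 3268]


def check_tcp_ports(host, ports=AD_TCP_PORTS, timeout=3.0,
                    connect=socket.create_connection):
    """Return {port: True/False} depending on whether a TCP connect succeeded."""
    results = {}
    for port in ports:
        try:
            conn = connect((host, port), timeout=timeout)
            conn.close()
            results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name resolution errors.
            results[port] = False
    return results


# Example (hypothetical host name):
#   for port, ok in check_tcp_ports("dc01.example.com").items():
#       print(port, "open" if ok else "blocked/unreachable")
```

The connect parameter exists only so the function can be exercised without a network; in normal use the default is fine.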

  2. Log-in/Authentication issues
  • Make Sure You Are Joined to the Domain

Check ‘lw-lsa get-status’

  • Clear the Cache

Clear the cache to ensure that the client computer recognizes the user’s ID.

# ad-cache --delete-all

Clear the Likewise Kerberos cache to make sure there is not an issue. Execute the following command at the shell prompt with the user account that you are troubleshooting:

~# kdestroy

  • Check the Status of the Likewise Authentication Daemon

# /etc/init.d/lsassd status

  • Check Communication between the Likewise Daemon and AD

Verify that you can ping the DC from the ESX host.

  • Make Sure the AD Authentication Provider Is Running

# lw-lsa get-status

If the result does not include the AD authentication provider, or indicates that it is offline, restart the authentication daemon.

  • Check whether you can log on with SSH by executing the following command:

ssh DOMAIN\\username@localhost

  3. Lsassd crashes for various reasons, for example during trust enumeration
  • Analyze the lsassd, netlogond and lwiod logs to see exactly where the Likewise daemon is crashing.
  • Look into the hostd logs and tcpdump to get more info.
  4. Kerberos related issues
  • Start by looking into the packet captures (on both sides, ESXi and AD) to see whether we are getting a proper TGT and TGS.

// This can be related to the Kerberos cache; in that case empty the Kerberos cache using the ‘kdestroy’ command mentioned above.

  5. Hostd crash in Likewise code
  • Gather a full log bundle and engage VMware GSS.
  6. Windows AD server related issues
  • Gather guest OS logs and engage MS Support.

OK, so now we have all the troubleshooting options and methodology in one place. Now it is time for a real-life story based on one of my recent service requests: the customer is unable to log in using Active Directory credentials. The login reports invalid credentials even though “Authentication Services” shows that the host is joined to the correct domain. The issue is seen on most of the hosts within the environment. Only 2 hosts do not suffer from the problem, and we could not find any difference in their configuration. The customer is running the latest 6.0 build: 4600944.

Some other symptoms observed while troubleshooting the issue step by step:

  1. Tried to remove the host from the domain using the vSphere Client GUI on a host connected to vCenter – the host stops responding until we restart hostd. Restarting all management agents hangs on the Likewise agent indefinitely.
  2. Unable to stop the Active Directory Service – the server stops responding. After restarting hostd, or the entire host, the server is back in a normal operational state.
  3. Changed the Active Directory Service to not start with the host -> restarted ESXi – works
  4. Checked the auth type – ESXi now states that it is Local Authentication (so after all the restarts, ESXi finally left the domain)
  5. Added the host to the domain once again – the host stops responding. Restart hostd – works fine
  6. Checked the auth type – ESXi states that it is joined to the domain.
  7. Tried to add permissions for domain users – unable to select the domain to assign permissions
  8. From the AD perspective – the ESXi account is refreshed

Troubleshooting Action Taken

===============

  1. Verify that the Likewise agent is up and running (it is)
  2. Restart the Likewise agent on the hosts (no impact on the issue)
  3. Add the advanced setting UserVars.ActiveDirectoryPreferredDomainControllers as per KB https://kb.vmware.com/kb/2107385 – didn’t help
  4. To exclude any firewall issue blocking domain controller traffic, run ~# esxcli network firewall unload and retry the login with a domain account – didn’t help
  5. Increased the Likewise agent logging to debug and:
     a) Retried domain authentication to see the log entries
     b) Tried to leave -> rejoin the domain using the CLI (leave successful, rejoin causes the host to hang again until we reboot the host)
  6. Verify known issues in 6.0 related to authentication with AD – the issues were resolved in 6.0U1, while the customer is using the latest patch

 

Log Analysis

  1. Trying to stop LWSMD using SSH

[~] /etc/init.d/lwsmd stop

watchdog-lwsmd: Terminating watchdog process with PID 36150 Stopping Likewise Service Manager [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] …failed

 

  2. Retry domain authentication with debug Likewise logging (authentication does not succeed):

20161208115138:DEBUG:LwKrb5SetThreadDefaultCachePath():lwkrb5.c:410: Switched gss krb5 credentials path from <null> to FILE:/etc/likewise/lib/krb5cc_lsass.XXX.COM

20161208115138:DEBUG:lsass:MemCacheFindGroupByName():memcache.c:1081: Error code: 40017 (symbol: LW_ERROR_NOT_HANDLED)

20161208115138:DEBUG:lsass:LsaSrvFindProviderByName():state.c:128: Error code: 40040 (symbol: LW_ERROR_INVALID_AUTH_PROVIDER)

20161208115138:DEBUG:lsass:LsaSrvProviderServicesDomain():provider.c:151: Error code: 40040 (symbol: LW_ERROR_INVALID_AUTH_PROVIDER)

20161208115138:VERBOSE:lsass:LsaAdBatchMarshal():batch_marshal.c:525: Did not find object by NT4 name ‘ESX Admins’

20161208115138:DEBUG:lsass:LsaAdBatchFindSingleObject():batch.c:1388: Error code: 40071 (symbol: LW_ERROR_NO_SUCH_OBJECT)

20161208115138:DEBUG:lsass:AD_FindObjectByNameTypeNoCache():online.c:3519: Error code: 40071 (symbol: LW_ERROR_NO_SUCH_OBJECT)

20161208115138:DEBUG:lsass:AD_OnlineFindObjectByName():online.c:4129: Error code: 40012 (symbol: LW_ERROR_NO_SUCH_GROUP)

20161208115138:DEBUG:lsass:LsaSrvFindGroupAndExpandedMembers():api2.c:1626: Error code: 40012 (symbol: LW_ERROR_NO_SUCH_GROUP)

20161208115338:VERBOSE:lsass:LsaSrvIpcCheckPermissions():ipc_state.c:79: Permission granted for (uid = 0, gid = 0, pid = 169257) to open LsaIpcServer

20161208115338:VERBOSE:lsass-ipc:lwmsg_peer_log_accept():peer-task.c:271: (session:f09bcf7743520e1d-b414124c53159168) Accepted association 0x1f0e5be8

20161208115338:DEBUG:LwKrb5SetThreadDefaultCachePath():lwkrb5.c:410: Switched gss krb5 credentials path from <null> to FILE:/etc/likewise/lib/krb5cc_lsass.XXX.COM

20161208115338:INFO:netlogon:LWNetSrvGetDCName():dcinfo.c:97: Looking for a DC in domain ‘XXX’, site ‘<null>’ with flags 100

20161208115338:INFO:netlogon:LWNetSrvGetDCName():dcinfo.c:97: Looking for a DC in domain ‘XXX.com’, site ‘<null>’ with flags 100

20161208115338:INFO:netlogon:LWNetSrvGetDCName():dcinfo.c:97: Looking for a DC in domain ‘XXX.com’, site ‘<null>’ with flags 140

20161208115338:DEBUG:netlogon:LWNetCacheDbQuery():lwnet-cachedb.c:1079: Cached entry not found: XXX.com, , 1

20161208115338:DEBUG:netlogon:LWNetSrvGetDCName():dcinfo.c:128: Error at ../netlogon/server/api/dcinfo.c:128 [code: 1355]

20161208115338:DEBUG:netlogon:LWNetTransactGetDCName():ipc_client.c:249: Error at ../netlogon/client/ipc_client.c:249 [code: 1355]

20161208115338:DEBUG:netlogon:LWNetGetDCNameExt():dcinfo.c:133: Error at ../netlogon/client/dcinfo.c:133 [code: 1355]

 

  3. Try to rejoin the domain (which causes the host to hang in the end):

20161214123838:VERBOSE:lsass:LsaSrvIpcCheckPermissions():ipc_state.c:79: Permission granted for (uid = 0, gid = 0, pid = 39070) to open LsaIpcServer

20161214123838:VERBOSE:lsass-ipc:lwmsg_peer_log_accept():peer-task.c:271: (session:6b1bb0e33d95252a-e893c9a774c67d8e) Accepted association 0x1f07fe00

20161214123838:VERBOSE:lwreg:RegDbOpenKey():sqldb.c:1032: Registry::sqldb.c RegDbOpenKey() finished

20161214123838:DEBUG:lwreg:RegDbGetKeyValue_inlock():sqldb_p.c:1227: Error at ../lwreg/server/providers/sqlite/sqldb_p.c:1227 [status: LW_STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034 (-1073741772)]

20161214123838:DEBUG:lwreg:RegDbGetValueAttributes_inlock():sqldb_schema.c:846: Error at ../lwreg/server/providers/sqlite/sqldb_schema.c:846 [status: LW_STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034 (-1073741772)]

20161214123838:VERBOSE:lwreg:SqliteGetValueAttributes_Internal():regschema.c:360: Registry::sqldb.c SqliteGetValueAttributes_Internal() finished

20161214123838:DEBUG:lwreg:SqliteGetValue():sqliteapi.c:887: Error at ../lwreg/server/providers/sqlite/sqliteapi.c:887 [status: LW_STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034 (-1073741772)]

20161214123838:DEBUG:lwreg:RegTransactGetValueW():clientipc.c:810: Error at ../lwreg/client/clientipc.c:810 [status: LW_STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034 (-1073741772)]

20161214123838:DEBUG:lwreg:LwNtRegGetValueA():regntclient.c:801: Error at ../lwreg/client/regntclient.c:801 [status: LW_STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034 (-1073741772)]

20161214123838:DEBUG:lwreg:RegShellUtilGetValue():rsutils.c:1463: Error at ../lwreg/shellutil/rsutils.c:1463 [code: 40700]

20161214123838:DEBUG:LwpsLegacyGetDefaultJoinedDomain():lsapstore-backend-legacy-internal.c:711: -> 0 (ERROR_SUCCESS) (EE = 685)

20161214123838:DEBUG:LsaPstoreGetPasswordInfoW():lsapstore-main.c:109: -> 2692 (NERR_SetupNotJoined) (EE = 80)

20161214123838:DEBUG:LsaPstoreGetPasswordInfoA():lsapstore-main-a.c:89: -> 2692 (NERR_SetupNotJoined) (EE = 71)

20161214123838:DEBUG:lsass:AD_GetMachineAccountInfoA():machinepwdinfo.c:91: Error code: 2692 (symbol: NERR_SetupNotJoined)

20161214123838:DEBUG:lsass:AD_IoctlGetMachineAccount():ioctl.c:102: Error code: 2692 (symbol: NERR_SetupNotJoined)

20161214123838:DEBUG:lsass:AD_ProviderIoControl():provider-main.c:4377: Error code: 2692 (symbol: NERR_SetupNotJoined)

20161214123838:DEBUG:lsass:LsaSrvProviderIoControl():provider.c:99: Error code: 2692 (symbol: NERR_SetupNotJoined)

20161208115338:INFO:netlogon:LWNetSrvGetDCName():dcinfo.c:97: Looking for a DC in domain ‘XXX.com’, site ‘<null>’ with flags 140

20161208115338:DEBUG:netlogon:LWNetCacheDbQuery():lwnet-cachedb.c:1079: Cached entry not found: XXX.com, , 1

20161208115338:DEBUG:netlogon:LWNetSrvGetDCName():dcinfo.c:128: Error at ../netlogon/server/api/dcinfo.c:128 [code: 1355]

20161208115338:DEBUG:netlogon:LWNetTransactGetDCName():ipc_client.c:249: Error at ../netlogon/client/ipc_client.c:249 [code: 1355]
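
With debug logging enabled the Likewise log grows quickly, so it helps to reduce it to a count of error symbols before reading it line by line. The following is a minimal sketch that assumes the line layout seen in the excerpts above (timestamp, level, location, then "Error code: N (symbol: NAME)"); the function name count_error_symbols is made up for this example.

```python
import re
from collections import Counter

# Matches lines such as:
# 20161208115138:DEBUG:lsass:Func():file.c:128: Error code: 40040 (symbol: NAME)
ERROR_RE = re.compile(
    r"^(?P<ts>\d{14}):(?P<level>[A-Z]+):(?P<where>.*?):\s*"
    r"Error code: (?P<code>\d+) \(symbol: (?P<symbol>\w+)\)"
)


def count_error_symbols(lines):
    """Count how often each error symbol appears in Likewise log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.match(line.strip())
        if match:
            counts[match.group("symbol")] += 1
    return counts


sample_lines = [
    "20161208115138:DEBUG:lsass:MemCacheFindGroupByName():memcache.c:1081: Error code: 40017 (symbol: LW_ERROR_NOT_HANDLED)",
    "20161208115138:DEBUG:lsass:LsaSrvFindProviderByName():state.c:128: Error code: 40040 (symbol: LW_ERROR_INVALID_AUTH_PROVIDER)",
    "20161208115138:DEBUG:lsass:LsaSrvProviderServicesDomain():provider.c:151: Error code: 40040 (symbol: LW_ERROR_INVALID_AUTH_PROVIDER)",
    "20161208115338:INFO:netlogon:LWNetSrvGetDCName():dcinfo.c:97: Looking for a DC in domain",
]
for symbol, count in count_error_symbols(sample_lines).most_common():
    print(count, symbol)
```

Lines in the other format seen above ("Error at ... [code: 1355]") are deliberately ignored by this pattern; a second regular expression would be needed to count those.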

At this stage we decided to capture network packets and analyze the communication between ESXi and the DC. Time showed that this was a good direction:

//packet capture methodology

  • enable Likewise logging:

/etc/init.d/lwsmd start

/usr/lib/vmware/likewise/bin/lwsm set-log-level trace
/usr/lib/vmware/likewise/bin/lwsm set-log file /var/log/likewise.log
tail -f /var/log/likewise.log

  • start tcpdump:

tcpdump-uw -i 1 -n -s0 not tcp port 22 -C 50M -W 5 -w /var/log/capture10.pcap -vvv

 

  • add ESXi to the domain from the CLI to capture the communication flow:

/usr/lib/vmware/likewise/bin/domainjoin-cli --loglevel verbose --logfile

join xxx.com plp24308
esxi and likewise2

We found that IPv6 was disabled on the problematic ESXi hosts, while the DC was still using IPv6 in its communication. After a couple of tests we confirmed that after enabling IPv6 on ESXi, or disabling it completely on the DC side:

https://support.microsoft.com/en-us/help/929852/how-to-disable-ipv6-or-its-components-in-windows

there is finally no error when adding a host to the domain or authenticating against the DC.

To clarify the whole situation further, we decided to perform an additional investigation with VMware Support. GSS confirmed that they had located the issue:

“…with the newer versions (vSphere 6) of ESXi in case it receives kdc in IPv6 format. In that situation the host will try to connect with IPv6. In case host has IPv6 disabled it will fail to join the domain “

// The bug is planned to be fixed in vSphere 6.5 U1
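
The condition GSS described can be checked up front once you know which addresses DNS returns for the KDC/DC. The sketch below is illustrative only: split_by_family and join_at_risk are invented names, the addresses are documentation examples, and the rule simply restates the quoted finding (KDC offered over IPv6 while the host has IPv6 disabled).

```python
import ipaddress


def split_by_family(addresses):
    """Split a list of IP address strings into (ipv4, ipv6) lists."""
    v4, v6 = [], []
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        (v6 if ip.version == 6 else v4).append(str(ip))
    return v4, v6


def join_at_risk(kdc_addresses, host_ipv6_enabled):
    """True when the KDC is advertised over IPv6 but the host has IPv6 disabled."""
    _, v6 = split_by_family(kdc_addresses)
    return bool(v6) and not host_ipv6_enabled


# Hypothetical addresses as they might be returned by DNS for a DC.
print(join_at_risk(["192.0.2.10", "2001:db8::10"], host_ipv6_enabled=False))
```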

ESXi and Likewise – troubleshooting guide – part 1

Last week I had to troubleshoot a strange issue related to Active Directory integration with ESXi (version 6.0). This was the motivation to prepare a small (two-article) series about ESXi/Likewise integration and troubleshooting based on my latest experience.

VMware uses PowerBroker Identity Services (formerly known as Likewise) for adding an ESX host to a Windows AD environment. As usual, it is good to begin with some theoretical background about the related components and technology, and we describe it all in this part.

Below are some of the basics of PAM and Kerberos:

PAM (Pluggable Authentication Modules) – a mechanism to integrate multiple low-level authentication schemes into a high-level API. All application programs such as hostd, dcui etc. use PAM for creating users and authenticating them. They refer to the system-auth file, which in turn refers to /etc/security/login.map. login.map maps the ‘vpxa’ user to system-auth-local, and all other users are mapped to system-auth-generic. PAM on its own can’t implement Kerberos; it’s not possible for a PAM module to request a Kerberos service ticket (TGS) from a Kerberos key distribution center (KDC).
Kerberos – Kerberos protocol is designed to provide reliable authentication over open and insecure networks where communications between the hosts belonging to it may be intercepted. So we can say Kerberos is an authentication protocol for trusted hosts on untrusted networks.

esxi and likewise1

If you are interested in deeper knowledge of this topic, take a look at this Kerberos tutorial: http://www.kerberos.org/software/tutorial.html

Before we describe the communication with ESXi, let’s gather together all the Likewise components:
• Lsassd – The Likewise authentication daemon handles authentication, authorization, caching, and idmap lookups,
• Netlogond – Detects the optimal domain controller and global catalog and caches the data,
• Lwiod – The Likewise input-output service. It communicates over SMB with SMB servers,
• Caches – To maintain the current state and to improve performance, the Likewise agent caches information in several files, all of which are in /etc/likewise/db/
• lsass-adcache.filedb – Cache managed by the AD authentication provider,
• netlogon-cache.filedb – Domain controller affinity cache, managed by netlogond,
• pstore.filedb – Repository storing the join state and machine password.

OK, now it’s time to consider how Likewise extends Kerberos authentication to ESXi:

1. User logs in to ESX (c# or web client),
2. Username and password are sent to PAM,
3. pam_lsass.so library communicates with the Lsassd,
4. from username and password Lsassd generates a secret key,
5. using the secret key, Lsassd requests a TGT from AD’s KDC,
6. The KDC verifies the secret key and then grants the ESXi Host a TGT,
7. ESXi host and the KDC exchange messages to authenticate the client,
8. Lsassd can use the TGT to request service tickets for other services such as ssh.

To clarify further, let’s discuss the important algorithms (netlogon) related to this process to address common questions:

1. How does Netlogond find the best DC (prioritization)?

Likewise Netlogon obtains a list of candidate Domain Controllers using DNS. The algorithm for doing this is based on the algorithm used in Windows Netlogon. Each candidate Domain Controller which matches the site criteria is queried with a CLDAP request for the Netlogon attribute. The time to respond for each Domain Controller is stored as PingTime in the DomainControllerInfo output parameter. The Domain Controller with the lowest PingTime is returned to the caller.
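
The selection rule described above (filter by site, then take the lowest PingTime) can be written down in a few lines. This is a deliberately simplified sketch and not the netlogond implementation: the Candidate class, its field names and pick_domain_controller are assumptions made for illustration, and the ping times would in reality come from CLDAP responses.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Candidate:
    name: str
    site: str
    ping_time_ms: Optional[float]  # None when the DC did not answer the CLDAP ping


def pick_domain_controller(candidates: Iterable[Candidate],
                           site: Optional[str] = None) -> Optional[Candidate]:
    """Return the responding candidate with the lowest ping time, or None."""
    eligible = [
        c for c in candidates
        if c.ping_time_ms is not None and (site is None or c.site == site)
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.ping_time_ms)


# Hypothetical candidates, as if obtained from DNS SRV records.
dcs = [
    Candidate("dc01", "Warsaw", 12.0),
    Candidate("dc02", "Warsaw", 4.5),
    Candidate("dc03", "London", 1.0),
    Candidate("dc04", "Warsaw", None),
]
best = pick_domain_controller(dcs, site="Warsaw")
print(best.name if best else "no DC found")
```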

2. Preferred DC

Netlogon attempts to find the domain controller which responds the quickest to CLDAP pings with a preference for domain controllers in the same site. The algorithm is rather complex 😉
If the request includes a site, then the query order is:

a) Preferred domain controller plugin with the requested site
b) DNS with the requested site

3. Domain Join Process
Domain join involves several steps and communication between various components:
a) creating computer account in DC
b) creating machine account and setting password
c) saving machine account/password to pstore db and updating kerberos keytab

The last part of this article discusses the important packets in a tcpdump capture, which matter a lot when troubleshooting problems with joining ESXi to a domain:
1. CLDAP – Usage of CLDAP packets depends upon the attribute:

If attribute = netlogon -> These CLDAP pings are used by netlogond to verify that the domain controller is alive and to check whether the domain controller matches a specific set of requirements, e.g. netlogon version (NtVer).
If attribute = time -> These are used by netlogon for selecting the nearest DC during the domain controller discovery phase.
2. ARP – The Address Resolution Protocol is used to resolve a network-layer address into a link-layer address, i.e. an IP address to a MAC address. These are common packets, not specific to AD.

3. DNS – DNS queries related to srv records. Microsoft decided to use SRV records as a key part of the procedure whereby a client finds a domain controller. So where do these records come from? They are registered with DNS by the NetLogon service of a domain controller when it starts. There are actually quite a few of these records, but for right now let’s just look at two of them, the ones that have to do with domain controllers. They are in the following formats:
_ldap._tcp.dc._msdcs.dnsdomainname
_ldap._tcp.sitename._sites.dc._msdcs.dnsdomainname
4. KRB5 : For Krb5 you would see four packets viz. AS-REQ, AS-REP, TGS-REQ, TGS-REP:
• AS_REQ is the initial user authentication request (i.e. made with kinit). This message is directed to the KDC component known as the Authentication Server (AS);
• AS_REP is the reply of the Authentication Server to the previous request. Basically it contains the TGT (encrypted using the TGS secret key) and the session key (encrypted using the secret key of the requesting user);
• TGS_REQ is the request from the client to the Ticket Granting Server (TGS) for a service ticket. This packet includes the TGT obtained from the previous message and an authenticator generated by the client and encrypted with the session key;
• TGS_REP is the reply of the Ticket Granting Server to the previous request. Located inside is the requested service ticket (encrypted with the secret key of the service) and a service session key generated by TGS and encrypted using the previous session key generated by the AS;

5. LDAP : A standards-based protocol that is used for communication between directory clients and a directory service. LDAP is the primary directory access protocol for Active Directory. LDAP searches are the most common LDAP operations that are performed against an Active Directory domain controller. An LDAP search retrieves information about all objects within a specific scope that have certain characteristics, for example, the telephone number of every person in a department.
6. NBNS – NBNS serves much the same purpose as DNS does: translate human-readable names to IP addresses

7. RARP – Reverse Address Resolution Protocol i.e. converts MAC address to IP

8. SSH – Secure shell packets

9. TCP – Network traffic that uses TCP as its transport protocol, viz. Kerberos, ssh, https, ldap etc.
Processes transmit data by calling on the TCP and passing buffers of data as arguments. The TCP packages the data from these buffers into segments and calls on the internet module [e.g. IP] to transmit each segment to the destination TCP.

10. SMB – Server Message Block, also known as Common Internet File System (CIFS), operates as an application-layer network protocol mainly used for providing shared access to files, printers, serial ports, and miscellaneous communications between nodes on a network. It also provides an authenticated inter-process communication mechanism.
Now we are prepared for troubleshooting! Stay tuned for the next part 🙂

Mystery of the broken VM

Today my colleague (a VMware administrator) asked me a small favour: to help perform an RCA (root cause analysis) for one of the production VMs that recently had a problem. For some reason the VM was migrated (my colleague states that it happened without administrative intervention) to another ESXi host that does not have the proper network configuration for this VM, which caused an outage of the whole system. I asked when the issue occurred and we went deeper into the logs in order to find out what exactly happened.
We started with the VM log, vmware.log.

vmx| I120: VigorTransportProcessClientPayload: opID=52A52E10-000000CE-7b-15-97b0 seq=26853: Receiving PowerState.InitiatePowerOff request.
vmx| I120: Vix: [36833 vmxCommands.c:556]: VMAutomation_InitiatePowerOff. Trying hard powerOff
| vmx| I120: VigorTransport_ServerSendResponse opID=52A52E10-000000CE-7b-15-97b0 seq=26853: Completed PowerState request.
vmx| I120: Stopping VCPU threads…
vmx| I120: VMX exit (0).
| vmx| I120: AIOMGR-S : stat o=29 r=50 w=25 i=63 br=795381 bw=237700 
vmx| I120: OBJLIB-LIB: ObjLib cleanup done.
 -> vmx| W110: VMX has left the building: 0. 

According to VMware KB 2097159, “VMX has left the building: 0” is an informational message caused by powering off a virtual machine or performing a vMotion of the virtual machine to a new host. It can be safely ignored, so we had our first clue related to the VM migration.
Next we moved to fdm.log (using the time stamps from vmware.log) and, surprise surprise, the VM had for some reason been powered off 1h before the issue occurred:

vmx local-host: local power state=powered off; assuming user power off; global power state=powered off
verbose fdm[FF9DE790] [Originator@6876 sub=Invt opID=SWI-95f1a3f] [VmStateChange::SaveToInventory] vm /vmfs/volumes/vmx from __localhost__ changed inventory cleanPwrOff=1

Now we needed to find out why the VM was powered off, so we went through the hostd logs using grep with the VM name:
info hostd[3DE40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes.vmx] State Transition (VM_STATE_POWERING_OFF -> VM_STATE_OFF)
info hostd[3DE40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/.vmx] State Transition (VM_STATE_POWERING_OFF -> VM_STATE_OFF)
info hostd[3E9C1B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/.vmx opID=6197d370-76-97ee user=vpxuser:VSPHERE.LOCAL\Administrator] State Transition (VM_STATE_OFF -> VM_STATE_CREATE_SNAPSHOT)
info hostd[3E9C1B70] [Originator@6876 sub=Libs opID=6197d370-76-97ee user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotConfigInfoReadEx: Creating new snapshot dictionary, ‘/vmfs/volumes/.vmsd.usd’.
info hostd[3E9C1B70] [Originator@6876 sub=Libs opID=6197d370-76-97ee user=vpxuser:VSPHERE.LOCAL\Administrator] SNAPSHOT: SnapshotCombineDisks: Consolidating from ‘/vmfs/volumes/000001.vmdk’ to ‘/vmfs/volumes/.vmdk’.

-> info hostd[3F180B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 485016 : Virtual machine disks consolidated successfully on in cluster in ha-datacenter. 

So we confirmed that shortly before the issue occurred the VM was powered off for snapshot consolidation. At this point we assumed this might be related to the main issue, but we decided to check the vobd log as well (it is less verbose than vmkernel and gives a good view of ESXi health when it comes to storage and networking):

uplink.transition.down] Uplink: vmnic0 is down. Affected portgroup: VM Network VLAN130. 2 uplinks up. Failed criteria: 128 Uplink: vmnic0 is down. Affected portgroup: VM Network VLAN8 ESOK2. 2 uplinks up. Failed criteria:128
Lost uplink redundancy on virtual switch “vSwitch0”. Physical NIC vmnic0 is down. Affected port groups: “ISCSI”, “Management Network”

This is it, a direct hit on the cause of the main issue: a network adapter problem. With the new timestamps, just for confirmation, we went back to fdm.log:

fdm[FFC67B70] [Originator@6876 sub=Notifications opID=SWI-69d87e37] [Notification::AddListener] Adding listener of type Csi::Notifications::VmStateChange: Csi::Policies::GlobalPolicy (listeners = 4)
--> Protected vms (0):
--> Unprotect request vms (0):
--> Locked datastores (0):
--> Events (4):
--> EventEx=com.vmware.vc.HA.ConnectedToMaster vm= host=host-24 tag=host-24:1225982388:0
--> EventEx=com.vmware.vc.HA.AllHostAddrsPingable vm= host=host-24 tag=host-24:1225982388:1
--> EventEx=com.vmware.vc.HA.AllIsoAddrsPingable vm= host=host-24 tag=host-24:1225982388:2
--> EventEx=com.vmware.vc.ha.VmRestartedByHAEvent

Using VMware KB 2036555 (Determining which virtual machines are restarted during a vSphere HA failover), we confirmed that HA reacted to the network outage and powered on the VM on a different, unaffected ESXi host.
Conclusion: always correlate the log of the problematic VM with the ESXi logs (hostd, vobd, vmkernel) to get the full picture of the issue.
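The correlation described above can be sketched as a small shell helper. This is only a minimal sketch: the function name correlate_logs is made up for illustration, and it assumes that every log line starts with an ISO-style timestamp and that the file paths contain no colons. It prints the matching lines from all given logs as one timeline, each prefixed with the name of the log it came from.

```shell
# correlate_logs PATTERN FILE...
# Hypothetical helper: print every line matching PATTERN (case-insensitive)
# from the given log files, prefixed with the file name, and order the
# result by the text that follows the file name. Assumption: that text
# starts with a sortable timestamp such as 2017-01-01T10:00:02Z.
correlate_logs() {
    pattern=$1
    shift
    # -H always prints the file name, -i ignores case
    grep -H -i -- "$pattern" "$@" | LC_ALL=C sort -t: -k2
}

# Example call (paths and VM name are placeholders, not taken from the case above):
# correlate_logs "myvm" /var/log/hostd.log /var/log/fdm.log /var/log/vobd.log
```

Reading the merged output top to bottom makes it easier to see, for example, that an uplink went down in one log shortly before a restart event shows up in another.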

 

VMware Auto Deploy Configuration in vSphere 6.5


The architecture of Auto Deploy has changed in vSphere 6.5. One of the main differences is that Image Builder is now built into vCenter, so you can create image profiles through the GUI instead of PowerCLI. That is really good news for those who are not keen on PowerCLI. But let's go through the new configuration process of Auto Deploy. Below I gathered all the steps necessary to configure Auto Deploy in your environment.

  1. Enable the Auto Deploy services on vCenter Server. Go to Administration -> System Configuration -> Related Objects, then find and start the following services:
  • Auto Deploy
  • ImageBuilder Service

You can change the startup type to start them with the vCenter server automatically as well.

Caution! If no services are listed there, the vmonapi and vmware-sca services are probably stopped.

To start them, log in to vCenter Server through SSH and use the following commands:

# service-control --status         // to verify the status of these services

# service-control --start vmonapi vmware-sca       // to start the services

Next, go back to the Web Client and refresh the page.

 

  2. Prepare the DHCP server and configure a DHCP scope, including the default gateway. A Dynamic Host Configuration Protocol (DHCP) scope is the consecutive range of possible IP addresses that the DHCP server can lease to clients on a subnet. Scopes typically define a single physical subnet on your network to which DHCP services are offered. Scopes are the primary way for the DHCP server to manage distribution and assignment of IP addresses and any related configuration parameters to DHCP clients on the network.

When the basic DHCP scope settings are ready, you need to configure additional options:

  • Option 066: the Boot Server Host Name
  • Option 067: the Bootfile Name (the file name shown on the Auto Deploy Configuration tab on vCenter Server: kpxe.vmw-hardwired)
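For readers who run a Linux DHCP server, the two options can be sketched as follows. This assumes an ISC DHCP server (dhcpd), which this post does not use or name, so treat it purely as an illustration. The function name and the server address in the example are hypothetical. In dhcpd.conf, option 066 is called tftp-server-name and option 067 is called bootfile-name.

```shell
# print_autodeploy_dhcp_options TFTP_SERVER BOOTFILE
# Hypothetical helper: print the two dhcpd.conf lines that correspond to
# DHCP options 066 and 067. Assumption: an ISC DHCP server is in use.
print_autodeploy_dhcp_options() {
    tftp_server=$1
    bootfile=$2
    printf 'option tftp-server-name "%s";  # option 066, Boot Server Host Name\n' "$tftp_server"
    printf 'option bootfile-name "%s";  # option 067, Bootfile Name\n' "$bootfile"
}

# Example (192.0.2.10 is a placeholder address; use the bootfile name
# shown on your own Auto Deploy Configuration tab):
# print_autodeploy_dhcp_options 192.0.2.10 kpxe.vmw-hardwired
```

The printed lines would go inside the subnet declaration that serves the hosts you want to provision.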

  3. Configure the TFTP server. For lab purposes I nearly always use the SolarWinds TFTP server, as it is very easy to manage. Copy the TFTP Boot Zip files, available on the Auto Deploy Configuration page mentioned in step 2, to the TFTP server's file folder and start the TFTP service.

At this stage, when you boot your fresh server, it should get an IP address and connect to the TFTP server. In the Discovered Hosts tab of Auto Deploy Configuration you will see the hosts that received IP addresses and some information from the TFTP server, but have no Deploy Rule assigned yet.

  4. Create an Image Profile.

Go to the Auto Deploy Configuration page -> Software Depots tab and click Import Software Depot.

Click on Image Profiles to see the Image Profiles that are defined in this Software Depot.

The ESXi software depot contains the image profiles and software packages (VIBs) that are used to run ESXi. An image profile is a list of VIBs.

 

Image profiles define the set of VIBs to boot ESXi hosts with. VMware and VMware partners make image profiles and VIBs available in public depots. Use the Image Builder PowerCLI to  examine the depot and the Auto Deploy rule engine to specify which image profile to assign to which host. VMware customers can create a custom image profile based on the public image profiles and VIBs in the depot and apply that image profile to the host.

 

  5. Add Software Depot.

Click on the Add Software Depot icon and add a custom depot.

Next, in the newly created custom software depot, select Image Profiles and click New Image Profile.

I selected the minimum required VIBs to boot ESXi host which are:

  • esx-base 6.5.0-0.0.4073352 VMware ESXi is a thin hypervisor integrated into server hardware.
  • misc-drivers 6.5.0-0.0.4073352 This package contains miscellaneous vmklinux drivers
  • net-vmxnet3 1.1.3.0-3vmw.650.0.0.4073352 VMware vmxnet3
  • scsi-mptspi 4.23.01.00-10vmw.650.0.0.4073352 LSI Logic Fusion MPT SPI driver
  • shim-vmklinux-9-2-2-0 6.5.0-0.0.4073352 Package for driver vmklinux_9_2_2_0
  • shim-vmklinux-9-2-3-0 6.5.0-0.0.4073352 Package for driver vmklinux_9_2_3_0
  • vmkplexer-vmkplexer 6.5.0-0.0.4073352 Package for driver vmkplexer
  • vsan 6.5.0-0.0.4073352 VSAN for ESXi.
  • vsanhealth 6.5.0-0.0.4073352 VSAN Health for ESXi.
  • ehci-ehci-hcd 1.0-3vmw.650.0.0.4073352 USB 2.0 ehci host driver
  • xhci-xhci 1.0-3vmw.650.0.0.4073352 USB 3.0 xhci host driver
  • usbcore-usb 1.0-3vmw.650.0.0.4073352 USB core driver
  • vmkusb 0.1-1vmw.650.0.0.4073352 USB Native Driver for VMware

But the list could be different for you.

 

  6. Create a Deploy Rule.

  7. Activate the Deploy Rule.

  8. That's it. Now you can restart your host; it should boot and install according to your configuration.

VMware Auto Deploy considerations


According to VMware's definition, vSphere Auto Deploy can provision hundreds of physical hosts with ESXi software. You can specify the image to deploy and the hosts to provision with the image. Optionally, you can specify host profiles to apply to the hosts, a vCenter Server location (datacenter, folder or cluster), and assign a script bundle for each host. In short, it is a tool to automate your ESXi deployment or upgrade.

As far as I know, it is not a widely used tool, in particular on the Polish market. However, it can help integrator companies deploy new environments much faster. Furthermore, VMware claims that scripted or automated deployments should be used for every deployment with 5 or more hosts. Nonetheless, even if you are working as a System Engineer or in another implementation role, I believe you are not installing new deployments every week. If it is every month, lucky you.

Well, is it really worth preparing an Auto Deploy environment to deploy, for instance, 8 new hosts? It depends.

IMHO, for such small deployments, if you are really keen on making it a little bit faster, the better way is to use kickstart scripts. It can be much faster, especially if you use them at least from time to time and have prepared a good template. (With vSphere 6.5 I am changing my mind a little, due to the changes which make Auto Deploy preparation much quicker.)

However, Auto Deploy is not only about deployment. It can also serve as a kind of environment and change management, although only in a specific kind of infrastructure, where you use Auto Deploy to boot ESXi hosts instead of booting them from local hard drives or SD cards.

Nevertheless, in Poland you are more likely to come across classic PXE deployment or boot from SAN than Auto Deploy. Is the same trend seen around the world?

I am looking forward to hearing about your experience with Auto Deploy.

VM Consolidation – Survival Guide


A survival guide for VM snapshot consolidation problems, all in one place:

Note! Make sure any backup software is turned off or that all jobs are stopped. A reboot of the backup server is required to clear any potential residual locks.

  1. Restart the vCenter Server service: https://kb.vmware.com/kb/1003895
  2. Restart the management agents on the ESXi hosts in the cluster where the problematic VMs are running:

# services.sh restart

See https://kb.vmware.com/kb/1003490. Alternatively, check manually to determine "who" is holding the lock.

  3. Use the vmkfstools -D command against the VM snapshot files:

/vmfs/volumes/<datastore># vmkfstools -D <file name>
You see an output similar to:

[root@test-esx1 testvm]# vmkfstools -D test-000008-delta.vmdk
Lock [type 10c00001 offset 45842432 v 33232, hb offset 4116480
gen 2397, mode 2, owner 00000000-00000000-0000-000000000000 mtime 5436998] <-- MAC address of lock owner
RO Owner[0] HB offset 3293184 xxxxxxxx-xxxxxxxx-xxx-xxxxxxxxxxxx <-- MAC address of read-only lock owner
Addr <4, 80, 160>, gen 33179, links 1, type reg, flags 0, uid 0, gid 0, mode 100600
len 738242560, nb 353 tbz 0, cow 0, zla 3, bs 2097152

//more information in kb: https://kb.vmware.com/kb/10051
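Picking the MAC address out of that output by eye is error-prone, so here is a minimal sketch of a helper that does it. The function name lock_owner_mac is made up for illustration, and it assumes the output has an "owner" field in the format shown above, where the last 12 hex digits are the MAC address of the host holding the lock.

```shell
# lock_owner_mac
# Hypothetical helper: read `vmkfstools -D` output on stdin and print the
# last 12 hex digits of the first "owner" field as a colon-separated MAC.
lock_owner_mac() {
    sed -n 's/.*owner [0-9a-fA-F-]*-\([0-9a-fA-F]\{12\}\).*/\1/p' |
        head -n 1 |
        sed 's/\(..\)/\1:/g; s/:$//'
}

# Example (the file name is the one used in the sample output above):
# vmkfstools -D test-000008-delta.vmdk | lock_owner_mac
# The result can then be compared with the MAC addresses reported by
# `esxcli network nic list` on each host in the cluster.
```

In the sample output above the owner field is all zeros, so the helper would print 00:00:00:00:00:00; in that case look at the read-only owner line instead.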

If an ESXi host is holding the lock, you can restart the management agents as advised above, migrate all VMs off the host and reboot it, or determine which process is holding the lock by running one of these commands:

# lsof file

# lsof | grep -i file

For example:

# lsof | grep test02-flat.vmdk

You should see an output similar to:

COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME

71fd60b6- 3661 root 4r REG 0,9 10737418240 23533 Test02-flat.vmdk

Check the process with the PID returned above, in our example:

# ps -ef | grep 3661

To kill the process, run the command:

# kill <PID>
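The lsof step above can be sketched as a small filter that prints only the PIDs. The function name lock_holder_pids is hypothetical, and the sketch assumes the column layout shown in the example output (PID in the second column, file name in the last one).

```shell
# lock_holder_pids FILE_NAME
# Hypothetical helper: read lsof output on stdin and print the unique PIDs
# of the lines whose last column contains FILE_NAME (case-insensitive).
lock_holder_pids() {
    awk -v name="$1" 'index(tolower($NF), tolower(name)) > 0 { print $2 }' | sort -u
}

# Example:
# lsof | lock_holder_pids test02-flat.vmdk
```

It only lists the PIDs. Check each one with ps before you decide to kill anything.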

Once the lock problems are solved, we can continue with the VM consolidation process:

  4. Connect directly to the ESXi host where the problematic VM is running
  5. Power off the problematic VM
  6. Disable CBT for the virtual machine (very often the ctk files are corrupt, for example when a backup job was run on a VM with an active snapshot, which is an unsupported configuration). For more information, see: http://kb.vmware.com/kb/1031873
  7. Remove any files ending with -ctk.vmdk in the virtual machine directory.
  8. Enable CBT for the virtual machine again, see: http://kb.vmware.com/kb/1031873
  9. Remove the VM from the inventory and add it again (just to verify the integrity of the VM configuration; if there is a problem with the .vmx file, you will get an error message and will need to correct the VM configuration). More information in KB: https://kb.vmware.com/kb/1003743
  10. Create a snapshot:

Right-click the virtual machine.

Click Snapshot.

Click Take Snapshot.

  11. Perform a Delete All operation:

Right-click the virtual machine.

Click Snapshot.

Click Snapshot Manager.

Click Delete All.

TIP: To verify that the snapshots are being merged, run one of these commands:

# watch "ls -lhut --time-style=full-iso *-delta.vmdk"

# watch "ls -lh --full-time *-delta.vmdk *-flat.vmdk"

//more info in kb: https://kb.vmware.com/kb/1007566

  12. Power on the VM and verify the fix
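For the step that removes the -ctk.vmdk files, a slightly more cautious variant is to move them aside instead of deleting them. This is a minimal sketch and a deliberate change from the procedure above: the function name stash_ctk_files and the ctk-backup folder are made up for illustration, and the VM should be powered off with CBT disabled before you run anything like it.

```shell
# stash_ctk_files VM_DIR
# Hypothetical helper: move every *-ctk.vmdk file in VM_DIR into a
# ctk-backup subfolder instead of deleting it, and report what was moved.
stash_ctk_files() {
    vm_dir=$1
    backup_dir="$vm_dir/ctk-backup"
    mkdir -p "$backup_dir" || return 1
    for f in "$vm_dir"/*-ctk.vmdk; do
        [ -e "$f" ] || continue   # the pattern matched nothing
        mv "$f" "$backup_dir"/ && echo "moved: $f"
    done
}

# Example (the path is a placeholder):
# stash_ctk_files /vmfs/volumes/<datastore>/<VM folder>
```

Once the consolidation has succeeded and the VM is running fine, the ctk-backup folder can be removed.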

 

However, if the above does not solve the problem, we have two alternative options:

  a) Clone or Storage vMotion the problematic VMs to a different datastore
  b) Use VMware Converter and perform a V2V operation

That's it, my survival guide for VM snapshot consolidation problems. I wonder whether you have any additions or a different approach?

Adding a sound card to ESXi hosted VM


A sound card in a vSphere virtual machine is an unsupported configuration. This feature is intended for virtual machines created in VMware Workstation. However, you can still add an HD Audio device to a vSphere virtual machine by manually editing the .vmx file. I have tested it in our lab environment and it works just fine.

The procedure is as follows:

  1. Check which datastore the VM without a sound card resides on.
  2. Log in as root over SSH to the ESXi host where the VM resides.
  3. Navigate to /vmfs/volumes/<VM LUN>/<VM folder>
     In my example it was:
     ~# cd /vmfs/volumes/Local_03esx-mgmt_b/V11_GSS_DO
  4. Shut down the VM.
  5. Edit the .vmx file using the vi editor.

IMPORTANT:
Make a backup copy of the .vmx file. If your edits break the virtual machine, you can roll back to the original version of the file.
For more information about editing files on an ESXi host, refer to KB article: https://kb.vmware.com/kb/1020302

  6. Once you have opened the .vmx file for editing, go to the bottom of the file and add the following lines:
    sound.present = "true"
    sound.allowGuestConnectionControl = "false"
    sound.virtualDev = "hdaudio"
    sound.fileName = "-1"
    sound.autodetect = "true"
  7. Save the file and power on the virtual machine.
  8. Once it has booted and you have enabled the Windows Audio service, sound should work.
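The backup and the edit from the steps above can also be scripted. This is a minimal sketch, not a supported procedure: the function name add_sound_card is hypothetical, the VM must be powered off first, and the helper does nothing if a sound.present line already exists in the file.

```shell
# add_sound_card VMX_FILE
# Hypothetical helper: back up the .vmx file to VMX_FILE.bak, then append
# the five sound.* lines from the procedure above.
add_sound_card() {
    vmx=$1
    [ -f "$vmx" ] || { echo "no such file: $vmx" >&2; return 1; }
    if grep -q '^sound\.present' "$vmx"; then
        echo "sound settings already present, nothing changed" >&2
        return 0
    fi
    cp -p "$vmx" "$vmx.bak" || return 1
    # Make sure the file ends with a newline before appending
    [ -z "$(tail -c 1 "$vmx")" ] || echo >> "$vmx"
    cat >> "$vmx" <<'EOF'
sound.present = "true"
sound.allowGuestConnectionControl = "false"
sound.virtualDev = "hdaudio"
sound.fileName = "-1"
sound.autodetect = "true"
EOF
}

# Example (the path is a placeholder):
# add_sound_card /vmfs/volumes/<VM LUN>/<VM folder>/<VM name>.vmx
```

If the VM refuses to power on afterwards, copy the .bak file back over the .vmx file to roll back.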

If you go to "Edit Settings" of the VM, you will see a note that the device is unsupported. Please be aware that after adding a sound card to your virtual machine you may experience unexpected behavior (tip: in our lab environment this configuration works without issues).

VCSA deployment and migration options


The vCenter Server Appliance deployment experience has been enhanced in the vSphere 6.5 release. Installation workflow is now performed in 2 stages. The first stage deploys an appliance with the basic configuration parameters: IP, hostname, and sizing information including storage, memory, and CPU resources.

Stage 2 then completes the configuration by setting up SSO and role-specific settings. Once Stage 1 is complete, we can take a snapshot of the VM and roll back if any mistakes are made in Stage 2. This avoids having to start over completely if anything goes wrong during the deployment process.

NOTE!!! There are versions of the deployment application available for Windows, Linux, and macOS.


A new feature in vSphere 6.5 is the ability to migrate a Windows vCenter Server 5.5 or 6.0 to a vCenter Server Appliance 6.5. The migration process starts by running the Migration Assistant, which serves two purposes. First, it runs pre-checks on the source Windows vCenter Server 5.5 or 6.0 to determine whether it meets the criteria to be migrated. Second, it is the data transport mechanism that migrates data from the source Windows vCenter Server 5.5 or 6.0 to the target vCenter Server Appliance 6.5.

The Migration tool will automatically deploy a new vCenter Server Appliance 6.5 and migrate configuration, inventory, and alarm data by default from a Windows vCenter Server 5.5 or 6.0. If you want to keep your historical and performance data (stats, events, tasks) along with configuration, inventory, and alarm data there is the option to also migrate that information. The vSphere 6.5 release of the Migration Tool provides granularity for historical and performance data selection.


Both embedded and external topologies are supported, but the Migration Tool will not allow you to change your topology during the migration process. If consolidation of your vSphere SSO domain is required, the topology change needs to be done before the migration.

SUMMARY:

  • Migration support for Windows vCenter 5.5 or 6.0 -> 6.5
  • Migrations for both embedded and external topologies
  • VUM included
  • Embedded and external Database support: MSSQL, MSSQL Express, Oracle
  • Option to select historical and performance data