ESXi and Likewise – troubleshooting guide – part 2

ESXi and Likewise – troubleshooting guide – part 2

In last part of this small series, we discussed theoretical background about components and technology related for adding ESX host to windows AD environment. Now it is time to describe troubleshooting options and some real life problems with solutions.

Let’s start from dividing all ESXi/Likewise issues into categories:

  1. Domain Join Failures

Here are most often reasons that an attempt to join a domain fails:

  • The user name or password of the account used to join the domain is incorrect.
  • The name of the domain is mistyped.
  • The name of the OU is mistyped.
  • The local hostname is invalid.
  • The domain controller is unreachable from the client because of a firewall or because the NTP service is not running on the domain controller.
  • Verify that the Name Server Can Find the Domain

# nslookup <AD Domain>

  • Make Sure the Client Can Reach the Domain Controller

verify that ESX host can reach the domain controller by pinging it.

  • Verify that Outbound Ports Are Open
  • Port 88 – Kerberos authentication
  • Port 123 – NTP
  • Port 135 – RPC
  • Port 137 – NetBIOS Name Service
  • Port 139 – NetBIOS Session Service (SMB)
  • Port 389 – LDAP
  • Port 445 – Microsoft-DS Active Directory, Windows shares (SMB over TCP)
  • Port 464 – Kerberos – change/password changes
  • Port 3268- Global Catalog search
  • Check DNS Connectivity

make sure the nameserver entry in /etc/resolv.conf contains the IP address of a DNS server that can resolve the name of the domain you are trying to join.

  • Make Sure nsswitch.conf Is Configured to Check DNS for Host Names

The /etc/nsswitch.conf file must contains the following line:

hosts: files dns

  • Ensure that DNS Queries Are Not Using the Wrong Network Interface Card

If the ESX host is multi-homed, the DNS queries might be going out the wrong network interface card. Temporarily disable all the NICs except for the card on the same subnet as your domain controller or DNS server and then test DNS lookups to the AD domain. If this works, re-enable all the NICs and edit the local or network routing tables so that the AD domain controllers are accessible from the host.

  • Determine Whether the DNS Server Is Configured to Return SRV Records

Your DNS server must be set to return SRV records so the domain controller can be located. It is common for non-Windows (bind) DNS servers to not be configured to return SRV records.

Diagnose by executing the following command:

nslookup -q=srv _ldap._tcp. ADdomainToJoin.com

  • Make Sure that the Global Catalog Is Accessible

The global catalog for Active Directory must be accessible. Diagnose by executing the following command:

nslookup -q=srv _ldap._tcp.gc._msdcs. ADrootDomain.com

From the list of IP addresses in the results, choose one or more addresses and test whether they are accessible on Port 3268 by using telnet.

  • Verify that the Client Can Connect to the Domain on Port 123

Windows time service must be running on the domain controller.

On a Linux computer, run the following command as root:

ntpdate -d -u DC_hostname

  1. Log-in/Authentication issues
  • Make Sure You Are Joined to the Domain

Check ‘lw-lsa get-status’

  • Clear the Cache

Clear the cache to ensure that the client computer recognizes the user’s ID.

# ad-cache –delete-all

Clear the Likewise Kerberos cache to make sure there is not an issue. Execute the following command at the shell prompt with the user account that you are troubleshooting:

~#kdestroy

  • Check the Status of the Likewise Authentication Daemon

#/etc/init.d/lsassd status

  • Check Communication between the Likewise Daemon and AD

verify that the you can ping DC from ESX host.

  • Make Sure the AD Authentication Provider Is Running

# lw-lsa get-status

If the result will not include the AD authentication provider or will indicate that it is offline restart the authentication daemon

  • Check whether you can log on with SSH by executing the following command:

ssh DOMAIN\\username@localhost

  1. Lsassd crash due to various reasons such as during trust enumeration etc.
  • analyze the lsassd,netlogond,lwiod logs, see where exactly where likewise daemon is crashing.
  • look into the hostd logs and tcpdump to get more info
  1. Kerberos related issues
  • start to look into the packet capture (both sites esxi and ad) to see if we’re getting proper TGT and TGS.

//can be related to Kerberos cache so in this case empty the Kerberos cache using mentioned  ‘kdestory’ command.

  1. Hostd crash in Likewise code
  • Gather full log bundle and engage VMware GSS
  1. Windows AD server related issues
  • Gather guest OS logs and engage MS Support.

Ok., so now we have in one place all troubleshooting options and methodology, now it is time for real life story experience based on one of my last service requests: Customer is unable to log in using Active Directory credentials. It shows invalid credentials even though “Authentication Services” shows that host is joined into domain correct domain. The issue is seen on most of the hosts within the environment. Only 2 hosts do not suffer from the problem – cannot find any difference in configuration. Customer running latest 6.0 build: 4600944

Some other symptoms observed during troubleshooting issue step by step:

  1. Tried to disjoin server outside the domain using vSphere Client GUI on the host connected to vCenter – host stops responding unless we restart hostd. Restarting all management agents hangs on likewise agent for an infinite time.
  2. Unable to stop Active Directory Service – server not responding. After restarting hostd, or entire host – server back to normal operational state
  3. Change Active Directory Service to not start with the host -> restart ESXi – works
  4. Check auth type – now ESXi states that it is Local Authentication (so after all the restarts, finallly ESXi left the domain)
  5. Add host once again to the domain – host stops responding. Restart hostd – works fine
  6. Check auth type – ESXi states that he is joined to domain.
  7. Try to add permissions to the domain users – unable to select domain to assign permissions
  8. From AD perspective – ESXi account is refreshed

Troubleshooting Action Taken

===============

  1. Verify if likewise agents is up and running (It is)
  2. Restart likewise agent on the hosts (no impact on issue)
  3. Add advanced setting UserVars.ActiveDirectoryPreferredDomainControllers as per KB https://kb.vmware.com/kb/2107385 – Didn’t help
  4. To exclude any firewall issues blocking Domain controller traffic: ~# esxcli network firewall unload and retry login with domain account- Didnt help
  5. Increased likewise agent logging to debug and:
  6. a) Re-try domain authentication to see log entries
  7. b) Tried to leave -> rejoin domain using CLI (leave succesful, rejoin causes host to hang again unless we reboot host)
  8. Verify known issues in 6.0 related to authentication with AD – issues resolved in 6.0U1, while customer using latest patch

 

Log Analysis

  1. Trying to stop LWSMD using SSH

[~] /etc/init.d/lwsmd stop

watchdog-lwsmd: Terminating watchdog process with PID 36150 Stopping Likewise Service Manager [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] …failed

 

  1. Retry domain authentication with debug likewise logging (authentication does not succeed):

20161208115138:DEBUG:LwKrb5SetThreadDefaultCachePath():lwkrb5.c:410: Switched gss krb5 credentials path from <null> to FILE:/etc/likewise/lib/krb5cc_lsass.XXX.COM

20161208115138:DEBUG:lsass:MemCacheFindGroupByName():memcache.c:1081: Error code: 40017 (symbol: LW_ERROR_NOT_HANDLED)

20161208115138:DEBUG:lsass:LsaSrvFindProviderByName():state.c:128: Error code: 40040 (symbol: LW_ERROR_INVALID_AUTH_PROVIDER)

20161208115138:DEBUG:lsass:LsaSrvProviderServicesDomain():provider.c:151: Error code: 40040 (symbol: LW_ERROR_INVALID_AUTH_PROVIDER)

20161208115138:VERBOSE:lsass:LsaAdBatchMarshal():batch_marshal.c:525: Did not find object by NT4 name ‘ESX Admins’

20161208115138:DEBUG:lsass:LsaAdBatchFindSingleObject():batch.c:1388: Error code: 40071 (symbol: LW_ERROR_NO_SUCH_OBJECT)

20161208115138:DEBUG:lsass:AD_FindObjectByNameTypeNoCache():online.c:3519: Error code: 40071 (symbol: LW_ERROR_NO_SUCH_OBJECT)

20161208115138:DEBUG:lsass:AD_OnlineFindObjectByName():online.c:4129: Error code: 40012 (symbol: LW_ERROR_NO_SUCH_GROUP)

20161208115138:DEBUG:lsass:LsaSrvFindGroupAndExpandedMembers():api2.c:1626: Error code: 40012 (symbol: LW_ERROR_NO_SUCH_GROUP)

20161208115338:VERBOSE:lsass:LsaSrvIpcCheckPermissions():ipc_state.c:79: Permission granted for (uid = 0, gid = 0, pid = 169257) to open LsaIpcServer

20161208115338:VERBOSE:lsass-ipc:lwmsg_peer_log_accept():peer-task.c:271: (session:f09bcf7743520e1d-b414124c53159168) Accepted association 0x1f0e5be8

20161208115338:DEBUG:LwKrb5SetThreadDefaultCachePath():lwkrb5.c:410: Switched gss krb5 credentials path from <null> to FILE:/etc/likewise/lib/krb5cc_lsass. 1. Trying to stop LWSMD using SSH

[root@plpa2ex19irvm:~] /etc/init.d/lwsmd stop

watchdog-lwsmd: Terminating watchdog process with PID 36150 Stopping Likewise Service Manager [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] [failed to release memory reservation ] …failed

 

  1. Retry domain authentication with debug likewise logging (authentication does not succeed):

20161208115138:DEBUG:LwKrb5SetThreadDefaultCachePath():lwkrb5.c:410: Switched gss krb5 credentials path from <null> to FILE:/etc/likewise/lib/krb5cc_lsass.XXX.COM

20161208115138:DEBUG:lsass:MemCacheFindGroupByName():memcache.c:1081: Error code: 40017 (symbol: LW_ERROR_NOT_HANDLED)

20161208115138:DEBUG:lsass:LsaSrvFindProviderByName():state.c:128: Error code: 40040 (symbol: LW_ERROR_INVALID_AUTH_PROVIDER)

20161208115138:DEBUG:lsass:LsaSrvProviderServicesDomain():provider.c:151: Error code: 40040 (symbol: LW_ERROR_INVALID_AUTH_PROVIDER)

20161208115138:VERBOSE:lsass:LsaAdBatchMarshal():batch_marshal.c:525: Did not find object by NT4 name ‘ESX Admins’

20161208115138:DEBUG:lsass:LsaAdBatchFindSingleObject():batch.c:1388: Error code: 40071 (symbol: LW_ERROR_NO_SUCH_OBJECT)

20161208115138:DEBUG:lsass:AD_FindObjectByNameTypeNoCache():online.c:3519: Error code: 40071 (symbol: LW_ERROR_NO_SUCH_OBJECT)

20161208115138:DEBUG:lsass:AD_OnlineFindObjectByName():online.c:4129: Error code: 40012 (symbol: LW_ERROR_NO_SUCH_GROUP)

20161208115138:DEBUG:lsass:LsaSrvFindGroupAndExpandedMembers():api2.c:1626: Error code: 40012 (symbol: LW_ERROR_NO_SUCH_GROUP)

20161208115338:VERBOSE:lsass:LsaSrvIpcCheckPermissions():ipc_state.c:79: Permission granted for (uid = 0, gid = 0, pid = 169257) to open LsaIpcServer

20161208115338:VERBOSE:lsass-ipc:lwmsg_peer_log_accept():peer-task.c:271: (session:f09bcf7743520e1d-b414124c53159168) Accepted association 0x1f0e5be8

20161208115338:DEBUG:LwKrb5SetThreadDefaultCachePath():lwkrb5.c:410: Switched gss krb5 credentials path from <null> to FILE:/etc/likewise/lib/krb5cc_lsass.XXX.COM

20161208115338:INFO:netlogon:LWNetSrvGetDCName():dcinfo.c:97: Looking for a DC in domain ‘XXX’, site ‘<null>’ with flags 100

20161208115338:INFO:netlogon:LWNetSrvGetDCName():dcinfo.c:97: Looking for a DC in domain ‘XXX.com’, site ‘<null>’ with flags 100

20161208115338:INFO:netlogon:LWNetSrvGetDCName():dcinfo.c:97: Looking for a DC in domain ‘XXX.com’, site ‘<null>’ with flags 140

20161208115338:DEBUG:netlogon:LWNetCacheDbQuery():lwnet-cachedb.c:1079: Cached entry not found: XXX.com, , 1

20161208115338:DEBUG:netlogon:LWNetSrvGetDCName():dcinfo.c:128: Error at ../netlogon/server/api/dcinfo.c:128 [code: 1355]

20161208115338:DEBUG:netlogon:LWNetTransactGetDCName():ipc_client.c:249: Error at ../netlogon/client/ipc_client.c:249 [code: 1355]

20161208115338:DEBUG:netlogon:LWNetGetDCNameExt():dcinfo.c:133: Error at ../netlogon/client/dcinfo.c:133 [code: 1355]

 

  1. Try to rejoin domain (which causes host to hang in the end):

20161214123838:VERBOSE:lsass:LsaSrvIpcCheckPermissions():ipc_state.c:79: Permission granted for (uid = 0, gid = 0, pid = 39070) to open LsaIpcServer

20161214123838:VERBOSE:lsass-ipc:lwmsg_peer_log_accept():peer-task.c:271: (session:6b1bb0e33d95252a-e893c9a774c67d8e) Accepted association 0x1f07fe00

20161214123838:VERBOSE:lwreg:RegDbOpenKey():sqldb.c:1032: Registry::sqldb.c RegDbOpenKey() finished

20161214123838:DEBUG:lwreg:RegDbGetKeyValue_inlock():sqldb_p.c:1227: Error at ../lwreg/server/providers/sqlite/sqldb_p.c:1227 [status: LW_STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034 (-1073741772)]

20161214123838:DEBUG:lwreg:RegDbGetValueAttributes_inlock():sqldb_schema.c:846: Error at ../lwreg/server/providers/sqlite/sqldb_schema.c:846 [status: LW_STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034 (-1073741772)]

20161214123838:VERBOSE:lwreg:SqliteGetValueAttributes_Internal():regschema.c:360: Registry::sqldb.c SqliteGetValueAttributes_Internal() finished

20161214123838:DEBUG:lwreg:SqliteGetValue():sqliteapi.c:887: Error at ../lwreg/server/providers/sqlite/sqliteapi.c:887 [status: LW_STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034 (-1073741772)]

20161214123838:DEBUG:lwreg:RegTransactGetValueW():clientipc.c:810: Error at ../lwreg/client/clientipc.c:810 [status: LW_STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034 (-1073741772)]

20161214123838:DEBUG:lwreg:LwNtRegGetValueA():regntclient.c:801: Error at ../lwreg/client/regntclient.c:801 [status: LW_STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034 (-1073741772)]

20161214123838:DEBUG:lwreg:RegShellUtilGetValue():rsutils.c:1463: Error at ../lwreg/shellutil/rsutils.c:1463 [code: 40700]

20161214123838:DEBUG:LwpsLegacyGetDefaultJoinedDomain():lsapstore-backend-legacy-internal.c:711: -> 0 (ERROR_SUCCESS) (EE = 685)

20161214123838:DEBUG:LsaPstoreGetPasswordInfoW():lsapstore-main.c:109: -> 2692 (NERR_SetupNotJoined) (EE = 80)

20161214123838:DEBUG:LsaPstoreGetPasswordInfoA():lsapstore-main-a.c:89: -> 2692 (NERR_SetupNotJoined) (EE = 71)

20161214123838:DEBUG:lsass:AD_GetMachineAccountInfoA():machinepwdinfo.c:91: Error code: 2692 (symbol: NERR_SetupNotJoined)

20161214123838:DEBUG:lsass:AD_IoctlGetMachineAccount():ioctl.c:102: Error code: 2692 (symbol: NERR_SetupNotJoined)

20161214123838:DEBUG:lsass:AD_ProviderIoControl():provider-main.c:4377: Error code: 2692 (symbol: NERR_SetupNotJoined)

20161214123838:DEBUG:lsass:LsaSrvProviderIoControl():provider.c:99: Error code: 2692 (symbol: NERR_SetupNotJoined)

20161208115338:INFO:netlogon:LWNetSrvGetDCName():dcinfo.c:97: Looking for a DC in domain ‘XXX.com’, site ‘<null>’ with flags 140

20161208115338:DEBUG:netlogon:LWNetCacheDbQuery():lwnet-cachedb.c:1079: Cached entry not found: XXX.com, , 1

20161208115338:DEBUG:netlogon:LWNetSrvGetDCName():dcinfo.c:128: Error at ../netlogon/server/api/dcinfo.c:128 [code: 1355]

20161208115338:DEBUG:netlogon:LWNetTransactGetDCName():ipc_client.c:249: Error at ../netlogon/client/ipc_client.c:249 [code: 1355]

At this stage we decide to gather network packets and analyze communication between esxi nad DC, time show that this was a good direction:

//packet capture methodology

  • eneble likewise loging:

/etc/init.d/lwsmd start

/usr/lib/vmware/likewise/bin/lwsm set-log-level trace /usr/lib/vmware/likewise/bin/lwsm set-log file /var/log/likewise.log tail -f /var/log/likewise.log

  • start tcp dump

tcpdump-uw -i 1 -n -s0 not tcp port 22 -C 50M -W 5 -w /var/log/capture10.pcap -vvv

 

  • add ESXi to domain from cli to capture comunication flow:

/usr/lib/vmware/likewise/bin/domainjoin-cli –loglevel verbose –logfile

join xxx.com plp24308
esxi and likewise2

We foud that on problematic ESXi hosts IPv6 communication was disabled but DC still using IPv6 in communication after couple test we confirm that after enabling IPv6 on ESXi or totally disabling it at   DC site:

https://support.microsoft.com/en-us/help/929852/how-to-disable-ipv6-or-its-components-in-windows

finally, there is no error with adding a host to the domain and DC authentication.

To clear more this whole situation we decided to perform additional investigation with VMware Support. GSS confirmed that they located the issue:

“…with the newer versions (vSphere 6) of ESXi in case it receives kdc in IPv6 format. In that situation the host will try to connect with IPv6. In case host has IPv6 disabled it will fail to join the domain “

//Bug is planned to be fixed on vSphere6.5U1

One thought on “ESXi and Likewise – troubleshooting guide – part 2

  1. Hello! I could have sworn I’ve been to this blog before but after reading through some of the post I realized it’s
    new to me. Anyways, I’m definitely glad I found
    it and I’ll be book-marking and checking back often!

Leave a Reply

Your email address will not be published. Required fields are marked *