VLAN Discovery Failed

While watching vmkernel.log you may see plenty of strange entries related to FCoE. That wouldn't be unusual if you actually used FCoE; if you don't, however, you could be a little curious or even worried, because the timeouts and "link down" entries aren't normal for most vSphere admins.

The problem can appear when you use certain converged network cards. In my case it was an HPE C7000 enclosure with Virtual Connect modules and QLogic 57840 adapters. It's a 10 Gb/s NIC that is also capable of FCoE and iSCSI offload; however, FCoE isn't used in any part of this infrastructure. Therefore the following entries looked a little strange to me:

<3>bnx2fc:vmhba32:0000:87:00.0: bnx2fc_vlan_disc_timeout:218 VLAN 1002 failed. Trying VLAN Discovery.
<3>bnx2fc:vmhba32:0000:87:00.0: bnx2fc_start_disc:3260 Entered bnx2fc_start_disc
<3>bnx2fc:vmhba32:0000:87:00.0: bnx2fc_vlan_disc_timeout:193 VLAN Discovery Failed. Trying default VLAN 1002
<6>host4: fip: link down.
<6>host4: libfc: Link down on port ( 0)
<3>bnx2fc:vmhba32:0000:87:00.0: bnx2fc_vlan_disc_cmpl:266 vmnic2: vlan_disc_cmpl: hba is on vlan_id 1002

Furthermore, I noticed two unfamiliar adapters listed in the HBA statistics, vmhba32 and vmhba33. What's more, they were listed with a different driver and with no traffic passed.

The name bnx2fc indicates that it belongs to my network card's driver family, which means the driver is loaded even if you do not use FCoE. The driver my card actually uses is bnx2x, but bnx2fc, bnx2i, bnx2 and cnic are also available and installed. I was determined to keep my vmkernel log as clean as possible, so I decided to turn FCoE off.
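Before removing anything, you can check which modules of this family are installed and actually loaded. A quick sketch (the grep patterns are just my guess at covering the whole family):

# esxcli software vib list | grep bnx2
# vmkload_mod -l | grep -E 'bnx2|cnic'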

After some investigation and testing I managed to do it and got rid of these rubbish entries.

To turn off FCoE when you do not use it, perform the following steps:

1. Remove the bnx2fc vib

        # esxcli software vib remove --vibname=scsi-bnx2fc

2. Go to /etc/rc.local.d and remove the script called 99bnx2fc.sh, which is responsible for loading the driver when the host boots.

3. Disable FCoE on all network cards involved:

     # esxcli fcoe nic disable -n vmnicX

4. Reboot the host and check that the errors are no longer present in the logs; a quick verification sketch follows below.
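After the reboot, a minimal verification could look like this (both commands should return nothing if the cleanup worked):

# vmkload_mod -l | grep bnx2fc
# grep bnx2fc /var/log/vmkernel.log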

 

According to the release notes, this problem should be resolved in driver version 2.713.10.v60.4; in my case, however, it wasn't.

ESXi host connection lost due to CDP/LLDP protocol

You may observe random, intermittent loss of connection to ESXi 6.0 hosts running on Dell servers (both rack and blade). It is caused by a bug related to the Cisco Discovery Protocol / Link Layer Discovery Protocol. It can also be seen while generating a VMware support log bundle, because these protocols are used during that process to include information about the network.

 

What are these protocols for? Both of them perform similar roles in the local area network: network devices use them to advertise their identity, capabilities and neighbors. The main difference is that CDP is a Cisco proprietary protocol while LLDP is vendor-neutral. There are also other, niche protocols such as Nortel Discovery Protocol, Foundry Discovery Protocol or Link Layer Topology Discovery.

CDP and LLDP are also supported by VMware virtual switches, which can therefore gather and display information about the physical switches. CDP is available for both standard and distributed switches, whilst LLDP is available only for distributed virtual switches, since vSphere 5.0.

[Image: Cisco Discovery Protocol information displayed at the vSwitch level.]
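The same neighbor information can also be queried from the host shell. As far as I know, vim-cmd exposes it as so-called network hints; treat this as a sketch, as the output may differ between ESXi versions:

# vim-cmd hostsvc/net/query_networkhint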

 
There is currently no resolution for this bug, but thanks to VMware Technical Support the workaround described below is available.

 

Turn off CDP for each vSwitch:

# esxcfg-vswitch -B down vSwitchX

You can also verify the current status of CDP using the following command:

# esxcfg-vswitch -b vSwitchX
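Alternatively, esxcli lists the CDP status of all standard vSwitches in one go; look for the CDP Status field in the output (assuming you use standard vSwitches only):

# esxcli network vswitch standard list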

This simple task resolves the random connection loss of the ESXi hosts. However, it does not solve the loss of connection during generation of a log bundle.

To confirm that the problem exists, you can simply run the following command:

# vm-support -w /vmfs/volumes/datastore_name

Even though we turned off CDP, ESXi still runs the LLDP discovery script (/sbin/lldpnetmap) during log bundle generation to gather information about the network topology.

To fix it, download the script called disablelldp2.py and perform the steps below:

  1. Copy the script to a datastore that is shared with all hosts,
  2. Open an SSH session to an ESXi host,
  3. Change to the location where you copied the script,
  4. Grant execute permission: # chmod 555 disablelldp2.py,
  5. Run the script: # ./disablelldp2.py,
  6. After the script is executed, go to /etc/rc.local.d and edit the local.sh file. It should look like this:

#!/bin/sh

# local configuration options

# Note: modify at your own risk!  If you do/use anything in this
# script that is not part of a stable API (relying on files to be in
# specific places, specific tools, specific output, etc) there is a
# possibility you will end up with a broken system after patching or
# upgrading.  Changes are not supported unless under direction of
# VMware support.

ORIGINAL_FILE=/sbin/lldpnetmap
MODIFIED_FILE=/sbin/lldpnetmap.original

# Replace the LLDP script with a harmless stub, keeping the original as a backup
if test -e "$MODIFIED_FILE"
then
    echo "$MODIFIED_FILE already exists."
else
    mv "$ORIGINAL_FILE" "$MODIFIED_FILE"
    echo "Omitting LLDP Script." > "$ORIGINAL_FILE"
    chmod 555 "$ORIGINAL_FILE"
fi
exit 0

  7. Restart the ESXi host and run the vm-support command again to confirm that the problem is solved.
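After the restart, you can quickly confirm that the stub is in place; the first command should list both files and the second should print only the placeholder line:

# ls -l /sbin/lldpnetmap /sbin/lldpnetmap.original
# cat /sbin/lldpnetmap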


Virtual SAN Storage performance tests in action

Virtual SAN provides Storage Performance Proactive Tests, which let you check the parameters of your environment in an easy way. You need just a few clicks to run a test.

Well, we can see plenty of test results from nested labs on the Internet; however, there are not so many real-world examples.

I decided to share some results from such a real environment, which consists of five ESXi hosts. Each is equipped with 1x 1.92 TB SSD and 5x 4 TB HDDs.

That is almost 10% of capacity reserved for cache on the SSD. VMware states that 10% is the minimum, so it shouldn't be too bad.
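A quick back-of-the-envelope check of the per-host cache-to-capacity ratio (plain shell arithmetic, runnable anywhere with awk):

# awk 'BEGIN { printf "%.1f%%\n", 100 * 1.92 / (5 * 4) }'
9.6%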

Since Virtual SAN 6.2, released with vSphere 6.0 Update 2, VMware has made performance testing much easier thanks to these Storage Performance Proactive Tests, available straight from the Web Client. They may not be the most sophisticated tools, but they are really easy to use.

To start a test, simply go to the Monitor > Virtual SAN > Proactive Tests tab at the VSAN cluster level and click the run test button (the green triangle).

As you will quickly realise, there are several kinds of tests:

  • Stress Test
  • Low Stress test
  • Basic Sanity Test, focus on Flash cache layer
  • Performance characterization – 100% Read, optimal RC usage
  • Performance characterization – 100% Write, optimal WB usage
  • Performance characterization – 100% Read, optimal RC usage after warm-up
  • Performance characterization – 70/30 read/write mix, realistic, optimal flash cache usage
  • Performance characterization – 70/30 read/write mix, realistic, High I/O Size, optimal flash cache usage
  • Performance characterization – 100% read, Low RC hit rate / All-Flash Demo
  • Performance characterization – 100% streaming reads
  • Performance characterization – 100% streaming writes

 

Let's start with a multicast performance test of our network. If the received bandwidth is below 75 MB/s, the remaining tests will fail.

[Image: Multicast performance test results]

Caution! VMware doesn't recommend running these tests on production environments, especially during business hours.

Test number 1 – Low stress test, duration set to 5 minutes.

[Image: Low stress test results]

As we can see, IOPS comes to around 40K for my cluster of five hosts. The average throughput is around 30 MB/s per host, which gives ca. 155 MB/s in total.

Test number 2 – Stress test, duration set to 5 minutes.

[Image: Stress test results]

Here we can see that my VSAN reached about 37K IOPS and almost 260 MB/s of throughput.

Test number 3 – Basic Sanity test, focus on Flash cache layer, duration set to 5 minutes.

[Image: Basic sanity test, focus on flash cache layer, results]

Test number 4 – 100% read, optimal RC usage, duration set to 5 minutes.

[Image: 100% read, optimal RC usage, results]

Here we can see how the SSD performs when most of the reads are served from the read cache.

Test number 5 – 100% read, optimal RC usage after warm-up.

[Image: 100% read, optimal RC usage after warm-up, results]

Test number 6 – 100% write, optimal WB usage.

[Image: 100% write, optimal WB usage, results]


If you have any real results from your own VSAN, I'd be glad to see them and compare different configurations.