VNX Unified standby data mover interface suspended after Cisco NX-OS upgrade

We’re in the midst of a VCE vBlock 340 software upgrade. Part of this upgrade process is upgrading the Cisco Nexus 5K switches that connect the blades and storage to the customer network. After upgrading the first switch, we noticed that it had suspended the interface to the VNX Unified standby data mover (server_3) with a “no LACP PDUs” error message. A quick check on the switch that wasn’t upgraded yet showed that interface to be online. So what’s up with that?

VNX Unified data movers

A quick recap on the VNX Unified NAS functionality. The base VNX is a block-only system with FC and iSCSI connectivity. If you also want to provide NAS/file services (e.g. NFS and SMB (not CIFS!)) from a VNX, you buy a VNX Unified system. This adds one or two Control Stations (for management purposes) and two or more data movers, the actual workhorses that present the shares/exports to clients.

This customer is running the minimum data mover configuration of two data movers: one active, one standby. The active data mover is providing the NAS services, the standby data mover is… well… standby.

As soon as a failure takes place, the standby data mover takes over the following identities from the active data mover:

  • Network identity — IP and MAC addresses of all its NICs
  • Storage identity — File systems controlled by the faulted Data Mover
  • Service identity — Shares and exports controlled by the faulted Data Mover

This whole process takes a few seconds. It also means that the standby data mover should have the same network connectivity as the primary, since it assumes the identity of the primary blade. Finally, the standby data mover will not behave like a ‘normal’ server while it’s not active: it’s in a hot standby mode, waiting.

So if your primary data mover uses an LACP configuration, the switch ports for the standby data mover also need to be configured in an LACP port channel. The only difference is that the standby data mover will not send any LACP PDUs yet, since it’s waiting to take over the network identity…

NX-OS upgrade 5.2(1)N1(9a) to 7.1(4)N1(1)

The upgrade we had to perform was from NX-OS 5.2(1)N1(9a) to 7.1(4)N1(1). Quite a big leap, and you’ll need to do an intermediate upgrade step to 7.0(8)N1(1) to avoid hitting Cisco Bug CSCuq94445 (ISSU upgrade is not supported in 5.2(1)N1(9a) and earlier versions up to 7.1(4)N1(1)). If you don’t use the intermediate upgrade step (*cough*), your switch will lose the FC configuration, which needs the following steps to recover FC functionality (VCE KB 000005024):

  1. Change the Ethernet ports to FC as per the backup configuration, save the configuration, and reload the module (if an expansion module is used) or the switch (if fixed ports are used).
  2. Configure FC and the san-port-channel from the backup configuration taken prior to the upgrade.
  3. Bind the interfaces to the vFC interfaces as per the backup configuration.
  4. Add the ports to the VSAN database.

Assuming you survived the upgrade, you’ll now see the standby data mover interface suspended, and the corresponding port-channel also down:

Ethernet1/16 is down (suspended(no LACP PDUs))

port-channel203 is down (No operational members)
vPC Status: Down, vPC number: 203 [packets forwarded via vPC peer-link]
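If you have more than a handful of ports to check, you can scan saved show interface output for this state instead of eyeballing it. The snippet below is a minimal sketch in Python, assuming the status lines look exactly like the ones above; the function names and the SAMPLE text are my own for illustration and not part of any Cisco tooling.

```python
import re

# Matches status lines such as
#   "Ethernet1/16 is down (suspended(no LACP PDUs))"
#   "Ethernet1/16 is up"
# Assumption: the line format is the one shown in this post.
STATUS_RE = re.compile(
    r"^(?P<name>\S+) is (?P<state>up|down)(?: \((?P<reason>.*)\))?\s*$"
)

# Sample text copied from the output in this post.
SAMPLE = """Ethernet1/16 is down (suspended(no LACP PDUs))
port-channel203 is down (No operational members)
vPC Status: Down, vPC number: 203 [packets forwarded via vPC peer-link]
"""


def parse_status(output):
    """Return {interface: (state, reason)} for every status line found."""
    results = {}
    for line in output.splitlines():
        match = STATUS_RE.match(line.strip())
        if match:
            results[match.group("name")] = (
                match.group("state"),
                match.group("reason"),
            )
    return results


def suspended_no_pdus(output):
    """Return the sorted names of interfaces suspended for missing LACP PDUs."""
    return sorted(
        name
        for name, (state, reason) in parse_status(output).items()
        if state == "down" and reason and "no LACP PDUs" in reason
    )


if __name__ == "__main__":
    for name in suspended_no_pdus(SAMPLE):
        print(name)
```

Lines that are not interface status lines, such as the vPC status line, are simply skipped.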

This is due to an NX-OS behavior change: LACP now puts a port into the suspended state instead of the individual state when it does not receive LACP PDUs from the other end of the link in the port channel. It was requested in enhancement request CSCut55084 and has been implemented since NX-OS release 7.1(3)N1(1). It’s an overall improvement, since it prevents some of the loops caused by misconfigurations, but it may cause issues when the standby data mover wants to take over.

So, to fix this, let’s restore the old interface behavior with the no lacp suspend-individual command, which you enter on the port channel. The port channel needs to be shut down before you can change this parameter, but since this is the standby data mover and its port channel is down anyway, that’s no problem. Run the following commands:

conf t
interface port-channel 203
shutdown
no lacp suspend-individual
no shutdown
copy running-config startup-config

Of course, substitute the port channel number with your own.
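If you need to repeat this for several port channels, generating the command list saves some typing and typos. This is a rough sketch in Python, not an official tool: the function name is made up, port channel 204 in the example is hypothetical, and the explicit shutdown line is my assumption based on the requirement that the port channel is shut down before the parameter can be changed. Review the output before pasting it into a switch.

```python
def lacp_individual_commands(port_channels, save=True):
    """Build the NX-OS command sequence for the given port channel numbers."""
    commands = ["conf t"]
    for number in port_channels:
        commands += [
            f"interface port-channel {number}",
            # Assumption: shut the port channel explicitly before the change.
            "shutdown",
            "no lacp suspend-individual",
            "no shutdown",
        ]
    if save:
        # Persist the change across reloads.
        commands.append("copy running-config startup-config")
    return commands


if __name__ == "__main__":
    # 203 is the port channel from this post; 204 is a hypothetical second one.
    print("\n".join(lacp_individual_commands([203, 204])))
```

Keep in mind that the shutdown takes the port channel offline, so only feed it port channels you are allowed to interrupt.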

Resulting port-channel config to prevent the individual interface from going suspended

You should no longer have a suspended interface. The port channel will still be down, but that’s normal behavior for the standby data mover:

N5K-2# sh int eth 1/16
Ethernet1/16 is up

N5K-2# sh int po 203
port-channel203 is down (No operational members)
vPC Status: Down, vPC number: 203 [packets forwarded via vPC peer-link]

It’s also a good idea to run the same command on the port channel(s) for the primary data mover, but you’ll need a maintenance window to do this since you need to take the port channel offline. Good luck!