Isilon node loses network connectivity after reboot

In my previous post I described how to reformat an Isilon node if the cluster creation goes wrong for some reason. After we got our new Gen 6 clusters up and running, we ran into another peculiar issue: the Isilon nodes lose network connectivity after a reboot. If we then unplugged the network cable and moved it to a different port on the Isilon node, the network would come online again. Move the cable back to the original port: connectivity OK. Reboot the node: “no carrier” on the interface, and no connectivity.

The setup

We installed two new Isilon clusters that will replace two old Isilon clusters. Each node of the old clusters connects to the network with two 10GbE ports in an LACP port-channel. Because the old and new clusters will only run side by side temporarily, we borrowed one 10GbE uplink from each node in the old cluster and used it for the connectivity of the new cluster. Of course, we created new port-channels per node.

We also cleaned up the connectivity a bit: we only moved over the links that were on the secondary switch, and plugged these into the secondary 10GbE port on the Isilon node. This keeps the cabling nice and straightforward, which makes future maintenance and troubleshooting easier.

Isilon node loses network connectivity after a reboot

During the initial install I rebooted a few nodes for various reasons, and we noticed that an Isilon node loses network connectivity after a reboot. After plugging in a serial cable, we saw that the interface statistics showed a “no carrier” state for the 10GbE port. We verified the configuration on the switch: the vPC, port-channel and interface were configured correctly. The switch showed the interface as down, even though it was not administratively shut down. Also, the ports were not in error-disabled mode. Weird…
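When you are checking a handful of nodes over a serial console, it helps to pick out the affected ports quickly. Below is a minimal sketch that scans ifconfig-style text for interfaces reporting “no carrier”. It assumes FreeBSD-style output, where each interface block starts with a "name: flags=" line and contains a "status:" line. The interface names and the sample text are made up for illustration and are not captured from our nodes.

```python
import re

# Illustrative FreeBSD-style ifconfig output. The interface names
# and values are hypothetical, not taken from a real Isilon node.
SAMPLE = """\
ext-1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tmedia: Ethernet autoselect (10Gbase-SR <full-duplex>)
\tstatus: active
ext-2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tmedia: Ethernet autoselect
\tstatus: no carrier
"""


def interfaces_without_carrier(ifconfig_text):
    """Return the names of interfaces whose status line reads 'no carrier'."""
    current = None
    affected = []
    for line in ifconfig_text.splitlines():
        # A new interface block starts with "<name>: flags=..."
        header = re.match(r"^(\S+): flags=", line)
        if header:
            current = header.group(1)
        elif current and line.strip() == "status: no carrier":
            affected.append(current)
    return affected


print(interfaces_without_carrier(SAMPLE))
```

To use it for real, you would paste or pipe the actual ifconfig output of a node into the function instead of the sample string.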

What solved the problem temporarily was unplugging the fiber from the 10GbE port and moving it to port 1, which brought the link up. Moving it back to port 2 kept connectivity intact. But one more reboot and everything went belly up again. And both the network engineer and I agreed that manual cable-plugging isn’t really a feasible long-term strategy to restore connectivity.

We did find out that if we left the cable in 10GbE Port 1 (the bottom port), network connectivity would come back correctly after a reboot. So we had a workaround, just no fix and some unexplained behavior.

Prime suspect: the Isilon itself

It’s easy to blame the network 😉 However, after some additional troubleshooting, both the network engineer and I agreed that it would have to be a host (i.e. Isilon) issue. Lo and behold, I found Dell EMC KB 521890: Isilon nodes running two Intel ports in link aggregation can lose connectivity after boot or major network changes. The article points to a fix in OneFS 8.1.0.4. Our nodes (fresh from the factory) were running 8.1.0.3: a OneFS version that was swiftly recalled by Dell EMC and is not even available for download anymore.
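Checking whether a node is below the fixed release is easy to get wrong with a plain string comparison, so here is a small sketch that compares dotted version numbers field by field. The only facts taken from the KB article are the two versions mentioned above; the function names and the extra version in the loop are hypothetical and only there to illustrate the comparison.

```python
# Release that the KB article names as containing the fix.
FIXED_IN = "8.1.0.4"


def parse_version(text):
    """Turn a dotted version string such as '8.1.0.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def needs_upgrade(installed, fixed_in=FIXED_IN):
    """Return True when the installed version is older than the fixed release."""
    return parse_version(installed) < parse_version(fixed_in)


# "8.1.0.10" is a made-up version: as a string it would sort before
# "8.1.0.4", while the numeric comparison treats it as newer.
for version in ("8.1.0.3", "8.1.0.4", "8.1.0.10"):
    print(version, "upgrade needed" if needs_upgrade(version) else "ok")
```

Comparing tuples of integers avoids the classic pitfall where a two-digit field sorts before a one-digit field.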

Further analysis of the release notes showed two fixes in 8.1.0.4 that could be relevant to our problem.

After upgrading all the firmware and the OneFS version, we went back to the setup with all the 10GbE fibers in port 2. Another reboot and… everything still works. Yay!

My thoughts on new Isilon installs

Do not underestimate the importance of current software releases and firmware. We installed factory-fresh Isilon H400 nodes: all the drives and some components of the node needed new firmware. Plus, OneFS 8.1.0.4 offers a host of fixes as well. So always update firmware and software as part of your install procedure. Never assume the software on a new node is fine as it is.