Safely replace an Isilon InfiniBand Switch with these steps

Every once in a while you might need to replace an Isilon InfiniBand switch: perhaps because the switch is broken, because you need more ports, or because the old switch is simply too old. Good news: it’s a fairly straightforward job. And if your cluster has two switches, you can replace them one at a time without an outage.

Check the Isilon InfiniBand switch connectivity

If you took some time to cable everything properly during the initial installation, this should be an easy step. All int-a InfiniBand cables should go to one switch, and all int-b cables to the other. Check to make sure.

Next up, we need to check the logical connection. In a hurry you could check for blinking lights, but that’s not a proper test. First, log in to any node in the cluster using SSH/PuTTY and run the following command to find the internal IP ranges:

/usr/bin/isi config

You should get three IP ranges: int-a, int-b, and failover.

[Screenshot: Isilon InfiniBand IP ranges]

Next, ping the node IP addresses. If you know there are 4 nodes (as in this example), with the first one at x.x.x.1 and the last one at x.x.x.4, run the following command for the int-a range:

for i in {1..4}; do ping -c 1 x.x.x.${i};done

The output should look similar to this:

[Screenshot: Isilon InfiniBand ping output]

Repeat the same command for the int-b range.
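If you would rather not eyeball the ping output, the loop above can be wrapped in a small helper that prints one OK/FAIL line per node and returns a non-zero status when any node stays silent. This is only a sketch: `ping_once` and `check_range` are hypothetical names made up for this example (they are not OneFS commands), and it assumes the node addresses are consecutive, as in the 4-node example.

```shell
# Hypothetical helper, not part of OneFS.
# Send a single ping and succeed only if a reply comes back.
ping_once() {
    ping -c 1 "$1" >/dev/null 2>&1
}

# check_range PREFIX FIRST LAST
# Pings PREFIX.FIRST through PREFIX.LAST and prints OK or FAIL per address.
# Returns 0 only if every address replied.
check_range() {
    prefix=$1
    first=$2
    last=$3
    failed=0
    i=$first
    while [ "$i" -le "$last" ]; do
        if ping_once "${prefix}.${i}"; then
            echo "OK   ${prefix}.${i}"
        else
            echo "FAIL ${prefix}.${i}"
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    [ "$failed" -eq 0 ]
}

# Usage (replace x.x.x with the int-a prefix, then again with the int-b prefix):
#   check_range x.x.x 1 4
```

Paste the two functions into your SSH session, then call `check_range` once per range. A FAIL line tells you exactly which node to look at before you touch a switch.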

Replacing the Isilon InfiniBand switch

Make sure you know where all the cables on the old switch go (labeling might help here). Unplug the power cable from the old switch. It doesn’t hurt to keep one eye on the OneFS GUI to see if the Isilon cluster copes with the connectivity loss well…

Unplug the InfiniBand cables from the old switch. Unscrew and remove the old switch. Make sure to mark it as OLD in some way, especially if you’re doing a lot of swaps with similar switches.

Install the new switch. Plug the InfiniBand cables back in, then plug in the power cable to power it up. The switch boots quickly; the ports should usually be online in less than 10 seconds. Done!

Next, check the OneFS GUI again. The ‘redundancy/connectivity loss’ errors should disappear/clear (give it a few minutes). Acknowledge/clear where necessary to clean up your events log.

Next, run the above CLI commands again to ping the IP ranges. If all is well, you can move on to the next switch. Happy swapping!
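Since the new switch needs a moment to bring its ports up, you may prefer to let the shell retry the pings for you instead of rerunning them by hand. The sketch below polls a range until every node replies or a retry limit is reached. `ping_once` and `wait_for_range` are hypothetical helper names for this example, and the defaults of 30 attempts with a 10-second pause are arbitrary assumptions, not values recommended by the vendor.

```shell
# Hypothetical helper, not part of OneFS.
# Send a single ping and succeed only if a reply comes back.
ping_once() {
    ping -c 1 "$1" >/dev/null 2>&1
}

# wait_for_range PREFIX FIRST LAST [MAX_TRIES] [DELAY_SECONDS]
# Repeatedly pings PREFIX.FIRST through PREFIX.LAST.
# Returns 0 as soon as all addresses reply, 1 if MAX_TRIES rounds pass first.
wait_for_range() {
    prefix=$1
    first=$2
    last=$3
    max_tries=${4:-30}   # assumed default: 30 rounds
    delay=${5:-10}       # assumed default: 10 seconds between rounds
    try=1
    while [ "$try" -le "$max_tries" ]; do
        missing=""
        i=$first
        while [ "$i" -le "$last" ]; do
            ping_once "${prefix}.${i}" || missing="${missing} ${prefix}.${i}"
            i=$((i + 1))
        done
        if [ -z "$missing" ]; then
            echo "all nodes reachable after ${try} attempt(s)"
            return 0
        fi
        echo "attempt ${try}: no reply from${missing}"
        try=$((try + 1))
        if [ "$try" -le "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage (replace x.x.x with the prefix of the range served by the new switch):
#   wait_for_range x.x.x 1 4
```

Only move on to the second switch once this returns successfully for the range on the switch you just replaced, and the other range still answers as well.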