In my previous post I described how to reformat an Isilon node if for some reason the cluster creation is defective. After we got our new Gen 6 clusters up and running, we ran into another peculiar issue: the Isilon nodes lose network connectivity after a reboot. If we then unplugged the network cable and moved it to a different port on the Isilon node, the network would come online again. Move the cable back to the original port: connectivity OK. Reboot the node: “no carrier” on the interface, and no connectivity.
While installing a new Dell EMC Isilon H400 cluster, I noticed node 1 in the chassis was acting up a bit. It allowed me to go through the initial cluster creation wizard, but didn’t run through all the steps and scripts afterwards. I left the node in that state while I installed another cluster, but after two hours or so, nothing had changed. With no other options, I pressed Ctrl + C: the screen became responsive again and eventually the node rebooted. However, it would never finish that boot, instead halting at “/ifs not found”. Eventually, it would need a reformat before it would function properly again…
Last month I performed an Isilon tech refresh of two clusters running NL400 nodes. In both clusters, the old NL400 36TB nodes were replaced with 72TB NL410 nodes with some SSD capacity. The first step in the whole process was the replacement of the InfiniBand switches. Since the clusters were fairly old, a OneFS upgrade was also on the list before the cluster could recognize the NL410 nodes. Dell EMC has extensive documentation on the whole OneFS upgrade process: check the support website, because there are a lot of version dependencies. Finally, everything was prepared and I could begin with the actual Isilon tech refresh: getting the new Isilon nodes up and running, moving the data and removing the old nodes.
If you’re remotely managing a Linux machine, you’ll probably use an SSH connection to run commands on that machine. There’s one problem with this approach: if you close the SSH connection, any long-running jobs/commands will halt. If you know a job will take a long time and you won’t be able to babysit the SSH connection, you can plan accordingly. But what if you underestimated the time a job will take, and you need to disconnect anyway? Here’s how to keep the job running AND make it home in time for dinner!
While upgrading OneFS it’s important to keep the InsightIQ software version compatible with the Isilon systems. In this case, InsightIQ hadn’t been updated for a while and I had to upgrade from 3.0 -> 3.1 -> 3.2 -> 4.x. The actual upgrade process isn’t too hard (it just takes a lot of time), but there’s one little prerequisite in the 3.1 -> 3.2 upgrade: a minimum of 502 MB free space in the root partition. As you can see in the screenshot, I wasn’t even close to the minimum requirement. I got to 357 MB, and that’s after cleaning up redundant stuff. Time to add some more disk space and extend the root partition!