While installing a new Dell EMC Isilon H400 cluster, I noticed node 1 in the chassis was acting up a bit. It let me go through the initial cluster creation wizard, but didn't run through all the steps and scripts afterwards. I left the node in that state while I installed another cluster, but after two hours or so, nothing had changed. With no other options, I pressed Ctrl+C: the screen became responsive again and eventually the node rebooted. However, it never finished that boot, instead halting at "/ifs not found". In the end, the node needed a reformat before it would function properly again.
Dell EMC Isilon Gen 6 nodes
Let's back up a bit first. The Isilon H400 node is part of the 6th generation of Isilon hardware. One of the most fundamental changes compared to previous generations is that Gen 6 Isilon nodes support an Ethernet back-end. Unfortunately, it's not possible to mix and match InfiniBand and Ethernet back-ends in one cluster. Since the existing cluster at this site uses InfiniBand, this gave us two options: install InfiniBand cards in the new Gen 6 nodes, or build a new Ethernet-based cluster and use SyncIQ to migrate the data. The former would only postpone the transition from InfiniBand to Ethernet by another 4-5 years and would be a bit more expensive. We went with the latter and built two new clusters for this site.
What went wrong in this case?
We installed the H400 chassis in the rack, added the Ethernet back-end switches and cabled everything up (back-end, front-end and power). The H400 nodes (there are four in a chassis) powered on automatically once power was applied.
When you connect the serial cable, it's normal to see nothing at all in your PuTTY window. You need to send a key press first to wake up the terminal. Usually I press Enter, and for a new Isilon cluster that should redisplay the four available options: Create a new cluster, Join an existing cluster, Exit Wizard, and Reboot into SmartLock Compliance mode. In this case, nothing happened.
Of course, I checked the serial settings: 115200 baud, 8 data bits, 1 stop bit, no parity, hardware flow control. All correct. Still no output, not even when pressing other keys. Eventually I pressed Ctrl+C and, voilà, output: the node continued booting. I spotted a message suggesting that syslogd wasn't functioning, or at least had hung while loading. This was a bit sketchy, but I could nevertheless create the cluster with the wizard.
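If you'd rather script this check than click through PuTTY, the sketch below shows the same serial settings and the two key presses involved. It is only an illustration under assumptions that are not from this post: it uses the third-party pyserial package, and the port name `/dev/ttyUSB0` and the helper names `wake_console` and `settings_summary` are hypothetical. Enter is sent as a carriage return (what PuTTY sends by default) and Ctrl+C as the ASCII ETX byte (0x03).

```python
"""Sketch: wake (or interrupt) a quiet Isilon serial console.

Assumptions (not from the post): pyserial is installed and the console
cable appears as /dev/ttyUSB0 (on Windows this would be a COM port).
"""

# Serial settings from the post: 115200 baud, 8 data bits, no parity,
# 1 stop bit, hardware (RTS/CTS) flow control.
CONSOLE_SETTINGS = {
    "baudrate": 115200,
    "bytesize": 8,
    "parity": "N",
    "stopbits": 1,
    "rtscts": True,
}

ENTER = b"\r"     # carriage return, what PuTTY sends for Enter by default
CTRL_C = b"\x03"  # ASCII ETX, the byte a terminal sends for Ctrl+C


def settings_summary(settings: dict) -> str:
    """Render the settings in the usual shorthand, e.g. '115200 8N1'."""
    return (f"{settings['baudrate']} "
            f"{settings['bytesize']}{settings['parity']}{settings['stopbits']}")


def wake_console(port: str = "/dev/ttyUSB0", interrupt: bool = False) -> bytes:
    """Send Enter (or Ctrl+C if interrupt=True) and return what the node prints."""
    import serial  # hypothetical dependency, imported lazily so the helpers above work without it

    with serial.Serial(port, timeout=2, **CONSOLE_SETTINGS) as con:
        con.write(CTRL_C if interrupt else ENTER)
        con.flush()  # wait until the byte has actually been sent
        # Waits up to the 2 s timeout; may return fewer bytes, or none at all.
        return con.read(4096)
```

For example, `wake_console()` should come back with the wizard menu on a healthy new node; on a node in the state described above it would likely return nothing until you retry with `interrupt=True`.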
Upon completion of the wizard, the system again started performing a few automatic steps. At some point during those steps, you should see the Self-Encrypting Drives (SEDs) being formatted. However, this node hung again around the syslogd step. I left it like that for two hours while I installed another cluster. After I pressed Ctrl+C again, we ended up with a nonfunctional Isilon node that wouldn't start the cluster, and the boot stopped with the following message:
IFS failed to mount. Aborting boot.
Please contact Isilon Customer Support at 1-877-2-ISILON.
1) Enter recovery shell
2) Continue booting
Choose option 1 to boot into the recovery shell. There, you can run the command "isi_reformat_node". It will ask you twice whether you are sure. Pretty obviously: you WILL lose all the data on that node. That said, you do not need any USB sticks or anything of the sort, as you are not re-imaging the Isilon node.
Reformatting node and destroying filesystem
Are you sure (yes/no) [no]? yes
This could destroy data
Are you really sure (yes/no) [no]? yes
Creating lkg from current root
Creating var lkg from current var
The node will disown all disks, format itself and reboot. The whole process takes about 5-10 minutes, after which you have a fresh node and can try again. In our case, this solved the issue. I reran the cluster creation wizard and this time was greeted with the SED format phase.
After about 15-20 minutes, the whole creation process was complete and the cluster was available. Adding the remaining three nodes after that was a piece of cake.