About 7 or 8 months ago we installed two 3-node Isilon clusters at a hospital destined to host PACS data. Since then it’s all quiet on the Isilon front: no hardware failures, no performance problems. No complaints there of course, but also.. slightly.. boring. FINALLY, a couple of days ago the Isilon sent us an email. “Device disconnected. Boot mirror is critical. Unhealthy Isilon boot disk. Mirror is degraded.” That kind of stuff. Woohoo, ACTION!
These are not the Isilon boot disks you’re looking for…
Upon opening the Isilon GUI, the node featured a yellow warning light. Drilling down to the hardware and the disks, all disks showed green and healthy… wait, what?
A quick ECN / Support search later, it turned out that the Isilon OS is not loaded on the disk drives themselves (in this case 36 1TB drives in an NL400 node), but on two dedicated 8GB SSD disks inside the node itself. Don’t think 2.5″ form factor; think SSD-chip-on-a-PCB-with-a-SATA-connector. Quite clever: this means there’s no lost capacity or performance limitation on your “normal” drives (remember CLARiiON / VNX, where you can’t fully utilize your vault drive performance?). But it also means it’s not a customer replaceable unit (CRU): an EMC CE needs to come in. A quick live chat later and that was arranged…
Replacement is pretty straightforward: diagnose which disk has failed, shut down the node, use the procedure guide to determine which slot holds the failed disk, replace the Isilon boot disk, boot the node and you’re pretty much done.
Let me reiterate the fact that this procedure needs to be performed by an EMC customer engineer. Do not attempt these steps yourself, since if you pull out the wrong disk you’re looking at a node failure and possible data loss.
The errors indicated disk ad7 failed. Listing the drives is done with the “atacontrol list” command:
node2# atacontrol list
[..]
ATA channel 2:
    Master:  ad4 <SanDisk SSD P4 8GB/SSD 8.10> Serial ATA v1.0 II
    Slave:       no device present
ATA channel 3:
    Master:      no device present
    Slave:   ad7 <SanDisk SSD P4 8GB/SSD 8.10> Serial ATA v1.0 II
[..]
This doesn’t really help: both drives were still visible, so we still had no way to double-check which drive had failed. Use the “gmirror status” command:
node2# gmirror status
                 Name    Status  Components
         mirror/root0  COMPLETE  ad4p4 ad7p4
     mirror/var-crash  COMPLETE  ad4p11
           mirror/mfg  COMPLETE  ad4p10 ad7p9
mirror/journal-backup  COMPLETE  ad4p8 ad7p8
          mirror/var1  COMPLETE  ad4p7 ad7p7
          mirror/var0  COMPLETE  ad4p6 ad7p6
         mirror/root1  COMPLETE  ad4p5 ad7p5
In this example all partitions and mirrors on the boot disks are healthy. If something is wrong, the status will be DEGRADED and the unhealthy component will be missing from the list: for example, mirror/root0 would show DEGRADED and list only ad4p4, which means ad7 is broken.
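That interpretation rule can be automated. Here’s a small sketch that flags any mirror that isn’t COMPLETE and reports which component survived; the DEGRADED sample below is hypothetical (it mimics what a failed ad7 would look like, with ad7p4 missing from mirror/root0), and it assumes components are listed on one line as shown above:

```shell
#!/bin/sh
# Sketch: spot unhealthy mirrors in "gmirror status"-style output.
# The DEGRADED sample is hypothetical -- in practice you'd pipe in
# the live output of "gmirror status" instead.
sample_status() {
cat <<'EOF'
                 Name    Status  Components
         mirror/root0  DEGRADED  ad4p4
         mirror/root1  COMPLETE  ad4p5 ad7p5
EOF
}

sample_status | awk '
NR > 1 && NF >= 3 && $2 != "COMPLETE" {
    printf "%s is %s; surviving component(s):", $1, $2
    for (i = 3; i <= NF; i++) printf " %s", $i
    print ""
}'
```

On the hypothetical sample this prints that mirror/root0 is DEGRADED with only ad4p4 surviving, pointing at ad7 as the failed boot disk.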
Now use the procedure guide to locate which slot the Isilon boot disk is in; shut down and open the node, swap the disks, put the node back together and boot. Verify that the boot disk is detected correctly and the mirrors are healthy, then clear the errors. Job done!
Or is it?
Once we powered the node back on, the front panel would light up but the LCD panel itself was inoperable. Swapping front panels between nodes pinpointed the problem as node-related rather than a faulty front panel. I recalled there’s a service called isi_lcd_d; a quick process check revealed it wasn’t running. Starting it back up resulted in a couple of errors in /var/log/messages (“could not detect LCD device”). Starting it a second time fixed that though, spawning three isi_lcd_d threads (server, manager, buttons). And voilà, front panel working again.
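The LCD fix above boils down to a check-and-restart. A minimal sketch, assuming a bare invocation is enough to start the daemon (the name isi_lcd_d is from this post; note that in our case it took a second attempt to come up cleanly):

```shell
#!/bin/sh
# Sketch: check for the front-panel LCD daemon and start it if it's gone.
# Invoking "isi_lcd_d" directly to start it is an assumption -- check the
# exact start procedure on the node.
ensure_lcd_daemon() {
    if pgrep -x isi_lcd_d > /dev/null 2>&1; then
        echo "isi_lcd_d already running"
    else
        echo "isi_lcd_d not running; starting it"
        isi_lcd_d 2>/dev/null \
            || echo "start failed; check /var/log/messages and try once more"
    fi
}
ensure_lcd_daemon
```

After a successful start, a process check should show the three isi_lcd_d threads (server, manager, buttons) mentioned above.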