Isilon boot disk – Mirror is degraded


About 7 or 8 months ago we installed two 3-node Isilon clusters at a hospital, destined to host PACS data. Since then it’s been all quiet on the Isilon front: no hardware failures, no performance problems. No complaints there of course, but also.. slightly.. boring. FINALLY, a couple of days ago the Isilon sent us an email. “Device disconnected. Boot mirror is critical. Unhealthy Isilon boot disk. Mirror is degraded.” That kind of stuff. Woohoo, ACTION!

These are not the Isilon boot disks you’re looking for…

Upon opening the Isilon GUI, the node featured a yellow warning light. Drilling down to the hardware and the disks, all disks showed green and healthy… wait, what?

A quick ECN / Support search later, it turned out that the Isilon OS is not loaded on the disk drives themselves (in this case 36 1TB drives in an NL400 node), but on two dedicated 8GB SSDs inside the node itself. Don’t think 2.5″ form factor; think SSD-chip-on-a-PCB-with-a-SATA-connector. Quite clever: this means there’s no lost capacity or performance limitation on your “normal” drives (remember CLARiiON / VNX, where you can’t fully utilize your vault drive performance?). But it also means a boot disk is not a customer replaceable unit (CRU): an EMC customer engineer (CE) needs to come in. A quick live chat later and that was arranged…

Replacement procedure

Replacement is pretty straightforward. Diagnose which disk has failed. Shut down the node. Use the procedure guide to determine which slot holds the failed disk. Replace the Isilon boot disk, boot the node and you’re pretty much done.
Let me reiterate the fact that this procedure needs to be performed by an EMC customer engineer. Do not attempt these steps yourself, since if you pull out the wrong disk you’re looking at a node failure and possible data loss.

The errors indicated that disk ad7 had failed. Listing the drives is done with the “atacontrol list” command:

node2# atacontrol list
[..]
ATA channel 2:
    Master:  ad4 <SanDisk SSD P4 8GB/SSD 8.10> Serial ATA v1.0 II
    Slave:       no device present
ATA channel 3:
    Master:      no device present
    Slave:   ad7 <SanDisk SSD P4 8GB/SSD 8.10> Serial ATA v1.0 II
[..]

This doesn’t really help: both drives were still visible, so we still had no way to double-check which drive had failed. The “gmirror status” command is more useful:

node2# gmirror status
                 Name    Status  Components
         mirror/root0  COMPLETE  ad4p4
                                 ad7p4
     mirror/var-crash  COMPLETE  ad4p11
           mirror/mfg  COMPLETE  ad4p10
                                 ad7p9
mirror/journal-backup  COMPLETE  ad4p8
                                 ad7p8
          mirror/var1  COMPLETE  ad4p7
                                 ad7p7
          mirror/var0  COMPLETE  ad4p6
                                 ad7p6
         mirror/root1  COMPLETE  ad4p5
                                 ad7p5

In this example all partitions and mirrors on the boot disks are healthy. If something is wrong, the status will be degraded and the unhealthy component will be missing: for example, mirror/root0 will be degraded and only list ad4p4. That means ad7 is the broken disk.
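If you'd rather not eyeball that table on every node, the same check can be scripted. What follows is only a rough sketch, not anything from the Isilon tooling: it assumes you have captured the text of “gmirror status” somehow, and the helper names (parse_gmirror_status, missing_disks) and the DEGRADED sample output are made up for illustration, modelled on the failure case described above.

```python
import re
from collections import defaultdict

# Hypothetical sample: what a degraded root0 mirror might look like,
# based on the description in the text (not captured from a real node).
SAMPLE = """\
                 Name    Status  Components
         mirror/root0  DEGRADED  ad4p4
         mirror/root1  COMPLETE  ad4p5
                                 ad7p5
"""

def parse_gmirror_status(text):
    """Return {mirror_name: (status, [components])} from gmirror status text."""
    mirrors = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Name":
            continue  # skip blank lines and the header row
        if fields[0].startswith("mirror/"):
            # New mirror row: name, status, first component (if any)
            current = fields[0]
            mirrors[current] = (fields[1], fields[2:3])
        elif current is not None:
            # Continuation row: another component of the current mirror
            mirrors[current][1].append(fields[0])
    return mirrors

def missing_disks(mirrors, expected_disks):
    """Map each expected disk to the non-COMPLETE mirrors it is absent from."""
    suspects = defaultdict(list)
    for name, (status, components) in mirrors.items():
        if status == "COMPLETE":
            continue
        # Strip the partition suffix: ad7p4 -> ad7
        present = {re.sub(r"p\d+$", "", c) for c in components}
        for disk in sorted(set(expected_disks) - present):
            suspects[disk].append(name)
    return dict(suspects)

if __name__ == "__main__":
    mirrors = parse_gmirror_status(SAMPLE)
    print(missing_disks(mirrors, ["ad4", "ad7"]))
```

With the sample above this points at ad7 as the disk missing from mirror/root0, which matches the reasoning in the text. The list of expected disks is passed in by hand on purpose: a disk that has dropped out of a mirror no longer shows up in the output, so the script cannot discover it on its own.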

Now use the procedure guide to locate the slot that holds the failed Isilon boot disk, shut down and open the node, swap the disk, put the node back together and boot. Verify that the boot disk is detected correctly and the mirrors are healthy, then clear the errors. Job done!

Or is it?

Once we powered the node back on, the front panel would light up but the LCD panel itself was inoperable. Swapping front panels between nodes showed that the problem was in the node rather than in the front panel. I recalled that there’s a service called isi_lcd_d; a quick process check revealed it wasn’t running. Starting it back up resulted in a couple of errors in /var/log/messages (could not detect LCD device). Starting it a second time fixed that though, spawning three isi_lcd_d threads (server, manager, buttons). And voilà, front panel working again.