CLARiiON CX3 vault drive failure

Yesterday I received a call about a drive failure in a CLARiiON CX3-80 storage array. Since every system has at least a couple of hot spares configured, a failed drive usually does not pose a problem. But this failure happened in drive position 0_0_0: a vault drive failure.

The vault drives host, among other things, the operating system and configuration of the storage array. They are also used to destage the write cache in case of an event that might threaten data integrity.

As an example, take a power failure: the write cache is volatile, so once the array loses power, the writes sitting in that cache would be lost. To make sure this data loss never happens, everything in the write cache is written (or destaged) to the vault drives when the power goes down. The temporary power to do this comes from the standby power supplies (SPS for short), which power the storage processors and the disk enclosure that contains the vault drives. Once power returns, the storage system starts up again, notices it has uncommitted writes in the vault, and writes them to the proper drives throughout the storage system.
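Conceptually, the destage-and-replay cycle looks like this. A minimal Python sketch; the class and method names are invented for illustration and are not actual CLARiiON internals:

```python
# Conceptual model of write-cache destage and replay.
# All names here are illustrative assumptions, not CLARiiON internals.

class StorageArray:
    def __init__(self):
        self.write_cache = {}  # volatile: block address -> data
        self.vault = None      # persistent vault area on the first drives
        self.disks = {}        # persistent backing store (the data drives)

    def write(self, block, data):
        # Host writes are acknowledged from cache and committed later.
        self.write_cache[block] = data

    def power_fail(self):
        # On power loss, the SPS keeps the SPs and the vault enclosure up
        # just long enough to dump (destage) the volatile cache to the vault.
        self.vault = dict(self.write_cache)
        self.write_cache = None  # cache contents are gone with the power

    def boot(self):
        # On power-up, uncommitted writes found in the vault are replayed
        # to their proper locations across the array.
        if self.vault:
            self.disks.update(self.vault)
            self.vault = None
        self.write_cache = {}

array = StorageArray()
array.write("lun0:42", b"payload")
array.power_fail()
array.boot()
print(array.disks["lun0:42"])  # the cached write survived the outage
```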

Of course this area of the vault is protected with parity RAID: it’s no good dumping your write cache to unprotected storage only to have it lost because a vital drive does not spin up after the power failure. But if a vault drive fails, this protection is lost. What happens next is hard to miss:

Write cache disabled after vault drive failure

A CLARiiON CX3 system will automatically disable the write cache to make sure the data is never lost. It is good to know that newer systems like the CX4 or the VNX do not disable write cache, so you only need to worry about this on the CX3 and older.
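The parity protection mentioned earlier can be illustrated with XOR, the basic mechanism behind RAID 5-style parity. A toy Python sketch, not the CLARiiON’s actual on-disk layout:

```python
# Toy XOR parity: with N data drives plus one parity drive, any single
# missing drive can be reconstructed from the survivors.
data = [b"\x10\x20", b"\x33\x44", b"\x0f\xf0"]  # three "drives"
parity = bytes(a ^ b ^ c for a, b, c in zip(*data))

# Drive 1 fails: rebuild it by XOR-ing parity with the remaining drives.
rebuilt = bytes(p ^ a ^ c for p, a, c in zip(parity, data[0], data[2]))
assert rebuilt == data[1]

# With one drive already gone, no redundancy is left: a second failure is
# unrecoverable. That is why the array stops dumping its write cache to a
# degraded vault and disables write caching instead.
```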

The only way to re-enable write cache is to actually replace the physical drive. The initial failure should have already opened a service request with EMC, but even with premium support it is likely going to take several hours before a new drive is on-site. Since the array with write cache disabled is unbearably slow, a faster workaround is desired. Your best bet at this point is grabbing a hot spare and replacing the defective vault drive.

  1. Identify the size and speed of your vault drive.
  2. Locate a suitable drive in the array that you can scavenge (e.g. a hot spare).
  3. Destroy the hot spare (remove LUN, remove RAID group).
  4. Remove the donor drive.
  5. Remove the defective vault drive and replace it with the donor drive.
  6. Wait for the rebuild to complete. The write cache will re-enable automatically.
  7. Use the replacement drive from EMC to rebuild the hot spare.

As an added note on step 6: once you replace the defective vault drive, the CLARiiON will first rebuild the OS part of the vault drive, followed by the data in any RAID group built on that drive. Write cache will ONLY be re-enabled after all rebuilds (i.e. OS + user data) have completed! If you have fully allocated the RAID group with LUNs, the rebuild will take a considerable amount of time. You can of course play around with the rebuild priorities, but for the quickest possible recovery: keep the vault drives unbound (i.e. don’t create a RAID group on these drives).
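To get a rough feel for why unbound vault drives recover so much faster, here is a back-of-the-envelope estimate. The OS area size, user data size, and rebuild rate below are assumed example figures, not measurements from this array:

```python
# Back-of-the-envelope rebuild time estimate; all figures are
# illustrative assumptions, not CX3 specifications.
os_area_gb = 30          # vault/OS reserved area per drive (assumed)
user_data_gb = 250       # bound user data on the same drive (assumed)
rebuild_rate_mb_s = 20   # sustained rebuild rate at default priority (assumed)

def rebuild_hours(gigabytes, rate_mb_s):
    """Hours needed to rebuild the given capacity at the given rate."""
    return gigabytes * 1024 / rate_mb_s / 3600

print(f"OS area only:   {rebuild_hours(os_area_gb, rebuild_rate_mb_s):.1f} h")
print(f"OS + user data: {rebuild_hours(os_area_gb + user_data_gb, rebuild_rate_mb_s):.1f} h")
```

With these assumed numbers the difference is the better part of a workday of disabled write cache, which is why leaving the vault drives unbound pays off.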

  • This one is for the “nice to know” section. Thanks Jon!

  • Nice start to the new blog Jon. It does raise one question though. Why wouldn’t a Hot Spare kick in automatically for a drive failure in the Vault since it would anywhere else in the array (at least anywhere in use)?

    • Thanks! Hmm, looks like I need to configure some email notifications on these comments 🙂

      Apparently this is because the CLARiiON cannot dictate which drive kicks in as the hot spare. If that hot spare is outside of DAE 0_0, it is not protected by the SPS. A power failure would then put the vault into degraded mode again, with the added risk of data loss.
      So they have deemed it safer not to rebuild the OS onto a spare and to rely on the vault’s own triple-mirror/parity protection to keep the OS safe.

      The irony is that in this specific array there was a perfectly OK drive located in slot 0_0_14: SPS powered and available for rebuild.