A big change in the VNX2 hot spare policy compared to earlier VNX or CLARiiON models is the use of permanent hot spares. Whereas the earlier models had dedicated, configured hot spares that were only used for the duration of a drive failure, the VNX2 will spare to any eligible unused drive and NOT switch back to the original drive. I’ve written about this and other new VNX2 features before, but hadn’t had the chance to try it first hand. Until now: a drive died, yay! Continue reading to see how you can back-track which drive failed, which drive replaced it, and how to move the drive back to the original physical location (should you want to).
Finding the original drive and hot spare
This system is configured with one big pool of fifty 600GB 2.5″ drives, plus two unbound 600GB 2.5″ drives for hot sparing purposes. Initially drive 1_0_19 started to experience soft-media errors and eventually failed. The rebuild claimed unbound drive 1_0_24 and rebuilt the data onto it. Two days later drive 1_0_24 started throwing errors as well and failed; the VNX selected drive 0_0_4 for the rebuild. The data that is now on 0_0_4 will not be copied back to the fresh 1_0_19 drive, as that’s the way the VNX2 hot spare system works. So how can you find where MCx moved your data? Time to dive into the SP event logs!
With Unisphere open and your system selected, head to System > Monitoring and Alerts > SP Event Logs. Open the event logs of both SPs, since the two logs differ; in my case, the hot spare alerts were on SPB. You should know the time and date of the drive failure from the alert: browse until you find something like this:
You can see drive 1_0_19 throwing errors until (on line 760) drive 1_0_24 is swapped in. We can do the same for the second drive failure:
Line 317 at the top shows the rebuild of drive 1_0_24 to 0_0_4. If you want to cut down on the noise and just look at the rebuilds, filter your event log for event code 0x7168. This also helps you find out whether the proactive drive replacement process was started. For example, for the initial failure of 1_0_19 the drive just failed, whereas for the second drive failure the proactive copy was started (but did not finish in time):
In most cases you would just replace the failed drives and the system would continue to run without problem. But what if, for some reason, you wanted to move the data back to drive slot 1_0_19? Well… no problem!
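If you export the event log to a text file, the filtering and the back-tracking can also be done offline. The snippet below is only a rough sketch: it assumes each exported line contains the hex event code somewhere in its text (I'm not relying on any particular export layout), and both function names are made up for this post. The first helper keeps the lines that mention a given event code; the second follows the failed-drive/spare pairs you noted down from those lines.

```python
import re


def filter_by_event_code(lines, code="7168"):
    """Return only the lines that mention the given hex event code.

    Assumption: the code appears as plain text in the line, with or
    without a leading 0x. A line that happens to contain the same
    digits for another reason would also match.
    """
    pattern = re.compile(
        r"(?<![0-9a-f])(?:0x)?" + re.escape(code) + r"(?![0-9a-f])",
        re.IGNORECASE,
    )
    return [line for line in lines if pattern.search(line)]


def current_location(original_slot, swaps):
    """Follow (failed_drive, spare_drive) pairs, oldest first,
    and return the slot that holds the data now."""
    location = original_slot
    for failed, spare in swaps:
        if failed == location:
            location = spare
    return location


# The two rebuilds described in this post, in chronological order.
swaps = [("1_0_19", "1_0_24"), ("1_0_24", "0_0_4")]
print(current_location("1_0_19", swaps))  # prints 0_0_4
```

Passing an open file object to filter_by_event_code works too, since it only iterates over lines. The swap list is deliberately typed in by hand: reading the drive IDs out of the log text automatically would mean guessing at a message format.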
MCx Drive Mobility
There are two ways to move the data back to the original drive slot. The first is the copytodisk command: there’s no need to get out of your (hopefully) comfy chair, but it does put the VNX to work, which might not be desirable under high load. The other way is the MCx Drive Mobility feature: remove a drive and reinsert it within 5 minutes and the VNX will recognize it again. No full rebuild required, fast and easy. First of all though, check that all rebuilds have finished: you don’t want to pull out a drive that’s in the middle of a rebuild. You can use naviseccli -h <ipaddr> getdisk <diskID> -rb, for example, getdisk 0_0_4 -rb. Next, swap the drive positions and you’ll see the following in the logs:
You can see the unbound disk being pulled (line 36), followed by the disk that contains data (line 34). Your RG or storage pool will probably not be idle while you do this, so by the time you reseat the drive it will have missed some writes. These missed writes are logged (line 29, rebuild logging). Upon re-inserting the drive containing the data in 1_0_19 (line 22) the rebuild logging stops (line 21) and the missed writes are rebuilt on the drive (lines 20 and 18). Lastly, I’ve reinserted the empty, unbound drive in 0_0_4 (line 19). And we’re done!
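Going back to the pre-swap check for a moment: since you want to confirm the rebuild state of every drive you are about to touch, it can help to generate the getdisk call per drive rather than typing it each time. This is a minimal sketch that only assembles the command shown earlier; the function name is my own, 192.0.2.10 is a placeholder address, and authentication options are left out, just as they are in the command above. It does not try to interpret the output.

```python
import re

# Disk IDs in this post have the form bus_enclosure_slot, e.g. 0_0_4.
DISK_ID = re.compile(r"\d+_\d+_\d+")


def rebuild_check_command(sp_address, disk_id):
    """Build the argument list for: naviseccli -h <ipaddr> getdisk <diskID> -rb"""
    if not DISK_ID.fullmatch(disk_id):
        raise ValueError(f"not a bus_enclosure_slot disk ID: {disk_id!r}")
    return ["naviseccli", "-h", sp_address, "getdisk", disk_id, "-rb"]


# Placeholder SP address; the two drives are the ones swapped in this post.
for disk in ("0_0_4", "1_0_19"):
    print(" ".join(rebuild_check_command("192.0.2.10", disk)))
```

The list form can be handed to subprocess.run as-is on a host where naviseccli is installed. Rejecting malformed IDs up front is a small guard against typos such as 0-0-4.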
My thoughts about Drive Mobility: it’s an amazing feature… and I would probably NOT use it in production to move drives around! Especially in a 25-disk DAE with the tiny 2.5″ drives it’s incredibly easy to pull the wrong drive. Don’t forget you’re moving healthy drives, so there’s no drive fault LED to guide you. Anyone who works with VNX systems regularly knows that drive numbers start at 0. But picture the scenario where someone starts counting at 1 and pulls the wrong drive, then immediately pulls another one (since he’s doing a swap). Poof, two drive failures: bye bye data. Yeah, the system will log the writes, but if you need to read from a RAID5 RG that now has two drives down… hmmm. So my advice: unless you really know what you’re doing and are paying attention, use the copytodisk command!
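To make the off-by-one trap concrete, here is a tiny sketch (the helper name is hypothetical) that spells out a disk ID both ways. It only does the arithmetic; it says nothing about where the slot physically sits in your enclosure.

```python
def describe_slot(disk_id):
    """Split a bus_enclosure_slot ID and show the count-from-one position."""
    bus, enclosure, slot = (int(part) for part in disk_id.split("_"))
    return (
        f"bus {bus}, enclosure {enclosure}, slot {slot} "
        f"(drive number {slot + 1} when counting from one)"
    )


print(describe_slot("1_0_19"))
# prints: bus 1, enclosure 0, slot 19 (drive number 20 when counting from one)
```

Someone counting from one who stops at "19" ends up at slot 18, one drive short of the one they meant to pull.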