XtremIO is the new all-flash array from EMC, announced not too long ago. Flash has an enormous performance advantage over traditional spinning disks. But although a solid state drive has no moving parts, it can still fail! Data on the XtremIO X-Brick still needs to be protected against one or multiple drive failures. In traditional arrays (or servers) this is done using RAID (Redundant Array of Independent Disks). We could simply use RAID in the XtremIO array, but SSDs behave fundamentally differently from spinning disks. So while we’re at it, why not reinvent our approach to protecting data? This is where XtremIO XDP comes in.
Redundant Array of Independent Disks
RAID is pretty simple. Spread your data across a large number of drives so that you benefit from the increased speed each additional drive brings, and add a couple of “redundant” drives so that you can rebuild your data in case a drive fails. Common RAID levels are RAID0 (speed increase, no data protection), RAID1 (mirroring), RAID5 (single parity, survives one drive failure) and RAID6 (double parity, survives two drive failures). Each RAID level has advantages and disadvantages. RAID1, for example, offers superior performance but you lose 50% of your capacity. RAID6 offers superior protection against two drive failures but carries a hefty write performance penalty, since you need to calculate two sets of parity for every write. RAID0 has the best performance and the most effective use of capacity, but lose a drive and your data is gone…
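To put some rough numbers on those trade-offs, here’s a minimal back-of-the-envelope sketch (my own illustration, not anything from the XtremIO documentation) of usable capacity and the classic worst-case small-write penalties for these RAID levels:

```python
# Rough usable-capacity fractions and worst-case small-write penalties for the
# classic RAID levels across a group of n drives. Purely illustrative numbers.
def raid_summary(n):
    return {
        # level: (usable fraction of raw capacity, back-end I/Os per small random write)
        "RAID0": (1.0,         1),  # striping only, no protection at all
        "RAID1": (0.5,         2),  # mirroring: every write lands on two drives
        "RAID5": ((n - 1) / n, 4),  # read old data + old parity, write new data + new parity
        "RAID6": ((n - 2) / n, 6),  # same, but with two parity blocks to read and update
    }

for level, (usable, penalty) in raid_summary(25).items():
    print(f"{level}: {usable:.0%} usable capacity, write penalty {penalty}")
```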
Protecting data on Flash
So what do you want when protecting data on an all-flash array?
- Low capacity overhead & high performance. Mirroring is out of the question: that would waste 50% of your expensive SSD capacity. Parity protection is much better, with dual parity (RAID6-style) offering the best protection. It would be best to use the largest possible stripe size (i.e. the number of disks in the protection group). An X-Brick has 25 drives, so we can wide-stripe to 23+2. This offers high performance and at the same time low capacity overhead (8%).
- Fast rebuilds, but with low impact on host I/O. This conflicts with the wide striping above: rebuilding parity RAID means reading all remaining data and parity blocks and reconstructing the missing data blocks. The wider the stripe, the more I/O happens during the rebuild, impacting host I/O and system load (the sketch after this list quantifies the trade-off).
- Endurance. SSDs degrade over time: flash cells wear with each write cycle. Consumer MLC flash cells can handle anything between 3,000 and 10,000 write cycles. XtremIO uses eMLC, which survives 30,000 cycles. But that doesn’t mean we can waste these cycles.
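Here’s a quick back-of-the-envelope sketch (my own numbers, nothing official) of that trade-off between capacity overhead and rebuild I/O as the stripe gets wider, assuming a dual-parity N+2 layout where a traditional rebuild has to read every surviving block in a stripe:

```python
# Trade-off for a dual-parity N+2 stripe: wider stripes cost less parity
# overhead, but a rebuild has to read more surviving blocks per stripe.
def stripe_stats(data_disks, parity_disks=2):
    total = data_disks + parity_disks
    overhead = parity_disks / total   # fraction of raw capacity spent on parity
    rebuild_reads = total - 2         # surviving data disks + row parity block, per stripe
    return overhead, rebuild_reads

for data_disks in (4, 8, 14, 23):
    overhead, reads = stripe_stats(data_disks)
    print(f"{data_disks}+2: {overhead:.0%} parity overhead, {reads} reads per stripe rebuild")
# 23+2 gives 8% overhead and 23 reads per stripe rebuild, matching the numbers in this post
```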
XtremIO Data Protection (XDP)
To achieve these goals, XtremIO XDP is designed from the ground up to work on an all-flash system. What’s the biggest difference between traditional spinning disks and flash? Random access! Throw a random access pattern at a spinning disk and the drive head will go crazy across the platter to read/write the requested data, resulting in high latency. A solid-state drive does not have this limitation: response times for 100% random access are in line with sequential access patterns. So XtremIO XDP doesn’t need to keep data contiguous anymore: it can write it anywhere it wants to.
Writing anywhere you want is awesome in an empty array, since you can perform full stripe writes which have a low overhead. But surprise surprise: most customers actually want to store some data on the array! Arrays that are at 80% or more utilization are fairly common. The typical answer to this is implementing a garbage collection mechanism that moves data around to create new, completely empty stripes. This generates a substantial back-end workload that impacts performance and also wears down the SSD cells with additional write cycles. Increase your utilization to 90% or even higher and garbage collection gets less and less effective, since there’s less space to maneuver/free up, resulting in more frequent garbage collection cycles.
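To see why, here’s a toy model (my own simplification, not XtremIO’s or any vendor’s actual garbage collector): if stripes are on average a fraction u full, freeing one completely empty stripe means first copying its live blocks elsewhere, so you move roughly u / (1 − u) blocks for every block of space you actually reclaim.

```python
# Toy garbage-collection model (not XtremIO's actual implementation):
# to free one completely empty stripe when stripes are on average `util` full,
# the collector first has to relocate the live blocks in it. Data copied per
# block of space actually reclaimed is util / (1 - util), which explodes as
# the array fills up: exactly the "less space to maneuver" problem above.
for util in (0.50, 0.80, 0.90, 0.95):
    moved_per_freed = util / (1 - util)
    print(f"{util:.0%} full: ~{moved_per_freed:.0f} block(s) copied per block reclaimed")
```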
It’s much easier and more stable in the long term to throw the garbage collection mechanism away and optimize your system to deal with partial stripe updates instead.
XtremIO XDP ranks the stripes in the X-Brick by their percentage of free space. New data is written to the emptiest stripe, thus incurring the least possible write penalty. Let’s calculate with an 80% full array: 20% free space spread across the stripes means some stripes are completely full while the emptiest ones are roughly 40% empty (2x 20%). A stripe width of 23+2 drives at 40% free space means we can update 9.2 blocks (everything is possible when using statistics; even partial block updates 😉 ). We’ll have to read the existing data (9.2 reads), read the existing parity (2 reads due to dual parity), write two new parity blocks (2 writes) and write the new data blocks (9.2 writes). That adds up to a total write penalty of roughly 2.44. Compare this to the worst-case penalty of 6 for RAID6 and XDP performs quite nicely!
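Working that arithmetic out explicitly (this just reproduces the numbers from the paragraph above, it’s not an official EMC formula):

```python
# XDP partial-stripe update cost for the example above:
# 23+2 stripe, array 80% full, emptiest stripe roughly 40% empty.
data_disks, parity_disks = 23, 2
emptiest_stripe_free = 2 * 0.20                  # 2x the array-wide 20% free space

new_blocks = data_disks * emptiest_stripe_free   # 23 * 0.4 = 9.2 blocks per update

reads = new_blocks + parity_disks                # existing data + both parity blocks
writes = new_blocks + parity_disks               # new data + both updated parity blocks

print(f"reads per block written:  {reads / new_blocks:.2f}")            # ~1.22
print(f"writes per block written: {writes / new_blocks:.2f}")           # ~1.22
print(f"total write penalty:      {(reads + writes) / new_blocks:.2f}") # ~2.43, i.e. the ~2.44 above
```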
Sure, compared to a full stripe write this approach still brings along a bit of extra overhead. But there’s no such thing as a free lunch, and the alternative is worse: a garbage collection cycle will also impose a pretty big, scheduled load on the back-end and, as an added bonus, wear down your SSD cells with the constant data relocations. If you’re using your all-flash array for the workloads it’s built for (high-performance, mission-critical applications driving insane amounts of I/O), that garbage collection is going to interfere with your front-end I/O. And if there’s one thing you don’t want in an enterprise application, it’s unpredictable performance…
Drive failure
So what happens if one of the drives in that massive 23+2 set fails? Is the XtremIO X-Brick going to read back all the data on all the drives to reconstruct the missing drive? And what about hot spares?
Rebuilding data in case of a single disk failure in a parity RAID means reading back all the data from the surviving disks row by row and using the row parity to reconstruct the failed drive. That’s 22 surviving data disks & the (rotating) row parity disk, aka 23 I/Os per stripe rebuild. This is of course a tremendous amount of data and places quite a load on the back-end.
XDP attempts to do this a bit more efficiently by keeping some of the data read during the rebuild in memory for a short amount of time and taking advantage of the diagonal parity. Stripes/rows 1 and 2 are rebuilt using row parity (in the above example that’s 5 read I/Os per stripe rebuild). Now if we want to rebuild block D0-4, we can take advantage of blocks D3-4 and D2-4 that are already in memory: read back D1-4 and Q4 and we’re done. That’s only 2 read I/Os for stripe number 4 to be rebuilt. The same trick works for stripe 3: read back D4-3 and Q3.
For the X-Brick this means that instead of 23 read I/Os per stripe rebuild, you’ll only need approximately 17.25 I/Os per stripe rebuild. A nice 33% faster rebuild… (or a 25% lower read load, if you’re a “glass half empty” guy/gal).
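The two percentages are the same ratio viewed from different sides; a quick sanity check (nothing official, just arithmetic on the figures above):

```python
# Rebuild reads per stripe: traditional row-parity rebuild vs. XDP's
# diagonal-assisted rebuild, using the figures quoted in the text.
traditional_reads = 23
xdp_reads = 17.25

speedup = traditional_reads / xdp_reads - 1    # ~0.33 -> roughly 33% faster rebuild
reduction = 1 - xdp_reads / traditional_reads  # 0.25 -> 25% fewer read I/Os per stripe

print(f"~{speedup:.0%} faster rebuild, {reduction:.0%} fewer reads per stripe")
```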
From an administrative perspective, an X-Brick does not have a dedicated hot spare. If a drive fails, the X-Brick automatically writes new data in a 22+2 configuration (thus still guaranteeing data protection) and starts rebuilding the existing data. Since there’s no hot spare, the capacity for this rebuild is subtracted from the total array capacity. This can go on and on, without major performance impact, until you run out of space or hit the limit of 6 failed SSDs per X-Brick.
Once you reinsert a healthy drive, the data is rebuilt and the capacity is re-added. If your on-site guy accidentally pulls out the wrong disk, he can now quickly push it back in and XtremIO XDP will recognize the disk and abort the rebuild. Be warned: this obviously doesn’t work if you already have a dual disk failure that hasn’t fully rebuilt yet!