Roughly 6-7 years ago (around 2012), flash storage became affordable as a performance tier. At least, for the companies I was visiting. It was the typical “flash tier” story: buy 1-2% of flash capacity to speed everything up. All-flash storage systems were still a long way off for them. They existed, and they were incredibly fast, but their €/GB price was far out of reach.
However, in the background you could already hear the drums: it is going to be an all-flash future! Not just for performance, but also for capacity/archive storage. In fact, one of those people beating that drum was my colleague Rob. I recall vividly our “not yet!”-discussions…
And it makes sense. Solid-state drives are:
- More reliable: there are no moving parts in SSDs, and media failures are easier to correct with software/design.
- More power efficient: power consumption is very low at rest, because there is no little motor to keep platters spinning 24/7.
- Faster: the number of heads and the rotational speed of the platters limit a hard drive’s performance. Not so with flash!
They are still quite expensive in €/TB terms. Fortunately, cost is coming down too. Over the last year or two, all-flash arrays have taken flight in general-purpose workloads. Personally, I have not installed a traditional tiered SAN storage system in over a year. Hyper-converged infrastructure: same story, all flash. The development of newer, cheaper types of QLC flash only helps close the gap in €/GB between HDD and SSD. But there is still a 20x gap. And one company we met at Storage Field Day 18 has a pretty solid plan to bridge that gap: VAST Data.
Meet VAST Data
VAST Data was founded in 2016 and released an alpha version of its product in 2017. Their mission is to “kill the HDD”. In effect, what they build is a system that delivers all-flash performance (writing at Storage Class Memory (SCM) speeds, reading at TB/s and millions of IOPS), at tier 5 cost efficiency and exabyte scale, with NFS and S3 protocols on the front-end.
They do this by using the lowest cost flash available, not wasting any space on excessive metadata and redundancy, and employing some clever data reduction algorithms.
How low can you go!
The price difference in flash is endurance based, not performance based. The cheapest QLC flash survives only 500 write cycles. Put this flash in a typical, general-purpose all-flash array, and it will wear out in several weeks. On the other side of the spectrum is Enterprise QLC, which can cope with 50,000 write cycles but is much more expensive.
VAST Data uses the 500 write cycle QLC because it is cheapest. In fact, they claim they could work with even cheaper flash that only has a 50 write cycle lifespan. That flash would be torn to shreds if every individual write landed on it, so the VAST Data appliance needs a write buffer. This write buffer is created from several 3D XPoint SSDs per controller, sized to roughly 1% of the total QLC flash capacity. The 3D XPoint is used solely for write caching; there is no read caching, as the QLC flash backend is fast enough. The RAM in the appliance also doesn’t do caching, but is used for some locking information.
Existing all-flash storage arrays have high write amplification, mostly due to garbage collection and NAND writing patterns. This is often in the range of 50-100x. VAST Data does this differently by writing to the cheap QLC in “super erase blocks”. This saves a lot of unnecessary write cycles.
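To get a feel for why write amplification matters so much for 500-cycle flash, here is a back-of-the-envelope sketch. The 500 write cycles and the 50x write amplification come from the presentation; the 10 TB drive size and the 2 TB of host writes per day are numbers I made up purely for illustration, and the formula ignores wear-leveling inefficiency and over-provisioning.

```python
def drive_lifetime_days(capacity_tb, pe_cycles, host_writes_tb_per_day,
                        write_amplification):
    """Days until the rated program/erase cycles of a drive are used up."""
    # Total data the NAND can absorb before it reaches its rated cycle count.
    total_nand_writes_tb = capacity_tb * pe_cycles
    # Every host write turns into write_amplification times as much NAND traffic.
    nand_writes_tb_per_day = host_writes_tb_per_day * write_amplification
    return total_nand_writes_tb / nand_writes_tb_per_day


# Hypothetical 10 TB drive receiving 2 TB of host writes per day (assumed values).
print(drive_lifetime_days(10, 500, 2, write_amplification=50))  # 50x amplification
print(drive_lifetime_days(10, 500, 2, write_amplification=1))   # ideal case, no amplification
```

With these assumed inputs the drive lasts 50 days at 50x write amplification, which matches the "several weeks" mentioned earlier, versus 2,500 days if every host write hit the NAND exactly once.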
The smallest VAST Data system you can buy is a 1PB unit, consisting of one 2U JBOF (or data node) and 4 containers/servers in a 2U chassis. For more technical specs on those devices, check Dan’s post. The SSDs are then striped extremely wide: 150+4, or 500+10 for the bigger systems. VAST Data keeps this wide striping CPU friendly during rebuilds with a newly invented erasure coding map that doesn’t need to read all drives.
How is this kept CPU friendly? New erasure coding map invented by @VAST_data, doesn’t need to read all drives. In practice: 500+10 needs to read 500/10=50 drives for rebuilding #SFD18 https://t.co/NHkzqnexJH
— Jon Klaus (@JonKlaus) February 27, 2019
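The appeal of such wide stripes shows up in simple arithmetic. The sketch below computes the share of raw capacity spent on parity for the two layouts mentioned above, next to a hypothetical traditional 8+2 layout that I added for comparison. The rebuild function just reproduces the 500/10=50 ratio from the tweet; how VAST Data's erasure coding map actually achieves that was not something I can show here.

```python
def parity_overhead(data_strips, parity_strips):
    """Fraction of raw capacity spent on parity in a data+parity stripe."""
    return parity_strips / (data_strips + parity_strips)


def rebuild_reads_per_tweet(data_strips, parity_strips):
    """Drives read during a rebuild, using the data/parity ratio from the tweet."""
    return data_strips / parity_strips


# 8+2 is an assumed "traditional" layout, not something from the presentation.
for data, parity in [(8, 2), (150, 4), (500, 10)]:
    print(f"{data}+{parity}: {parity_overhead(data, parity):.1%} parity overhead")

print(rebuild_reads_per_tweet(500, 10), "drives read to rebuild in a 500+10 stripe")
```

A classic Reed-Solomon style rebuild would have to read as many strips as there are data strips, so 500 reads for a 500+10 stripe instead of 50.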
The data is then presented to the outside world over NFS and S3, with other protocols to be announced when the product matures and demand for these protocols increases.
VAST Data’s data reduction works with similarities between data
Historically, aggressive data reduction has been a tradeoff between fine granularity and a global view across the nodes in a cluster. Compression works at fine granularity (byte range), but it is a local process. Deduplication is a global process, but it works at coarse granularity (KB range).
What VAST Data does is similarity-based, global data reduction. Data is fingerprinted in large blocks after the write is persisted in SCM. These fingerprints are then compared to measure relative distance: similar chunks are then clustered. The clustered data is then compressed together (yielding high compression ratios) and a reference block is stored. Lastly, byte-level deltas on this reference block are extracted and stored.
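To make the fingerprint, cluster, reference and delta steps a bit more tangible, here is a toy sketch. It is emphatically not VAST Data's algorithm: the shingle-based CRC32 fingerprint, the 0.5 similarity threshold, the greedy "first similar reference wins" clustering and the XOR delta are all stand-ins I picked to keep the example short, and every function name is hypothetical.

```python
import hashlib
import zlib

SHINGLE = 8    # sliding-window size in bytes (arbitrary choice for this sketch)
FEATURES = 4   # number of hash values kept per fingerprint (also arbitrary)


def fingerprint(block: bytes) -> tuple:
    """Similarity fingerprint: the smallest CRC32 values over all shingles."""
    hashes = {zlib.crc32(block[i:i + SHINGLE])
              for i in range(len(block) - SHINGLE + 1)}
    return tuple(sorted(hashes)[:FEATURES])


def similarity(fp_a: tuple, fp_b: tuple) -> float:
    """Fraction of fingerprint features the two blocks share."""
    return len(set(fp_a) & set(fp_b)) / max(len(fp_a), 1)


def xor_delta(block: bytes, reference: bytes) -> bytes:
    """Byte-level delta; XOR-ing it with the reference again restores the block."""
    return bytes(a ^ b for a, b in zip(block, reference))


def reduce_blocks(blocks, threshold=0.5):
    """Store each block as a compressed reference or as a compressed delta."""
    references = []  # (fingerprint, raw block) for every reference block
    stored = []      # (kind, reference index, compressed payload)
    for block in blocks:
        fp = fingerprint(block)
        # Greedy clustering: use the first equally sized, similar reference.
        match = next((i for i, (ref_fp, ref) in enumerate(references)
                      if len(ref) == len(block)
                      and similarity(fp, ref_fp) >= threshold), None)
        if match is None:
            references.append((fp, block))
            stored.append(("ref", len(references) - 1, zlib.compress(block)))
        else:
            delta = xor_delta(block, references[match][1])
            stored.append(("delta", match, zlib.compress(delta)))
    return references, stored


def restore(references, stored):
    """Rebuild the original blocks from references and deltas."""
    blocks = []
    for kind, ref_index, payload in stored:
        data = zlib.decompress(payload)
        if kind == "delta":
            data = xor_delta(data, references[ref_index][1])
        blocks.append(data)
    return blocks


def demo_block(tag: int) -> bytes:
    """4 KiB of deterministic pseudo-random bytes (made-up demo data)."""
    return b"".join(hashlib.sha256(bytes([tag, i])).digest() for i in range(128))


base = demo_block(0)
variant = base[:-1] + bytes([base[-1] ^ 0xFF])  # differs from base in one byte
other = demo_block(1)                           # unrelated data
references, stored = reduce_blocks([base, variant, other])
for kind, ref_index, payload in stored:
    print(kind, ref_index, len(payload), "bytes stored")
```

The demo data is incompressible on its own, so plain compression gains nothing. The near-duplicate block is still stored as a tiny delta against its reference, which is the effect similarity-based reduction is after. A real system would also need a global, scalable index of fingerprints, which this sketch skips entirely.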
SCM accounts for approximately 1% of total storage capacity in a VAST Data system. In traditional systems, the disks are the bottleneck when dumping the write cache to disk. In the VAST Data architecture, the flash backend is so fast that the front-end is the bottleneck. That front-end is 8x 50GbE per server chassis, so pretty fast!
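Some quick arithmetic on those numbers, as a rough sketch only. I am assuming the smallest 1PB system with a single server chassis, decimal units, zero protocol overhead, and (unrealistically) that nothing is migrated from SCM to QLC while the writes come in. The function names are my own.

```python
def front_end_gbytes_per_s(ports, gbit_per_port):
    """Aggregate front-end line rate in GB/s (8 bits per byte, no overhead)."""
    return ports * gbit_per_port / 8


def seconds_to_fill_buffer(total_capacity_tb, buffer_fraction, ingest_gbytes_per_s):
    """Time to fill the SCM write buffer if nothing is destaged meanwhile."""
    buffer_gb = total_capacity_tb * 1000 * buffer_fraction
    return buffer_gb / ingest_gbytes_per_s


line_rate = front_end_gbytes_per_s(ports=8, gbit_per_port=50)
print(line_rate, "GB/s front-end line rate per chassis")
print(seconds_to_fill_buffer(1000, 0.01, line_rate), "seconds to fill 1% of 1PB")
```

Under these assumptions the 1% buffer is about 10 TB, and it could absorb a few minutes of writes at full line rate even with no destaging at all.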
It is also possible that the CPU becomes the bottleneck at some point. The fix is to spawn more containers. The 4 containers in a single chassis have 80 cores available, so there’s quite a bit of compute power on hand already. Should CPU load become a problem though, the processes in the system will prioritize migration of data from SCM to flash over ingress of new writes.
With the VAST Data system, you can adopt a hyperscaler approach when replacing failed drives. Failed drives result in reduced capacity, but redundancy is automatically repaired. You will be replacing drives from a capacity perspective, not from a redundancy perspective. VAST Data is not unique in this; other systems do the same. But it sure is a good approach from an operational perspective.
While it is true that an SSD as a whole will fail less often than an HDD, partial drive failures are more common. So how does VAST Data handle partial drive failures, with these massive super erasure blocks spread across 150-500+ drives? That depends mostly on the SSD itself.
Most SSDs shut down entirely on a partial failure, since they have a contract with the application to present a certain size. If the SSD can support partial failures, VAST Data will handle this partial failure, lose some capacity and rebuild the lost capacity on a different drive. We’ve seen this at a previous SNIA presentation; for example, Open-Channel SSDs can work with this concept.
My thoughts on VAST Data
VAST Data is approaching the storage market in a radically new manner, using technology that wasn’t available or affordable before 2018 (NVMe-oF, QLC flash and Storage Class Memory). It squeezes the most capacity out of SSDs with, for example, very wide striping and innovative data reduction technologies. Everything is aimed at reducing overhead and SSD requirements, and thus enabling VAST Data to buy the cheapest possible SSDs on the market.
I don’t think we’ll ever completely say goodbye to HDDs. At least not within the next 10 years. Offline tape is still around, and there will always be some use cases that favor bulk, lowest cost media. That said: that’s not the majority of IT infrastructure.
I am very intrigued by VAST Data’s presentation and product. They did an incredible job of explaining their product, and I look forward to seeing them grow and evolve.
Check out VAST Data’s presentation over here. There are also posts from Chris and a podcast from Enrico, which I can highly recommend you read or listen to.
Disclaimer: I wouldn’t have been able to attend Storage Field Day 18 without GestaltIT picking up the tab for the flights, hotel and various other expenses like food. I was however not compensated for my time and there is no requirement to blog or tweet about any of the presentations. Everything I post is of my own accord and because I like what I see and hear.