VMAX All Flash: Enterprise reliability and SRDF at <1ms latency

Storage Field Day 14 VMAX and XtremIO X2

Back in October we visited Dell EMC for a few Storage Field Day 14 presentations. Walking into the new EBC building, we bumped into two racks: one with a VMAX All Flash system and another with an XtremIO X2. Let’s kick off the Storage Field Day 14 report with the VMAX All Flash. There’s still a lot of thought going into this enterprise-class storage array…

First, some general numbers about the VMAX systems. With all the marketing departments focusing on hyper-converged infrastructure (HCI), you’d almost forget that there are still a lot of companies and critical services running on VMAX back-end storage. 94% of Fortune 500 companies choose VMAX for its reliability: an actual uptime of six nines (99.9999%). Vince Westin joked that you’d come close to a true zombie apocalypse if all VMAXes in the world fell over at the same time. No more ATMs, no more 911 calls, etc…

VMAX All Flash Family

The VMAX All Flash family continues on this path of reliability and offers two systems: the VMAX250F and the VMAX950F. The former is the smaller system, at 1 million IOPS and 1 PBe; the latter is the bigger one, at 6.7 million IOPS and 4 PBe. Revenue and capacity-wise, both systems are equals, each accounting for roughly 50% of VMAX All Flash revenue.

VMAX All Flash architecture and CPU cores

A VMAX system is composed of bricks, which is basically a director plus disks. You can scale a VMAX either up (by adding disks to the existing bricks) or out (by adding additional bricks). Metadata in the VMAX is spread across all the bricks, so scaling out results in a pretty linear performance increase.

The directors house all the CPU power in the VMAX. With varying configurations of flash disks and services, it’s difficult to predict which component needs the most CPU power. Therefore, CPU cores are assigned to one of three pools:

  • Front-end core pool
  • Back-end core pool
  • Infrastructure Management (IM) and Enginuity Data Services (EDS) pool

VMAX All Flash Family CPU pools

CPU cores are assigned somewhat statically to these pools: a couple of cores can be reassigned, but it’s not fully dynamic yet. This is done on purpose, to avoid a “runaway train” scenario. No doubt Dell EMC tech support can tweak this a bit further if you’ve got a special use case…
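To make that pool model concrete, here’s a minimal Python sketch of the idea. The pool names come from the presentation; the core counts and the reassignment cap are made-up illustration values, not anything resembling HYPERMAX internals.

```python
# Minimal sketch of the static core-pool idea; numbers are illustrative.
POOLS = {
    "front_end": 10,   # host-facing I/O handling
    "back_end": 10,    # drive-facing I/O handling
    "im_eds": 4,       # Infrastructure Management + Enginuity Data Services
}

MAX_REASSIGN = 2  # only a couple of cores may move, by design


def reassign_cores(pools: dict, src: str, dst: str, count: int) -> None:
    """Move cores between pools, capped to avoid a 'runaway train'
    where one busy pool starves the others."""
    if count > MAX_REASSIGN:
        raise ValueError(f"at most {MAX_REASSIGN} cores may be reassigned")
    if pools[src] - count <= 0:
        raise ValueError(f"pool {src!r} would be left without cores")
    pools[src] -= count
    pools[dst] += count


reassign_cores(POOLS, "back_end", "front_end", 2)
print(POOLS)  # {'front_end': 12, 'back_end': 8, 'im_eds': 4}
```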

VMAX HYPERMAX (Enginuity) OS upgrades are truly non-disruptive: all directors stage the new code simultaneously, then flip over to it in under 10 seconds. There’s no rolling outage and no failover/failback. It goes even further: no ports go down, so there’s no RSCN (Registered State Change Notification) traveling over your SAN. No host will even know that a code upgrade took place. And if a host does have a problem with the new code, the VMAX All Flash systems can downgrade in the same non-disruptive manner.
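Here’s how I picture that stage-then-flip model: a conceptual Python sketch in which the director count, version strings, and timing are illustrative, not Dell EMC’s actual implementation.

```python
# Conceptual stage-then-flip upgrade model; not HYPERMAX code.
import time


class Director:
    def __init__(self, name: str, code: str):
        self.name, self.active_code, self.staged_code = name, code, None

    def stage(self, new_code: str) -> None:
        # Staging happens in the background; ports stay up, I/O keeps flowing.
        self.staged_code = new_code

    def flip(self) -> None:
        # The cutover is near-instant because all the work was done during
        # staging. The old code stays staged, so a non-disruptive downgrade
        # is the same swap in reverse.
        self.active_code, self.staged_code = self.staged_code, self.active_code


directors = [Director(f"dir-{i}", "5977.691") for i in range(4)]

for d in directors:   # staging can take as long as it needs
    d.stage("5977.952")

start = time.monotonic()
for d in directors:   # all directors flip together
    d.flip()
assert time.monotonic() - start < 10  # the <10 s window from the talk
```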

Data-at-Rest-Encryption and Compression

Encryption-wise, Data-at-Rest Encryption (D@RE) is always on, since there’s no performance impact. Keys are managed on a per-drive basis and shredded once a drive is removed. This seems good practice: who wants to run the risk of leaking confidential data during a drive swap when there’s no downside to encrypting everything?
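As a toy illustration of the per-drive key model (nothing like the real D@RE key manager, and the drive serial is made up), consider:

```python
# Toy per-drive key model: each drive gets a unique key, and pulling the
# drive shreds the key, leaving only unreadable ciphertext on the flash.
import secrets


class KeyManager:
    def __init__(self):
        self._keys: dict[str, bytes] = {}

    def add_drive(self, serial: str) -> None:
        # One unique 256-bit data-encryption key per physical drive.
        self._keys[serial] = secrets.token_bytes(32)

    def key_for(self, serial: str) -> bytes:
        return self._keys[serial]

    def remove_drive(self, serial: str) -> None:
        # Shredding the key makes the drive's contents permanently
        # undecryptable, even if the drive leaves the building.
        del self._keys[serial]


km = KeyManager()
km.add_drive("S3X0NB0K123456")                # hypothetical serial
km.remove_drive("S3X0NB0K123456")             # drive swap: key gone
```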

The VMAX All Flash systems employ compression to get you double the effective capacity out of the raw solid-state disks. They cheat a little bit while doing this, by not compressing the hottest 20% of capacity, which accounts for more than 80% of IOPS. Most systems don’t run more than 60–70% full anyway, and this keeps your hottest data as fast as possible. Only if the array ever fills up to 100% is all the data compressed, and you’ll lose a bit of performance.

VMAX All Flash Adaptive Compression
The performance of compressed I/O is almost identical to that of uncompressed I/O; the difference only becomes apparent once the system fills up.
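A back-of-the-envelope sketch of such a “skip the hottest 20%” policy might look like this in Python. The heat metric, extent granularity, and I/O counts here are simplifications of whatever HYPERMAX actually tracks:

```python
# Sketch of a "leave the hottest 20% uncompressed" selection policy.
def pick_compression_candidates(extents: dict[str, int]) -> set[str]:
    """Given {extent_id: io_count}, return the extents to compress:
    everything except the hottest 20% (by extent count)."""
    ranked = sorted(extents, key=extents.get, reverse=True)
    hot_count = max(1, len(ranked) // 5)  # hottest 20% stay uncompressed
    return set(ranked[hot_count:])


extents = {"e1": 9000, "e2": 7500, "e3": 300, "e4": 120, "e5": 80,
           "e6": 60, "e7": 40, "e8": 20, "e9": 10, "e10": 5}
print(pick_compression_candidates(extents))
# e1 and e2 (the hot, IOPS-heavy extents) are left uncompressed
```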

SRDF all the data!

If you’re somewhat familiar with VMAXes, you might have heard “SRDF is the reason why people buy a VMAX” before. It’s replication software on steroids, with a multitude of supported topologies.

SRDF/Metro is one of the latest SRDF flavors, introduced with the VMAX3 and HYPERMAX OS 5977.691.684. It allows both the primary LUN and the SRDF/Metro replica to be in an active, read/write accessible state. This enables hosts and clusters across metro distances (several tens of kilometers apart) to actively read from and write to a protected LUN. You no longer need VPLEX for true active-active replication across storage systems. In fact, it’s not even recommended anymore, because a single VMAX All Flash system can overload a VPLEX system. And we all like to have fewer devices to manage…
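Conceptually, active-active works because a write landing on either side is acknowledged to the host only once both arrays have it, so both LUNs stay read/write without diverging. A simplified Python model of that general idea (the R1/R2 names are SRDF terms, but this is not SRDF/Metro internals):

```python
# Simplified active-active metro pair: writes commit on both sides
# before the host gets an ack; reads are served locally.
class MetroLun:
    def __init__(self, name: str):
        self.name = name
        self.blocks: dict[int, bytes] = {}
        self.peer: "MetroLun | None" = None

    def write(self, lba: int, data: bytes) -> None:
        # Commit locally AND on the peer before acking the host; this
        # round trip is why metro distance (~tens of km) matters.
        self.blocks[lba] = data
        self.peer.blocks[lba] = data

    def read(self, lba: int) -> bytes:
        # Reads are served locally, with no round trip to the peer.
        return self.blocks[lba]


site_a, site_b = MetroLun("R1"), MetroLun("R2")
site_a.peer, site_b.peer = site_b, site_a

site_a.write(0, b"written at site A")
assert site_b.read(0) == b"written at site A"  # visible on both sides
```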

If you need to phase out one VMAX and replace it with another, you can use SRDF to non-disruptively migrate to the new box. This is called VMAX NDM (Non-Disruptive Migration), and it works together with your multipathing software to make the migration seamless for your servers.
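My rough mental model of an NDM-style cutover, leaning on host multipathing; the phase comments and path names below are mine, not Dell EMC’s terminology:

```python
# Rough mental model of a non-disruptive migration via multipathing.
def migrate(host_paths: list[str]) -> list[str]:
    # 1. The target array takes over the source LUN's identity, so host
    #    multipathing simply sees extra paths to the same device.
    host_paths = host_paths + ["target-fa-1", "target-fa-2"]
    # 2. Data syncs in the background while I/O continues on all paths.
    # 3. Cutover: source paths are withdrawn; multipathing stops using
    #    them and the host never notices a thing.
    return [p for p in host_paths if p.startswith("target")]


print(migrate(["source-fa-1", "source-fa-2"]))
# ['target-fa-1', 'target-fa-2'] — only the new array's paths remain
```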

My thoughts on the VMAX All Flash

With this Storage Field Day 14 presentation, I think it’s obvious that there’s still a lot of development and R&D going into the VMAX platform. And while development of new features is maybe a bit slower than at some of the startups, there’s a legacy to protect: a legacy of reliability and resilience. I’ve worked with plenty of other products (from Dell EMC and other vendors) that have a faster R&D track, but also a lot more disruptive bugs.

Obviously, a VMAX is a very expensive piece of hardware and software. But don’t forget that your Tier 1 applications are usually the reason you’re in business. If your Tier 1 apps go down, you lose your revenue stream, and that’s potentially a LOT more than the cost of one array.

After returning from Storage Field Day 14, I got my hands on a pair of VMAX250F arrays running SRDF/Metro. Usability wasn’t the main focus of this presentation, but there’s a big improvement compared to the VMAX2, which I was trained on. Not just for the All Flash systems, but for the VMAX3 in general. SRDF/Metro works like a charm, although expanding LUNs that are protected by SRDF/Metro is currently a pain in the behind: you have to destroy the SRDF group before you can expand Metro LUNs. Luckily, Dell EMC is working on new code to make that as seamless as it is for regular SRDF volumes…

Do check out the Dell EMC VMAX presentations at the Storage Field Day 14 website: there’s a ton of material I haven’t covered in this post. And what do you think: should D@RE be enabled by default in a storage system, with GDPR and other data protection laws around the corner?

Disclaimer: I wouldn’t have been able to attend Storage Field Day 14 without GestaltIT picking up the tab for the flights, hotel and various other expenses like food. I was however not compensated for my time and there is no requirement to blog or tweet about any of the presentations. Everything I post is of my own accord and because I like what I see and hear.