SNIA: Avoiding tail latency by failing IO operations on purpose

Consistency and predictability matter. You expect Google to answer your search query within a second. If it takes two seconds, that is slow but OK. Much longer and you will probably hit refresh because ‘it’s broken and maybe that will fix it’.

Many other examples could stand in for the scenario above: starting a Netflix movie, refreshing your Facebook timeline, or powering on an Azure VM. Or in your business: retrieving an MRI scan or patient data, compiling a 3D model, or listing all POs from last month.

Ensuring your service can meet this demand of predictability and consistency requires a multifaceted approach, both in hardware and procedures. You can have a modern hypervisor environment with fast hardware, but if you allow a substantially lower spec system in the cluster, performance will not be consistent. What happens when a virtual machine moves to the lower spec system and suddenly takes longer to finish a query?

Similarly, in storage, tiering across different disk types helps improve TCO. However, what happens when data trickles down to the slowest tier? Achieving that lower TCO comes with the tradeoff of less latency predictability.

These challenges are not new. If they impact user experience too much, you can usually work around them. For example, ensure your data is moved to a faster tier in time. If you have a lot of budget, maybe forgo the slowest & cheapest NL-SAS tier and stick to SAS & SSD. But what if the source of the latency inconsistency is something internal to a component, like a drive?

Tail latency and hyperscalers

There is average latency and then there are the little spikes in latency that ruin your day. These spikes are called tail latency and can be 2 to 10 times higher than the average latency, according to SNIA at Storage Field Day 12. They are not rare either: 1.5 to 2.2% of all IOs are negatively impacted by tail latency. If you consider that most application activities require a multitude of IOs to the back-end storage, this is not a problem you’ll only experience in a synthetic test.
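
To make that last point concrete, here is a small back-of-the-envelope sketch. It takes the 1.5 to 2.2% figure from above and assumes, purely for illustration, that every IO hits the tail independently of the others (real workloads will be more correlated than that). The function name and the IO counts are my own choices, not something SNIA presented.

```python
def p_request_hits_tail(p_io: float, n_ios: int) -> float:
    """Chance that at least one of n_ios IOs is a slow one.

    Assumption: each IO is slow with probability p_io, independently
    of all the others.
    """
    return 1.0 - (1.0 - p_io) ** n_ios


# 1.5% and 2.2% are the bounds quoted in the text; the IO counts are examples.
for n in (1, 10, 50, 100):
    low = p_request_hits_tail(0.015, n)
    high = p_request_hits_tail(0.022, n)
    print(f"{n:>3} IOs per request: {low:.1%} to {high:.1%} see at least one slow IO")
```

Under that independence assumption, a request that fans out into 100 IOs has a better than three in four chance of running into at least one slow IO, which is why a small percentage at the drive level turns into something users actually notice.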

Tail latency can be triggered by background tasks like garbage collection, scrubbing, remapping, cache flushes and self-testing. Or it could be caused by a bad part of a drive, i.e. media errors. So how do you design around this tail latency? Maybe you will throw more hardware at it and lower the average latency, so the spikes still fall within the acceptable range of latency.

However, this is more of a workaround than a real solution. If you are a hyperscaler, a very large storage consumer like Facebook, Amazon, Google or Microsoft Azure, this will be a costly workaround. It is estimated that 50% of all produced storage is shipped to those hyperscalers, so do the math if these guys need to overprovision. Fortunately, the hyperscalers have the power to request new features from the drive vendors. SNIA talked about several of these initiatives.

Fail fast, DePop and streams

One of these initiatives is adding a per-I/O tag that indicates whether a drive can fail fast and return an error if it takes too long to retrieve the data. If there’s a replica of the data somewhere else, it might just be faster to retrieve the data from there, instead of waiting for the slow drive to respond. The other side of the coin is a “try really hard” I/O tag, which indicates you’ve exhausted all other options and really need the data from this drive.
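
A minimal sketch of how a storage layer might use such tags, written as a toy simulation. Everything in it is hypothetical: `SimulatedDrive`, the `fail_fast` flag, the `deadline_ms` parameter and `FailFastError` are names I made up to illustrate the idea, not an actual drive command set or API.

```python
from dataclasses import dataclass


class FailFastError(Exception):
    """Raised when a drive gives up on an IO that was tagged fail-fast."""


@dataclass
class SimulatedDrive:
    name: str
    latency_ms: float          # time this drive would need to serve the read
    data: bytes = b"payload"

    def read(self, lba: int, fail_fast: bool, deadline_ms: float = 10.0) -> bytes:
        # Fail-fast IO: return an error instead of making the caller wait.
        if fail_fast and self.latency_ms > deadline_ms:
            raise FailFastError(f"{self.name}: LBA {lba} would exceed {deadline_ms} ms")
        # "Try really hard" IO (fail_fast=False): serve it, however long it takes.
        return self.data


def read_with_replicas(drives, lba):
    """Return (drive name, data) from the first replica that answers in time."""
    # First pass: every replica is allowed to fail fast.
    for drive in drives:
        try:
            return drive.name, drive.read(lba, fail_fast=True)
        except FailFastError:
            continue
    # All replicas bailed out, so ask the first one to try really hard.
    return drives[0].name, drives[0].read(lba, fail_fast=False)


slow = SimulatedDrive("slow-drive", latency_ms=250.0)
fast = SimulatedDrive("fast-drive", latency_ms=2.0)
print(read_with_replicas([slow, fast], lba=42))  # served by the replica
print(read_with_replicas([slow], lba=42))        # no replica left, so try hard
```

The point of the design is that the decision moves up the stack: the drive only reports that it will be slow, and the software that knows where the replicas live decides what to do about it.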

If a drive is consistently experiencing bad latency in a section of LBA ranges, your data center monitoring software might be tempted to mark the drive as failed. However, this is not beneficial to your TCO: maybe just one chip or one R/W head is bad. With another initiative called DePop you would be able to fail part of a drive and return the remaining healthy part of the drive back to service. Capacity would be reduced, but that is not much of a problem for modern web scale systems such as object stores.
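
As a rough mental model of DePop, think of a drive as a set of storage elements (heads or chips) that can be retired one at a time. The sketch below is only that, a model: the class, the element names and the capacities are invented for illustration and do not correspond to any real drive interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DepopDrive:
    # Hypothetical layout: storage element (head or chip) -> capacity in GB.
    element_capacity_gb: Dict[str, int]
    depopulated: Set[str] = field(default_factory=set)

    def depopulate(self, element: str) -> None:
        """Retire one bad element instead of failing the whole drive."""
        if element not in self.element_capacity_gb:
            raise KeyError(f"unknown element: {element}")
        self.depopulated.add(element)

    @property
    def usable_capacity_gb(self) -> int:
        """Capacity still offered by the healthy elements."""
        return sum(
            capacity
            for element, capacity in self.element_capacity_gb.items()
            if element not in self.depopulated
        )


# Example: an 8-head drive with 1000 GB behind each head (made-up numbers).
drive = DepopDrive({f"head{i}": 1000 for i in range(8)})
drive.depopulate("head3")
print(drive.usable_capacity_gb)  # 7000
```

In this example the drive loses one eighth of its capacity but stays in service, which is a much better deal than replacing the whole thing.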

SNIA also talked about streams, a concept that associates multiple blocks with an upper-level construct such as a file or object. You hardly ever delete a block by itself, but you do delete files or objects. If the drive were aware of this behavior, it would simplify garbage collection mechanisms, improve performance and increase the life of the drive due to less write amplification.
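
To see why that helps, here is a toy model of flash placement. It assumes a drive that erases in blocks of four pages and has to copy any still-valid pages elsewhere before it can erase a block. The block size, the function names and the write pattern are all assumptions for the sake of the example, and real garbage collection is far more sophisticated.

```python
from collections import defaultdict

PAGES_PER_ERASE_BLOCK = 4  # assumed, tiny on purpose


def place(writes, use_streams):
    """Lay out page writes into erase blocks; each page is tagged with its object."""
    if use_streams:
        # Stream-aware drive: pages of the same object are packed together.
        groups = defaultdict(list)
        for obj in writes:
            groups[obj].append(obj)
        ordered = [obj for pages in groups.values() for obj in pages]
    else:
        # Stream-unaware drive: pages land in arrival order.
        ordered = list(writes)
    return [
        ordered[i:i + PAGES_PER_ERASE_BLOCK]
        for i in range(0, len(ordered), PAGES_PER_ERASE_BLOCK)
    ]


def pages_to_relocate(blocks, deleted_obj):
    """Valid pages that must be copied to erase every block the deleted object touched."""
    return sum(
        sum(1 for obj in block if obj != deleted_obj)
        for block in blocks
        if deleted_obj in block
    )


# Four objects of four pages each, written interleaved.
writes = ["A", "B", "C", "D"] * 4
print(pages_to_relocate(place(writes, use_streams=False), "A"))  # 12
print(pages_to_relocate(place(writes, use_streams=True), "A"))   # 0
```

Without streams, deleting object A leaves every erase block three quarters full of other objects' data, all of which has to be rewritten before the space can be reclaimed. With streams, A sits in its own erase block and can simply be erased. Those extra rewrites are the write amplification.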

My thoughts on the SNIA presentation

When I started out in storage and was learning about the various protocols and products, SNIA was mentioned every other page. While they started out to solve interoperability issues in storage networks, SNIA is now 20 years old and they are not sitting idly by. SNIA is involved in a number of initiatives, ranging from storage management interoperability to flash and persistent memory, security, cloud and object drives.

We all know that the hyperscalers play according to different rules compared to the rest of the market. It’s enlightening to see this extends not only to the datacenter software and RAID levels, but also to the internal drive behavior. The protocol enhancements introduced in this manner will trickle down to the regular consumer market and find their way into software-defined products for the ‘normal’ enterprise and SMB markets.

My fellow Storage Field Day 12 delegates Chan and Dan also wrote posts on the SNIA presentation. While researching I also ran into this excellent post on latency benchmarking. And as always, I recommend you watch the videos from the SNIA presentation on the Tech Field Day website, which you can find here.

Disclaimer: I wouldn’t have been able to attend Storage Field Day 12 without GestaltIT picking up the tab for the flights, hotel and various other expenses like food. I was however not compensated for my time and there is no requirement to blog or tweet about any of the presentations. Everything I post is of my own accord and because I like what I see and hear.