When troubleshooting performance in a CLARiiON or VNX storage array you'll often see graphs that resemble something like this: write cache maxing out at 100% on one or even both storage processors. Once this occurs the array starts a process called forced flushing to flush writes to disk and free up space in the cache for incoming writes. This absolutely wrecks the performance of all applications using the array. With the MCx cache improvements made in the VNX2 series there should be far fewer forced flushes and much improved performance.
Cache vs Back-end
Your storage array stores data in two locations: temporarily in cache and permanently/persistently on the back-end, i.e. the disks. In an ideal world you would be constantly working from cache, simply because it's super fast. However, since cache is also very expensive, there's a limited amount of it in an array. Your cache functions as a buffer: if an application dumps a massive amount of writes on your array, they end up in write cache and are acknowledged back to the server. In the background the array starts flushing these writes to the persistent storage, the disks. This way the application doesn't have to wait for the slow disks, which is good! Think of your write cache as a big bucket with a small hole in the bottom: you can pour in a lot of water (or I/Os) quickly and it will trickle out to the disks at a slower rate. But what if your workload isn't a short burst of writes but a sustained write workload? Even worse: a workload that the little hole, i.e. the disks, can't handle? Eventually the cache/bucket is full and the application will have to wait before it can send more I/O…
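To make the bucket analogy a bit more concrete, here's a minimal Python sketch (purely illustrative, with made-up capacity and drain numbers) of a write cache as a leaky bucket: a short burst fits in the cache and drains away afterwards, while a sustained workload that exceeds the drain rate eventually fills the bucket and forces writes to wait.

```python
# Illustrative leaky-bucket model of a write cache (all numbers are made up).
CACHE_PAGES = 10_000        # total write cache capacity, in pages
DRAIN_RATE = 500            # pages/sec the back-end disks can absorb

def simulate(incoming_rates):
    """Print the cache fill level for a list of per-second incoming write rates."""
    dirty = 0
    for second, incoming in enumerate(incoming_rates):
        accepted = min(incoming, CACHE_PAGES - dirty)    # writes that fit in cache
        stalled = incoming - accepted                    # writes that have to wait
        dirty = max(0, dirty + accepted - DRAIN_RATE)    # background flush to disk
        print(f"t={second:3d}s fill={dirty / CACHE_PAGES:6.1%} stalled={stalled}")

# A short burst is absorbed just fine; a sustained 800 pages/sec workload
# (> DRAIN_RATE) eventually fills the bucket and writes start to stall.
simulate([2000] * 5 + [100] * 10)   # burst, then quiet
simulate([800] * 60)                # sustained overload
```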
Cache is a shared component in a VNX. You're most likely attaching more than one server to your VNX. So all your applications are throwing small cups of water into the big cache bucket and life's good. All of a sudden one application opens the proverbial fire hose and quickly fills the bucket to the brim. Now all the other applications have to wait until there's room in the cache to throw in their cups of writes. And chances are that once there's room in the cache, the fire hose application will start writing again and quickly fill it back up.
CLARiiON and VNX1 cache behaviour
This is exactly what happened in the CLARiiON and VNX1 arrays. Once the write cache fills up, the array starts forced flushing: all host I/O is suspended and the dirty pages in write cache (the data that hasn't been flushed to disk yet) are flushed until the cache reaches the low watermark. It's pretty obvious that this is absolutely detrimental to host response time, and the process doesn't discriminate either: ALL host I/O is suspended, not just the stream/server that caused the cache to hit 100% full.
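As a rough sketch of that legacy behaviour (my own illustration with invented watermark values, not actual FLARE code), the forced flush cycle looks something like this:

```python
# Illustrative sketch of watermark-driven forced flushing (invented values, not FLARE code).
LOW_WATERMARK = 0.60   # fraction of write cache still dirty when a forced flush stops
FLUSH_STEP = 0.05      # fraction of cache flushed to disk per iteration

def accept_write(cache_fill: float) -> float:
    """Accept one host write; if the cache is 100% full, force-flush first."""
    if cache_fill >= 1.0:
        # Forced flush: ALL host I/O is suspended, not just the offending stream,
        # until the dirty pages have drained down to the low watermark.
        while cache_fill > LOW_WATERMARK:
            cache_fill -= FLUSH_STEP
        print("forced flush complete, host I/O resumes")
    return cache_fill + 0.01   # the write lands in cache as a dirty page
```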
You'd be surprised how often this occurs. A customer has a number of high performance applications housed on a pool or RAID group with fast drives. No problem there. They also have a development environment that is infrequently used and housed on nice and cheap NL-SAS drives. The problem starts when, for example, someone restores a development database and that cheap NL-SAS RAID group can no longer handle the incoming I/O: it fills up the cache and your high performance application (on disks that may even be fairly idle!) has to wait for space in the cache just like everything else…
VNX2 MCx cache: pre-cleaning age and write throttling
From an administrative perspective the cache in a VNX2 changes a bit. You no longer need to configure the balance between read and write cache, and there are no longer low and high watermarks to set. The VNX2 MCx cache algorithms use dynamic watermarks and constantly evaluate the effectiveness of the cache: it is used to buffer short bursts of I/O, while sustained workloads that overwhelm a RAID group are prevented from using up all the system resources and cache.
MCx does this by monitoring the rate of incoming I/O and the rate of successful page flushes from cache to the RAID group. These two metrics basically tell the algorithm whether or not a RAID group is saturated and struggling to handle the incoming workload. MCx also checks the number of free cache pages and the effectiveness of the workload coalescing (which turns random I/O into sequential I/O), and turns this total of four metrics into a pre-cleaning age. It does this for each individual RAID group in the array, be it a regular RAID group or a private RAID group in a storage pool.
If a cache page is younger than this pre-cleaning age it can stay in cache. If it has reached the pre-cleaning age limit it will be flushed to disk. And if the page is older than the pre-cleaning age, MCx will start flushing to disk faster for that RAID group or – if the RAID group is already at its maximum flush rate – start to throttle incoming writes.
With write throttling the VNX2 will delay the acknowledgement sent back to the host. This buys MCx time to learn about the incoming workload and adjust the pre-cleaning age accordingly (basically decrease it). Throttling continues until the incoming write workload matches the capability of the underlying RAID group. This makes sure that the cache is not overrun by a single application and that your other applications aren’t adversely impacted by it.
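Putting the pieces together, a simplified and entirely hypothetical sketch of the per-RAID-group decision could look like the Python below. The metric weighting, the 10,000-page cache size and all function names are my own assumptions for illustration; EMC doesn't publish the actual MCx formula.

```python
# Hypothetical sketch of the MCx pre-cleaning-age decision, per RAID group.
# Metric names and weighting are illustrative only, not EMC's implementation.
from dataclasses import dataclass

@dataclass
class RaidGroupStats:
    incoming_io_rate: float     # writes/sec arriving for this RAID group
    flush_success_rate: float   # pages/sec successfully flushed to this RAID group
    free_cache_pages: int       # free pages left in the shared write cache
    coalescing_ratio: float     # how well random writes coalesce into sequential ones
    max_flush_rate: float       # the most this RAID group's disks can absorb

def pre_cleaning_age(stats: RaidGroupStats) -> float:
    """Turn the four monitored metrics into a per-RAID-group page age limit (seconds)."""
    saturation = max(stats.incoming_io_rate, 1.0) / max(stats.flush_success_rate, 1.0)
    headroom = stats.free_cache_pages / 10_000        # assume a 10k page write cache
    # A saturated RAID group and a fuller cache mean pages must leave cache sooner.
    return max(0.1, 10.0 * headroom * stats.coalescing_ratio / saturation)

def handle_dirty_page(page_age: float, stats: RaidGroupStats) -> str:
    limit = pre_cleaning_age(stats)
    if page_age < limit:
        return "keep in cache"
    if page_age == limit:                              # page has reached the age limit
        return "flush to disk"
    # Older than the limit: flush faster, or throttle if the disks are already maxed out.
    if stats.flush_success_rate < stats.max_flush_rate:
        return "increase flush rate for this RAID group"
    return "throttle: delay the write acknowledgement to the host"
```

The net effect is what the graphs show: the cache usage for the overloaded RAID group levels off instead of dragging the whole array to 100% full.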
In the left graph you can see a workload overloading the write cache in a VNX1: SPB is pretty much constantly forced flushing, so I/O response times will be bad. The VNX2 with the MCx cache optimizations, on the other hand, also starts to head towards 100% cache full but quickly begins throttling the incoming writes. MCx adjusts and learns the appropriate pre-cleaning age for the I/O going to that RAID group and the cache usage levels off. This leaves ample room in the cache for other applications to buffer incoming writes.
Another benefit of the improved cache management is that MCx seems to squeeze a bit more I/O out of the same hardware. Whereas FLARE managed roughly 513 IOps from a 4+1 RAID 5 group with 300GB 10k drives, MCx averages around 900 IOps: an improvement of roughly 75%. The workload itself is identical: IOmeter with 1 worker, 16 outstanding I/Os per target, 100% random 8KB unaligned writes.
In summary…
The improvements introduced in MCx go further than just the raised ceiling of 1 million IOps for the VNX platform: "normal" workloads also benefit from things like the MCx cache optimization, which no longer allows a single workload to hog all the system resources. This will make performance management a lot easier, especially on systems that are shared by a whole lot of different applications, servers or tenants.
If you would like to learn more about the MCx caching algorithms (since I’ve completely skipped the read cache behaviour in this post), go and find the “MCx Multicore Everything” whitepaper on support.emc.com. Or if you have any questions, leave them below.