Recently I’ve ran a project for a new EMC Symmetrix VMAX 10k installation. The install was a breeze and the migration of data to the system fairly straightforward. The customer saw some good performance improvements on the storage front and there is plenty of capacity in the system to cater for immediate growth. Yet when I opened the Unisphere for VMAX interface and browsed to the performance tab, my heart skipped a beat. What are those red queue depth utilization bars? We were seeing good response times, weren’t we? Were we at risk? How about scalability? Lets dig deeper and find out.
Several weeks ago we performed a resiliency test on a VMAX10k that was about to be put into production. The customer wanted confirmation that the array was wired up properly and would respond correctly in case there were any issues like a power or disk failure. This is fairly standard testing and makes sense: better to pull a power plug and see the array go down while there is no production running against it, right?
We pulled one power feed from each system bay and storage bay. Obviously: no problem for the VMAX, it dialed out, notified EMC it lost power on a couple of feeds and that’s it. Next up we yanked out a drive: I/O continued, the array dialed out to EMC that something was wrong with a drive, but… we didn’t see anything in Unisphere for VMAX!
Not all data is accessed equally. Some data is more popular than other data that may only be accessed infrequently. With the introduction of FAST VP in the CX4 & VNX series it is possible to create a single storage pool that has multiple different types of drives. The system chops your LUNs into slices and each slice is assigned a temperature based on the activity of that slice. Heavy accessed slices are hot, infrequently accessed slices are cold. FAST VP then moves the hottest slices to the fastest tier. Once that tier is full the remaining hot slices go to the second fastest tier, etc… This does absolute wonders to your TCO: your cold data is now stored on cheap NL-SAS disks instead of expensive SSDs and your end-users won’t know a thing. There’s one scenario which will get you in trouble though and that’s infrequent, heavy use of formerly cold data…
When troubleshooting performance in a CLARiiON or VNX storage array you’ll often see graphs that resemble something like this: write cache maxing out to 100% on one or even two storage processors. Once this occurs the array starts a process called forced flushing to flush writes to disk and create new space in the cache for incoming writes. This absolutely wrecks the performance of all applications using the array. With the MCx cache improvements made in the VNX2 series there should be a lot less forced flushes and a much improved performance.
A big change in the VNX2 hot spare policy compared to earlier VNX or CLARiiON models is the use of permanent hot spares. Whereas the earlier models would have dedicated, configured hot spares that would only be used during the drive failure, the VNX2 will use any eligible unused drive to spare to and NOT switch back to the original drive. I’ve written about this and other new VNX2 features but didn’t get to try it first hand yet. Until now: a drive died, yay! Continue reading to see how you can back-track which drive failed, which drive replaced it and how to move the drive back to the original physical location (should you want to).
In September 2013 EMC announced the new generation VNX with MCx technology (or VNX2). The main advantage of the new generation is a massive performance increase: with MCx technology the VNX2 can effectively use all the CPU cores available in the storage processors. Apart from a vast performance increase there’s also a boatload of new features: deduplication, active-active LUNs, smaller (256MB) chunks for FAST VP, persistent hotspares, etc. Read more about that in my previous post.
It took a while before I could get my hands on an actual VNX2 in the field. So when we needed two new VNX2 systems for a project, guess which resources I claimed to install them. Me, myself and I! Only to have a small heart attack upon unboxing the first VNX5400: someone stole my standby power supplies (SPS)!
Over the last couple of months I’ve been busy phasing out an old EMC CLARiiON CX3 system and migrating all the data to either newer VNX and/or Isilon systems. The hard work paid off: the CX3 is now empty and we can start to decommission it. But before we ship it back to EMC we need to employ a type of CLARiiON data erasure to make sure data doesn’t fall into wrong hands.
LUNs on a storage system represent the blobs of storage that are allocated to a server. A (VNX) storage admin creates a LUN on a RAID Group or Storage Pool and assigns it to a server. The server admin discovers this LUN, formats it, mounts it (or assigns a drive letter) and starts to use it. Storage 101. But there’s more to it than just carving out LUNs from a big pile of Terabytes. One important aspect is LUN ownership: which storage processor will process the I/O for that specific LUN?!
Anyone working in IT knows that there are usually enormous amounts of whitepapers available to help you install, configure and run a new system or software suite. The fun more than doubles when the whitepapers start conflicting themselves. But even when they’re crystal clear, sometimes you run into a different problem: budget! With all planning and designing done, sometimes the budget or the purchased equipment does not allow you to follow ALL best practices to the letter, or at least make it a bit more challenging. In this example there’s the need to span a storage pool across DAE 0_0.
Recently I ran into an environment with a couple of VNX5700 systems that were attached to the front-end SAN switches with only two ports per storage processor. The customer was complaining: performance was OK most of the time but at some times during the day the performance was noticeably lower. Analysis revealed that the back-end was coping well with the workload (30-50% load on the disks and storage processors). The front-end ports were a bit (over)loaded and spewing QFULL errors. Time to cable in some extra ports and to rebalance the existing hosts over the new storage paths!