In my previous post I briefly mentioned one of my cooler projects of 2020: closing down a datacenter and consolidating it into our other datacenters at Open Line. The complexity of this project was mostly on the network and “BU carve-out” side of IT, but it also featured a very cool Pure Storage FlashArray ActiveCluster component. In fact, using ActiveCluster technology ensured that we kept it all very simple from a data migration and risk perspective. Let me explain why we did what we did, starting with a disclaimer.
Not really an officially supported ActiveCluster setup
First of all, let me say that this is not an officially supported ActiveCluster setup, nor is it a fit for every migration scenario. The distance between the datacenter sites was within the official latency requirements for ActiveCluster, but we didn’t have the full bandwidth that Pure Storage recommends for ActiveCluster. With some RPQ approvals from our very friendly Pure SE team we were allowed to proceed, but we still triggered some benign ActiveCluster latency alerts once the links filled up. So if this blog post gives you a EUREKA! moment, please first consult your friendly Pure Storage SE to discuss feasibility.
200km between datacenters and a tight migration window
In our case, we were closing down one datacenter and moving all workloads to another datacenter approximately 200km away. Between those datacenters we had a dark fiber running 10Gbit Ethernet, carrying all kinds of services and not dedicated to ActiveCluster. We “only” had 3-5Gbit to give to ActiveCluster.
Why we came up with a Pure Storage FlashArray ActiveCluster solution had everything to do with the very short downtime window: 4 hours. In those 4 hours we had to do everything: shut down all applications and services, synchronize the last data, switch over to the 2nd datacenter, modify virtual machines to use new VLANs (we don’t do Layer 2 stretches), change the network routing, then spin everything up again and run tests. We automated and pre-configured as much as possible, and combined as many steps as we dared without losing overview and control. And we still ended up with basically <10 minutes to do our VM and data migrations. You can’t afford a VMware snapshot consolidation in that window…
The secondary reason was fallback. If we failed the acceptance tests, we would have to roll back the migration without data loss, preferably within that same 4-hour window, but definitely not much longer than that. We couldn’t afford migration software re-indexing all the changes and replicating them back. With ActiveCluster keeping both volumes active and in sync, there was no sync-back time: a potential rollback would just be a reverse migration that could be executed immediately.
Start with async replication…
So how did we set this up, after discussing the case with Pure Storage and getting the “great plan, perform at your own risk!”-green light?
In our destination datacenter we already had Pure Storage PaaS (Pure as a Service) storage operational, so we just had to get a 2nd, small FlashArray//X for the source datacenter. Once that system was installed (my first FA install, with some shadowing from Remko), we asked Pure support to set up ActiveCluster between the locations. With 200km between the datacenters, mirrored write latency between the systems was fairly high. This had absolutely nothing to do with the installed Pure Storage systems, and everything to do with physics: as the distance increases, light needs more time to travel across the link and latency increases. This is why, traditionally, 200km is sort of the upper limit for synchronous replication.
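That physics floor is easy to estimate. Here’s a back-of-the-envelope sketch of the minimum round trip a mirrored write has to wait for over 200km of fiber; the refractive index is an assumed typical value for single-mode fiber, and real-world latency will be higher once switching, protocol and queueing overhead are added:

```python
# Rough physics floor on mirrored-write latency over distance.
# Illustrative only: real latency adds switching, protocol and queueing overhead.

SPEED_OF_LIGHT_KM_S = 299_792   # in vacuum
FIBER_REFRACTIVE_INDEX = 1.47   # assumed typical single-mode fiber

def min_round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip time in ms over a fiber run of the given length."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_S / FIBER_REFRACTIVE_INDEX  # ~204,000 km/s
    return 2 * distance_km / speed_in_fiber * 1000

# A synchronously mirrored write waits for at least one round trip to the
# remote array before it can be acknowledged to the host.
print(f"{min_round_trip_ms(200):.2f} ms")  # ~1.96 ms added per mirrored write
```

Roughly 2ms on top of every write before a single switch or array has done any work, which is why sub-millisecond all-flash latency figures go out the window on a stretched setup like this.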
In our case, latency was within the limits that Pure Storage imposes on an ActiveCluster setup, but not something I’d want to run tier 1 production on for an extended period of time. So we created the new volumes and set up Protection Groups based on asynchronous replication. This allowed us to move all virtual machines onto the source FlashArray//X without performance impact, while still replicating all the initial data in the background to the destination FlashArray//X.
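Seeding in the background well ahead of the window matters because the initial sync simply takes time on a constrained link. A rough calculator for the seeding time, where the 40 TB dataset size and 80% link efficiency are hypothetical illustration values and not our actual figures:

```python
# Back-of-the-envelope: time to replicate an initial dataset over a shared link.
# Dataset size and efficiency factor are hypothetical, not actual migration figures.

def seed_hours(dataset_tb: float, link_gbit: float, efficiency: float = 0.8) -> float:
    """Hours to replicate dataset_tb terabytes at link_gbit Gbit/s."""
    bits = dataset_tb * 1e12 * 8            # TB -> bits (decimal terabytes)
    usable = link_gbit * 1e9 * efficiency   # effective bits per second
    return bits / usable / 3600

# e.g. a hypothetical 40 TB dataset over the 3 Gbit/s we could spare:
print(f"{seed_hours(40, 3):.1f} h")  # → 37.0 h
```

At well over a day for the initial copy, that seeding clearly has to happen asynchronously and long before the 4-hour window; only the last deltas belong inside it.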
… then migrate using ActiveCluster
We upgraded the asynchronous replication (pgroup) to ActiveCluster replication (pods) in the hour before the actual data migration. Pure Storage has a helpful KB article on it which I recommend you read, but it basically involves changing the connection between the arrays to sync-replication and then moving the volumes out of the pgroups and into the pod. There’s a short (<1 minute) resync and shortly thereafter everything is in active-active synchronous mode. As we did this when the app teams were already shutting down servers, the performance impact on write latency was acceptable at this point. It also meant that we could add the VMFS volumes to the ESXi hosts in the secondary datacenter even before the application teams started shutting down the virtual machines.
We then had to wait for the application teams to finish the VM and container shutdowns and for the network team to give us the thumbs up that everything was ready to go. At that point, Gabrie kicked off his PowerShell script to move all virtual machines onto the ESXi hosts in the secondary datacenter and change their VM network settings to the new VLAN, and everything was ready to boot again. This process took less than one minute, after which the application teams could start the virtual machines again.
Fortunately, we completed the acceptance test and didn’t need to fail back. With the migration complete, I changed the ActiveCluster setup to point to a different PaaS array in a nearer datacenter. And after a short initial synchronization, we were fully protected again. Job done!
My thoughts on migrating with ActiveCluster
ActiveCluster definitely made our migration job a lot easier and faster. This allowed us to focus our time and energy on different aspects of the migration. While this solution might not fit all migration use cases, it’s a method I’d consider in the more critical migration paths where we are in a time squeeze.