In my previous post I briefly mentioned one of my cooler projects of 2020: closing down a datacenter and consolidating it into our other datacenters at Open Line. The complexity of this project was mostly on the network and “BU carve-out” side of IT, but it also featured a very cool Pure Storage FlashArray ActiveCluster component. In fact, using ActiveCluster technology ensured that we kept things very simple from a data migration and risk perspective. Let me explain why we did what we did, starting with a disclaimer.
Storage
I’ve previously written a how-to on upgrading your Isilon cluster’s firmware. The process hasn’t changed much over time: you upload a node firmware package to the cluster, install it, and then apply it to the nodes themselves. Time-wise, though, it’s not a quick process: each node reboots during the upgrade, and each cycle takes at least 10 minutes. Imagine a 100+ node cluster and running this process one node at a time…
Fortunately, OneFS 8.2 has recently introduced a nice new tool to make this node firmware upgrade a bit easier and much faster: isi_upgrade_helper. It figures out which nodes can upgrade simultaneously and spits out the relevant commands for you.
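The helper does that scheduling for you, but the underlying idea is easy to picture: split the nodes into batches that can safely reboot together, and run one upgrade cycle per batch instead of per node. Here is a rough Python sketch of that idea. The grouping rule (at most one node per disk pool per batch), the node-to-pool mapping and the 10-minute cycle time are my own assumptions for illustration; they are not the tool’s actual logic or output.

```python
from collections import defaultdict

# Hypothetical node -> disk pool mapping; on a real cluster you would get
# this from OneFS rather than hard-coding it.
node_pools = {1: "pool_a", 2: "pool_a", 3: "pool_a", 4: "pool_a",
              5: "pool_b", 6: "pool_b", 7: "pool_b", 8: "pool_b"}

def build_batches(node_pools):
    """Group nodes so that no two nodes from the same pool reboot together."""
    by_pool = defaultdict(list)
    for node, pool in node_pools.items():
        by_pool[pool].append(node)
    batches = []
    while any(by_pool.values()):
        # Take one pending node from every pool that still has nodes left.
        batches.append(sorted(nodes.pop() for nodes in by_pool.values() if nodes))
    return batches

batches = build_batches(node_pools)
minutes_per_cycle = 10  # assumed time per reboot/update cycle
print(f"One node at a time: {len(node_pools) * minutes_per_cycle} minutes")
print(f"Batched:            {len(batches) * minutes_per_cycle} minutes")
for i, batch in enumerate(batches, 1):
    print(f"  batch {i}: nodes {batch}")
```

With two pools of four nodes each, eight sequential cycles collapse into four batches. Scale that to a 100+ node cluster, where one node at a time means well over 16 hours of back-to-back reboots, and the appeal is obvious.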
Dell EMC uses Secure Remote Services (SRS, formerly known as ESRS) to enhance the tech support experience for their products. There are two sides to this support: connect home and connect in. Connect home is your device dialing back home to Dell EMC to report various things such as errors, automatic support uploads, etc. If any of this results in a Service Request at Dell EMC, an engineer can then use SRS to dial in (connect in) and have a look at the faulty system. The latter saves you from having to host a Webex session.
Dell EMC likes to have all Dell EMC systems connected to SRS, for two reasons. First of all, it reduces the time engineers spend troubleshooting an issue. If an engineer can dial in himself, without having to negotiate a Webex session with the customer, that means more SRs per engineer per day and lower support costs for Dell EMC. Secondly, it results in faster incident resolution, and thus a happier customer. The support engineer can look up the state of a defective drive independently and order new parts while the customer is sleeping. Win-win!
I recently installed a new Data Domain DD6300. Part of the installation procedure is to run a DD OS upgrade to bring the system up to the target DD OS release. You can find the target releases over here. While running the upgrade to 6.1.2.20, the Data Domain correctly rebooted as part of the upgrade. After logging back in, the system GUI kept throwing an “Upgrade in progress” popup, blocking everything else in view. There is also an alert that reads “DD OS Upgrade is in progress. The system will not be available for backup and restore operations. The alert will be cleared after the upgrade operation is complete.” Which, I guess, is NEVER when the upgrade is hung…
I’ve installed quite a few new Isilon clusters in 2019. All of them are generation 6 clusters (H400, H500, A200), using the very cool 4-nodes-in-a-chassis hardware. Common to all these systems is a 1GbE management port next to the two 10GbE ports. While Isilon uses in-band management, we typically use those UTP ports for management traffic: SRS, HTTP, etc. We assign those interfaces to subnet0:pool0 and make it a static SmartConnect pool. This assigns one IP address to each interface; if you do it right, these should be sequential.
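To make that concrete, here is a tiny Python sketch of what a static pool assignment ends up looking like: the first node gets the first address in the pool’s range, the second node the next one, and so on. The subnet, starting address and node count are made-up example values, not output from an actual cluster.

```python
import ipaddress

# Example values only: pretend pool0 hands out addresses starting here.
first_ip = ipaddress.ip_address("192.168.10.11")
node_count = 4  # e.g. a single 4-node gen6 chassis

# A static pool assigns one address per interface, in node order.
for node in range(1, node_count + 1):
    print(f"node {node}  1GbE mgmt  {first_ip + (node - 1)}")
```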
A recent addition to my install procedure is to create DNS A-records for those management ports. This makes it a bit more human-friendly to connect your browser or SSH client to a specific node. In line with the Isilon naming convention, I followed the -# suffix format: if the cluster is called cluster01, node 1 is cluster01-1, node 2 is cluster01-2, and so on. However, it turns out this messes up your SyncIQ replication behavior!
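For reference, this is roughly what those records look like in a BIND-style zone file, generated with a small Python sketch that follows the -# convention. The cluster name, domain and addresses are made-up example values, reusing the sequential management IPs from above.

```python
import ipaddress

# Example values only: cluster name, domain and starting address are invented.
cluster = "cluster01"
domain = "example.local"
first_ip = ipaddress.ip_address("192.168.10.11")
node_count = 4

# Emit one A-record per node, named <cluster>-<node>.
for node in range(1, node_count + 1):
    fqdn = f"{cluster}-{node}.{domain}."
    print(f"{fqdn:<30} IN  A  {first_ip + (node - 1)}")
```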