Isilon Tech Refresh – Replacing old NL400 Isilon nodes with NL410s

Last month I performed an Isilon tech refresh of two clusters running NL400 nodes. In both clusters, the old 36TB NL400 nodes were replaced with 72TB NL410 nodes with some SSD capacity. The first step in the whole process was the replacement of the Infiniband switches. Since the clusters were fairly old, a OneFS upgrade was also on the list before the clusters could recognize the NL410 nodes. Dell EMC has extensive documentation on the whole OneFS upgrade process: check the support website, because there are a lot of version dependencies. Finally, everything was prepared and I could begin with the actual Isilon tech refresh: getting the new Isilon nodes up and running, moving the data and removing the old nodes.
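If you want a quick look at the current OneFS release and overall cluster health from the CLI before planning the upgrade path, something like this does the trick (a minimal sketch; output formatting differs a bit per OneFS version):

    # Show the OneFS release the cluster is currently running
    isi version
    # Quick health overview of all nodes before you start
    isi status -q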

Racking & stacking and adding the NL410 nodes to the cluster

The first part of the Isilon tech refresh is racking and stacking the new NL410 nodes. For fun and giggles, you can read this old post about the initial racking and stacking of the old NL400 nodes we’ll be phasing out.

Not much has changed in this whole process:

  • Install the rails in the rack. It’s all tool-less with thumb screws. Don’t forget to install the cage nut that will hold the node in place.
  • Rack the node. The node itself is not too heavy since the drives aren’t installed yet, but it’s still advisable to have someone assist you: sliding the node into the rails is a bit of a “whack-a-mole” task. You will need a screwdriver to install the screws that hold the node in the rack.
  • Install the drives. There’s no specific position or order in which you need to install the drives, so plug away. A big improvement over the old NL400 nodes is that the drives are now packaged in hard plastic instead of foam. Hence no stray bits of foam in the connectors, and no need to blow into the connectors anymore (Nintendo cartridge style). Finally, install the front panel: press on the left part with the little PCB first, then press it on gently but firmly.
  • Cable everything up: Infiniband, front-end network and power. Once you’re happy with everything, press the small black power button on the back of the system to boot the node.
  • Once the front panel comes alive (and assuming your OneFS join method allows it), you should see a prompt to join the existing Isilon cluster. Accept and wait: depending on version differences between the old and new nodes, it could take a while.

Repeat the above steps for all other new nodes. If you don’t have enough rack space, install at least 3 nodes of the new type: this is the minimum number of nodes you need to form a node pool. Then install additional new nodes as you phase out old nodes and make room.
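To double-check from the CLI that the new nodes have joined and landed in their own node pool, a quick look at the status and storagepool commands should suffice (a small sketch; the exact output differs per OneFS version):

    # List all nodes, their health and capacity
    isi status -q
    # Show which node pools exist and which nodes belong to them
    isi storagepool nodepools list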

Reconfigure SmartPools

First, we’ll change a few SmartPools settings:

Isilon tech refresh - SmartPools settings

Navigate to the SmartPools settings for the default SmartPools policies. Enable “Use SSDs as L3 Cache by default for new pools”. Be aware this only works as a default setting for new pools: ideally you’d enable this setting before adding the nodes. If the Isilon has already generated a new NL410 node pool, you will have to enable L3 cache manually on that node pool:

Node pool L3 Cache
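For reference, the same L3 cache settings can be changed from the CLI. The pool name below is a made-up example (check what your NL410 pool is actually called), and the exact option names can vary between OneFS releases:

    # Make L3 cache the default for newly created node pools
    isi storagepool settings modify --ssd-l3-cache-default-enabled true
    # Enable L3 cache on an existing node pool (example pool name)
    isi storagepool nodepools modify n410_72tb-ssd_32gb --l3 true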

Next, we’ll change the File Pool policy. Unfortunately, these clusters do not have SmartPools licensed, so I could not move all the data from the old NL400 nodes to the NL410 nodes with a dedicated file pool policy. We can, however, change the default policy, which helps a little bit in funneling data to the new pool instead of the old one.

I changed the “Move to Storage Pool or Tier” and “Move Snapshots to Storage Pool or Tier” settings to the specific NL410 pool. Additionally, as a cluster-wide setting, the Data Access Pattern is set to “streaming” because the application IO pattern benefits from it (a low number of clients that need high bandwidth). Be aware that you might not be able to change these settings in your own environment, so pick and choose the applicable settings.
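For those who prefer the CLI, the default file pool policy can be adjusted with the filepool commands. Again, the pool name is just an example and option names may differ slightly per OneFS release:

    # Point the default data and snapshot targets at the new NL410 pool
    isi filepool default-policy modify --data-storage-target n410_72tb-ssd_32gb --snapshot-storage-target n410_72tb-ssd_32gb
    # Set the default data access pattern to streaming
    isi filepool default-policy modify --data-access-pattern streaming
    # Verify the result
    isi filepool default-policy view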

Patience is a virtue: autobalancing and ProtectPlus

You should now have a cluster overview with old and new nodes. What I like to do now is… nothing! Wait for a bit: the addition of the new nodes will have triggered several OneFS jobs such as FlexProtect and MultiScan. These jobs restripe protected files and directories across pools and nodes. Monitor their progress and, where necessary, increase their impact policy to speed things up.
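Monitoring and nudging those jobs is easiest from the CLI (a rough sketch; the job ID below is an example):

    # See which jobs are running and how far along they are
    isi job jobs list
    isi job status
    # Raise the impact policy of a running job, e.g. job ID 273
    isi job jobs modify 273 --policy MEDIUM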

Once they finish, kick off a SetProtectPlus, followed by an AutoBalance job. These jobs will apply the SmartPools default policy and balance capacity across the compatible pools and nodes in the Isilon system.
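Both jobs can be started from the Job Operations page or from the CLI (a sketch, using the default impact policies):

    # Apply the default file pool policy to existing data
    isi job jobs start SetProtectPlus
    # Rebalance capacity across nodes and pools afterwards
    isi job jobs start AutoBalance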

Isilon Tech Refresh - Cluster Info after initial balancing
Nodes 6 and 7 did get some IP addresses, but I removed them in this screenshot. Also, the yellow warnings are due to insufficient IP addresses in the SmartConnect pool: something that resolved itself automatically once excess nodes were removed.

Eventually you’ll end up with a rough balance based on the percentage of used disk space. Now comes the real Isilon tech refresh: smartfailing the old nodes so they can be removed from the cluster.

SmartFail old Isilon nodes

Isilon tech refresh - Smartfail Isilon Node

Select an old node you want to phase out, and click Submit. You’ll get a warning asking you to confirm your decision. I opted to work my way down from node 4 to node 1: node 1 was serving the web interface, so I didn’t want to log in again each time.

You can smartfail multiple nodes at the same time. For the first cluster I failed the old nodes one at a time. The downside of this approach is that OneFS will try to move the data to nodes in the same node pool. Hence it will move the data from node 4 to nodes 1-3. Later, when you fail node 3, it will pick that data up again and try to move it to nodes 1-2. This feels inefficient: data is picked up and moved multiple times.

In an ideal world, you’d have a SmartPools license and you’d be able to move all the data to the new pool using a policy. This would make the whole smartfail process much quicker, since there would be no data to move.

Anyway, for the second cluster I opted to smartfail two nodes at a time.
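Smartfailing can also be done from the CLI, which is convenient when failing a couple of nodes in one go. A hedged example: this is roughly the OneFS 8.x syntax (older 7.x releases use a different isi devices form), and the LNNs are examples:

    # Smartfail nodes 3 and 4 by their logical node numbers
    isi devices node smartfail --node-lnn 3
    isi devices node smartfail --node-lnn 4
    # Watch the FlexProtect job drain the data off the failed nodes
    isi job jobs list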

Isilon tech refresh - Smartfail in progress

In our case, nodes 1 and 2 received all the data from nodes 3 and 4. It took another AutoBalance job to level out node utilization across the Isilon cluster and onto the new nodes. Finally, I smartfailed nodes 1 and 2.

Once all jobs complete and the nodes disappear from the Cluster Overview, check their status in Cluster Configuration -> Hardware Configuration. If the old nodes show up in the “add nodes” page, they are truly gone from the cluster. Alternatively, their front panels should show an “add node to cluster” message. You can now power off the old NL400 nodes by holding the power button on the back for a few seconds.

Cleaning up the old node pool

Once you remove all the old nodes, the old NL400 node pool might still be visible and in an attention state in the SmartPools overview. There’s a method of removing this deprecated/orphaned node pool using the disi command.

CAUTION & DISCLAIMER: the disi command is an advanced command, normally reserved only for Isilon tech support. Use it incorrectly and you will lose data. In case of even the slightest doubt, contact Dell EMC support and ask them to assist you. If you ruin your system with the disi command, don’t blame me. And don’t yell at the Dell EMC engineer when he/she can’t fix it, because he/she can’t restore something that’s gone!

List your node pools using disi -I diskpools ls (that’s a capital i, for INTERNAL). You’ll see both the old and the new disk/node pools.

Isilon tech refresh - Delete old node pool

n400* signifies the old pool, so that’s what I had to delete. In the above screenshot, I started by deleting all the child entries (n400_36tb_48gb:<number>). For another cluster, I started by deleting the parent group (n400_36tb_48). The only observed difference is that, with the first method, the “orphan:#” nickname only becomes visible after the delete commands. With the other method, where you delete the parent group first, all the other entries immediately change into “orphan:#” entries; afterwards, you delete them using those orphan names.

At the end you should be left with a clean CLI output showing you only the hardware that’s actually in the cluster:

Nodepools clean

And a similar GUI:

Nodepools clean 2

Now, run the Isilon Gather script, which you can find under Cluster Management > Diagnostics. This will generate a new package of logfiles and ship it off to Dell EMC. Also, inform Dell EMC that you’ve phased out your old nodes and taken the new nodes into production. Finally, update ESRS: remove the old nodes and add the new serial numbers.
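If you prefer the CLI over the Diagnostics page, the same gather can be started with isi_gather_info; depending on your connectivity setup it will upload automatically or (typically) leave the package under /ifs/data/Isilon_Support/ for manual upload:

    # Generate a full log gather and (if configured) upload it to Dell EMC
    isi_gather_info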

Any tips, tricks or questions regarding this Isilon tech refresh? Please comment below!