Last year EMC installed two new three-node Isilon clusters for one of our customers. 13 months later the customer needs more capacity so the time for adding Isilon nodes to the existing clusters has finally come. Good news for me: since the initial install I got certified to install Isilon systems myself so these expansions are all mine! EMC marketing promises an Isilon cluster expansion in 60 seconds; let’s put it to the test!
The customer uses the Isilon clusters for long-term archive storage for their PACS systems: data will have to be stored for 15 years or longer. In essence this means that no data will be thrown away and the systems will keep on filling up for at least the next 15 years.
The clusters that need to be expanded are three-node NL400 Isilon clusters running OneFS 7.0.2.7. These are 4U nodes with 36 drives each, plus CPU, RAM and network interfaces. Installing them in the rack is fairly straightforward (read the initial cluster install here): install the rails, slide in the node, insert the drives (24 in the front drive bays of the NL400, 12 in the rear bays) and attach the front panel. When inserting the drives, check that you’re installing the correct drives for the node: the drive packages have a label on the outside with the node S/N. Also check that there is no foam trapped in the connector of the drive. Next up, install the InfiniBand cabling, the external network cables (make sure the network team has the switchports pre-configured!) and the power cables. And that’s it for the physical portion in the datacenter…
Next up, let’s head to the OneFS management page to perform some pre-checks. Of course you’re not going to expand a cluster that’s not healthy: check the cluster overview for errors. If everything is OK, navigate to the network provisioning rules. Ideally you have a set of provisioning rules that automatically assigns the new node’s interfaces to the correct subnets and pools.
Finally, make sure there’s no firmware package installed on the cluster. This is a lesson in “follow the procedure to the letter”. The procedure to upgrade the firmware on an Isilon cluster clearly states at the end to remove all firmware packages. I thought I was smarter than the procedure: “Oh, I’m adding a node in a short while anyway; let’s just keep the package on the cluster so it can automatically update the firmware.” Well... no. If you have a firmware package installed on the cluster, adding Isilon nodes is not possible: the whole process halts with an “error downloading new image from cluster” message. The LCD front panel of the new node goes blank and the error only shows up in the OneFS GUI. Follow the steps in the screenshot above to remove the firmware package (or search for emc166311 on support.emc.com for the full KB article) prior to adding the Isilon nodes.
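To keep myself honest about these pre-checks, they can be written down as a tiny checklist. The Python below is only a sketch: it does not talk to a cluster, and the `ClusterState` class, its field names and the `precheck` function are all made up for illustration. You fill in the values by hand from what the OneFS GUI shows. I treat errors and a leftover firmware package as blockers, and missing provisioning rules as a warning, since those are "ideally" there but not strictly required.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    # Hypothetical fields, filled in manually from the OneFS GUI.
    has_errors: bool          # cluster overview shows errors
    provisioning_rules: int   # number of network provisioning rules
    firmware_packages: int    # firmware packages still installed


def precheck(state: ClusterState) -> tuple[list[str], list[str]]:
    """Return (blockers, warnings); no blockers means ready to expand."""
    blockers: list[str] = []
    warnings: list[str] = []
    if state.has_errors:
        blockers.append("cluster overview shows errors")
    if state.firmware_packages > 0:
        blockers.append("firmware package still installed")
    if state.provisioning_rules == 0:
        warnings.append("no provisioning rules: assign interfaces by hand")
    return blockers, warnings


if __name__ == "__main__":
    # Example values only: a healthy cluster with a forgotten firmware package.
    blockers, warnings = precheck(ClusterState(False, 2, 1))
    for line in blockers:
        print("BLOCKER:", line)
    for line in warnings:
        print("warning:", line)
    if not blockers:
        print("ready to add nodes")
```

With the example values this flags the firmware package as a blocker, which is exactly the mistake I made.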
Time for the actual expansion: press the little power button on the rear of the server, wait a couple of minutes for the LCD on the front of the server to come alive and use the buttons on the front panel to select and confirm the cluster you want to join. If you’re using dedicated Infiniband switches per cluster and haven’t enabled secure join mode this should just be a single push on the square button. You can make a “beep” sound while you press the button if you like…
That’s it! The node will download the necessary images and downgrade or upgrade its OneFS version automatically depending on the cluster version. Once everything matches it will join the cluster. It might throw an error about the network interfaces not having an IP address; this should automatically resolve if your network provisioning rules and networking team have set up everything correctly. Do ask them to check all the interfaces came up alright and whether everything looks good from a networking side.
The Isilon cluster will automatically start to rebalance data across the available cluster nodes with the MultiScan job. This rebalancing happens across the internal InfiniBand network so no worries: your external network will not be burdened with TBs of traffic. It might take a long, long time though. You can check the progress on the Operations page, speed it up by adjusting the impact policy for the running job or pause it to allow other jobs to complete first. For example, on the secondary cluster (only used in case of DR) I’ve adjusted the default impact from Low to Medium to speed it up a bit…
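To get a feeling for why "a long, long time" is not an exaggeration, here is a back-of-the-envelope estimate in Python. It is a rough sketch under loud assumptions: data ends up spread evenly over all nodes, so the new nodes receive their proportional share of the used capacity, and the job moves data at a constant rate. The function name, the 200 TB of used capacity and the 200 MB/s rate are hypothetical example values, not measurements from these clusters.

```python
def rebalance_estimate(used_tb, old_nodes, new_nodes, throughput_mb_s):
    """Rough (data moved in TB, duration in hours) for a rebalance.

    Assumes an even spread over all nodes afterwards and a constant
    transfer rate; real jobs vary with the impact policy and load.
    """
    total_nodes = old_nodes + new_nodes
    # The new nodes end up holding their proportional share of the data.
    moved_tb = used_tb * new_nodes / total_nodes
    # 1 TB = 1,000,000 MB (decimal units).
    seconds = moved_tb * 1_000_000 / throughput_mb_s
    return moved_tb, seconds / 3600


if __name__ == "__main__":
    # Hypothetical example: 200 TB used, 3 nodes plus 1 new, 200 MB/s.
    moved, hours = rebalance_estimate(200, 3, 1, 200)
    print(f"about {moved:.0f} TB to move, roughly {hours:.0f} hours")
```

Even with generous numbers this lands in the range of days rather than hours, which is why adjusting the impact policy on a cluster that has nothing better to do can be worth it.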
Back to the 60 seconds claim. Yes, if your new node has the same OneFS version as the cluster, the joining process is over in less than a minute. And as soon as the node is attached to the cluster, the capacity is available for the end-users! If you include racking and stacking and checking a couple of things in advance, it will take you somewhere close to an hour or two to do it neatly. Not quite 60 seconds, but still immensely fast and easy compared to other systems out there!