Isilon

24 posts

Connecting your Dell EMC systems to SRS, the easy way!

Dell EMC uses Secure Remote Services (SRS, formerly known as ESRS) to enhance the tech support experience for their products. There are two sides to this support: connect home and connect in. Connect home means your device dials home to Dell EMC to report various things such as errors, automatic support uploads, etc. If any of this results in a Service Request at Dell EMC, an engineer can then use SRS to dial in / connect in and have a look at the faulty system. The latter saves you from having to host a Webex session.

Dell EMC likes to have all its systems connected to SRS, for two reasons. First of all, it reduces the time engineers spend troubleshooting an issue. If an engineer can dial in directly, without having to negotiate a Webex session with the customer, that means more SRs per engineer per day and lower support costs for Dell EMC. Secondly, it results in faster incident resolution, and thus a happier customer. The support engineer can look up the state of a defective drive independently and order new parts while the customer is sleeping. Win-win!

As such, Dell EMC encourages us partners to connect all new systems to SRS. I have been doing that for some years now, but noticed I was using an antiquated approach. It turns out many of the new systems have REST API-based methods to register themselves with SRS. Here’s how!
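
For reference, here is a minimal sketch of the older CLI route on a OneFS 8.x cluster; the isi remotesupport connectemc namespace and the gateway address below are assumptions for your environment, so check your OneFS version and SRS gateway details (the post itself covers the newer REST-based registration).

    # Hedged sketch: point the cluster at an existing SRS gateway via the CLI.
    # The gateway IP is a placeholder.
    isi remotesupport connectemc modify --enabled=yes --primary-esrs-gateway=192.0.2.10
    isi remotesupport connectemc view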

Continue reading

Isilon SyncIQ uses incorrect interface: careful with mgmt DNS records!

I’ve installed quite a few new Isilon clusters in 2019. All of them are generation 6 clusters (H400, H500, A200), using the very cool 4-nodes-in-a-chassis hardware. Common to all these systems is a 1GbE management port next to the two 10GbE ports. While Isilon uses in-band management, we typically use those UTP ports for management traffic: SRS, HTTP, etc. We assign those interfaces to subnet0:pool0 and make it a static SmartConnect pool. This assigns one IP address to each interface; if you do it right, these should be sequential.
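
For illustration, creating such a static pool might look roughly like the sketch below on OneFS 8.x; the pool name, address range and the mgmt-1 interface name are assumptions, so verify the interface names with isi network interfaces list first.

    # Hedged sketch: put the 1GbE management ports of nodes 1-4 in a static pool.
    # Interface name and IP range are placeholders.
    isi network pools create groupnet0.subnet0.pool0 \
        --ranges=192.168.10.11-192.168.10.14 \
        --ifaces=1-4:mgmt-1 \
        --alloc-method=static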

A recent addition to my install procedure is to create some DNS A-records for those management ports. This makes it a bit more human-friendly to connect your browser or SSH client to a specific node. In line with the Isilon naming convention, I followed the -# suffix format: if the cluster is called cluster01, node 1 is cluster01-1, node 2 is cluster01-2, and so on. However, it turns out this messes up your SyncIQ replication behavior!
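
For illustration, the records in question might be created with nsupdate against a dynamic zone, something like the sketch below (server, zone, names and addresses are all placeholders); the post explains why exactly these records can trip up SyncIQ.

    # Hedged sketch: add per-node A-records for the management IPs.
    nsupdate -k /etc/rndc.key <<'EOF'
    server ns1.example.com
    zone example.com
    update add cluster01-1.example.com. 3600 A 192.168.10.11
    update add cluster01-2.example.com. 3600 A 192.168.10.12
    update add cluster01-3.example.com. 3600 A 192.168.10.13
    update add cluster01-4.example.com. 3600 A 192.168.10.14
    send
    EOF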

Continue reading

PSA: Isilon L3 cache does not enable with a 1:1 HDD:SSD ratio

Isilon L3 cache not enabling

I recently expanded two 3-node Isilon X210 clusters with one additional X210 node each. The clusters were previously installed with OneFS 7.x and upgraded to OneFS 8.1.0.4 sometime in late 2018. A local team racked and cabled the new Isilon nodes, after which I added them to the cluster remotely via the GUI. Talk about teamwork!

A brief time later the node showed up in the output of the isi status command. As you can see in the picture to the right, something was off: the SSD storage didn’t show up as Isilon L3 cache. A quick check showed that the hardware configuration was consistent with the previous, existing nodes. The SmartPools settings/default policy was also set up correctly, with SSDs employed as L3 cache. Weird…
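
For reference, checking and forcing the L3 setting per node pool might look roughly like this on OneFS 8.x; the node pool name below is a placeholder, so list the pools first.

    # Hedged sketch: verify whether L3 cache is enabled on the node pool,
    # and enable it explicitly if it is not (pool name is a placeholder).
    isi storagepool nodepools list
    isi storagepool nodepools modify x210_23tb_800gb-ssd_48gb --l3 true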

Continue reading

Reassign Isilon node IP addresses; go OCD!

A while ago I installed two new Isilon H400 clusters. With any IT infrastructure, consistency and predictability are key to a trouble-free experience in the years to come. Cables should be neatly installed, labeled and predictable. When wiring the internal network cables, it helps if nodes 1 through 4 are connected to switch ports 1 through 4 in order, instead of 1, 4, 2, 3. While some might consider this OCD, it’s the attention to detail that makes later troubleshooting easier and faster. Like a colleague said: “If someone pays enough attention to the little details, I can rest assured that he definitely pays attention to the big, important things!”

So I installed the cluster, configured it, then ran an isi status to verify everything. Imagine my delight when I saw this:

Isilon nodes before reassigning node IPs

Aaargh!
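
To see where things stand before fixing anything, the current assignments can be inspected with something along these lines (the pool name is a placeholder); the post walks through actually reordering the addresses.

    # Hedged sketch: check which IP address is bound to which node interface.
    isi network interfaces list
    isi network pools view groupnet0.subnet0.pool0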

Continue reading

Isilon node loses network connectivity after reboot

Isilon H400 chassis with serial cable attached

In my previous post I described how to reformat an Isilon node if for some reason the cluster creation is defective. After we got our new Gen 6 clusters up and running, we ran into another peculiar issue: the Isilon nodes lose network connectivity after a reboot. If we then unplugged the network cable and moved it to a different port on the Isilon node, the network would come online again. Move the cable back to the original port: connectivity OK. Reboot the node: “no carrier” on the interface, and no connectivity.
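
For reference, the symptom can be confirmed from the node’s console with standard tools; OneFS is FreeBSD-based, so ifconfig is available (the grep pattern below just narrows the output to the link status lines).

    # Hedged sketch: look for interfaces reporting "no carrier" after the reboot.
    ifconfig -a | grep -E 'flags=|status'
    isi network interfaces list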

Continue reading

Reformat an Isilon node and try again!

Isilon H400 chassis with serial cable attached

While installing a new Dell EMC Isilon H400 cluster, I noticed that node 1 in the chassis was acting up a bit. It allowed me to go through the initial cluster creation wizard, but didn’t run through all the steps and scripts afterwards. I left the node in that state while I installed another cluster, but after two hours or so nothing had changed. With no other options left, I pressed Ctrl + C: the screen became responsive again and eventually the node rebooted. However, it would never finish that boot, instead halting at “/ifs not found”. Eventually, it needed a reformat before it would function properly again…
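
For context, the reformat itself is done from the node’s serial console; the isi_reformat_node utility exists on OneFS, but treat the invocation below as an assumption and double-check with support before running it, since it wipes the node back to an unconfigured state.

    # Hedged sketch: reformat a misbehaving, not-yet-joined node from its console.
    # This is destructive and prompts for confirmation.
    isi_reformat_node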

Continue reading

Isilon Tech Refresh – Replacing old NL400 Isilon nodes with NL410s

Last month I performed an Isilon tech refresh of two clusters running NL400 nodes. In both clusters, the old NL400 36TB nodes were replaced with 72TB NL410 nodes with some SSD capacity. The first step in the whole process was the replacement of the InfiniBand switches. Since the clusters were fairly old, a OneFS upgrade was also on the list before the cluster could recognize the NL410 nodes. Dell EMC has extensive documentation on the whole OneFS upgrade process: check the support website, because there are a lot of version dependencies. Finally, everything was prepared and I could begin with the actual Isilon tech refresh: getting the new Isilon nodes up and running, moving the data and removing the old nodes.
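
Removing the old nodes boils down to smartfailing them and letting the cluster re-protect the data elsewhere; a rough sketch of the OneFS 8.x commands involved, with the LNN below as a placeholder:

    # Hedged sketch: smartfail an old NL400 node and watch the data evacuate.
    isi devices node smartfail --node-lnn=1
    isi status        # keep an eye on the running protection/restripe jobs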

Continue reading

Linux 101: Disown long running jobs like an InsightIQ database upgrade

Long InsightIQ upgrade process

If you’re remotely managing a Linux machine, you’ll probably use an SSH connection to run commands on that machine. There’s one problem with this approach: if you close the SSH connection, any long-running jobs/commands started from it will be terminated. If you know a job will take a long time and you won’t be able to babysit the SSH connection, you can plan accordingly. But what if you underestimated the time a job would take, and you need to disconnect anyway? Here’s how to keep the job running AND make it home in time for dinner!
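
The trick the title hints at is the shell’s job control: suspend the running job, push it to the background, and disown it so it no longer receives the hangup signal when the SSH session closes. A minimal bash example (the script name is a placeholder):

    # Job is already running in the foreground over SSH:
    #   press Ctrl+Z to suspend it, then:
    bg             # resume the suspended job in the background
    disown -h %1   # detach job 1 from the shell so it survives the disconnect

    # If you know up front that a job will run long, start it detached instead:
    nohup ./long_running_upgrade.sh > upgrade.log 2>&1 &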

Continue reading

Linux 101: Extend root partition of an InsightIQ VM

InsightIQ old free space

While upgrading OneFS it’s important to keep the InsightIQ software version compatible with the Isilon systems. In this case, InsightIQ hadn’t been updated for a while and I had to upgrade from 3.0 -> 3.1 -> 3.2 -> 4.x. The actual upgrade process isn’t too hard (it just takes a lot of time), but there’s one little prerequisite in the 3.1 -> 3.2 upgrade: a minimum of 502 MB free space in the root partition. As you can see in the screenshot, I wasn’t even close to that requirement: I got to 357 MB, and that’s after cleaning up redundant stuff. Time to add some more disk space and extend the root partition!
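
A rough sketch of what that looks like, assuming the InsightIQ VM uses LVM with an ext4 root filesystem and that a freshly added virtual disk shows up as /dev/sdb; the volume group and logical volume names below are assumptions, so check vgs/lvs first.

    # Hedged sketch: grow the root LV onto a newly added virtual disk.
    pvcreate /dev/sdb                         # initialise the new disk for LVM
    vgextend vg_insightiq /dev/sdb            # add it to the root volume group (name assumed)
    lvextend -L +2G /dev/vg_insightiq/lv_root
    resize2fs /dev/vg_insightiq/lv_root       # grow the ext4 filesystem into the new space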

Continue reading

Safely replace an Isilon InfiniBand Switch with these steps

Isilon InfiniBand switch, 8-port Mellanox

Every once in a while you might need to replace an Isilon InfiniBand switch. Possibly because of a broken switch, the need for more ports, or because the old switch is simply too old. Good news: it’s a fairly straightforward job. And if your cluster has two switches, you can replace one switch at a time without an outage.
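
Before and after moving the cables it’s worth verifying that every node still sees a link on both back-end fabrics; a hedged sketch using isi_for_array (the ib0/ib1 interface names are an assumption and may differ per platform):

    # Hedged sketch: check the link status of the int-a/int-b back-end interfaces on all nodes.
    isi_for_array -s 'ifconfig ib0 | grep -i status'
    isi_for_array -s 'ifconfig ib1 | grep -i status'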

Continue reading