I’ve installed quite a few new Isilon clusters in 2019. All of them are generation 6 clusters (H400, H500, A200), using the very cool 4-nodes-in-a-chassis hardware. Common to all these systems is a 1GbE management port next to the two 10GbE ports. While Isilon uses in-band management, we typically use those UTP ports for management traffic: SRS, HTTP, etc. We assign those interfaces to subnet0:pool0 and make it a static SmartConnect pool. This assigns one IP address to each interface; if you do it right, these should be sequential.
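For reference, here is a rough sketch of what that pool setup looks like on the CLI. The pool name, interface names and addresses below are made up, and the exact flags can differ per OneFS version:

  # Hypothetical example: a static pool spanning the mgmt-1 interfaces of a 4-node chassis
  isi network pools create groupnet0.subnet0.pool0 --ranges=192.0.2.11-192.0.2.14 --ifaces=1-4:mgmt-1 --alloc-method=static
  # Check which IP ended up on which node/interface
  isi network interfaces list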
A recent addition to my install procedure is creating DNS A-records for those management ports. This makes it a bit more human-friendly to connect your browser or SSH client to a specific node. In line with the Isilon naming convention, I followed the -# suffix format: if the cluster is called cluster01, node 1 is cluster01-1, node 2 is cluster01-2, and so on. However, it turns out this messes up your SyncIQ replication behavior!
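To illustrate (zone name and addresses are hypothetical), these are plain per-node A-records, easy to check from any workstation:

  # Per-node records following the cluster naming convention
  dig +short cluster01-1.example.com A   # node 1 mgmt IP, e.g. 192.0.2.11
  dig +short cluster01-2.example.com A   # node 2 mgmt IP, e.g. 192.0.2.12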
SyncIQ using the wrong interfaces
Once both clusters were up and running, I moved the InsightIQ database over to the new systems first, as I usually do. While setting up the SyncIQ replication, I immediately noticed that the replication to the secondary system took too long: the secondary cluster ingress was only 2Gbps. This was odd, as I would have expected something in the 10-20Gbps range. For the curious readers: only 2 nodes per cluster are currently connected to the network, as the customer ran out of 10GbE switch ports. Once the old Isilons are phased out, more ports will free up and we can connect all nodes redundantly with 2x 10GbE.
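A quick way to spot that imbalance is the per-node statistics on the target cluster; the network in/out columns show where the traffic is landing (flags and column names may vary a bit between OneFS versions):

  # Per-node CPU, disk and network throughput
  isi statistics system --nodes=all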
I verified the SmartConnect zones: all good. SmartConnect zone DNS lookups were good. And the SyncIQ policies referenced the correct SmartConnect zone on the target system and restricted the worker connections to that zone and its nodes.
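For completeness, this is roughly how I check a SmartConnect zone: query the SmartConnect service IP directly and via the normal resolvers, and both should hand out addresses from the right pool (zone name and service IP are made up):

  # Ask the SmartConnect service IP directly for the replication zone
  dig @192.0.2.10 repl.cluster01.example.com A +short
  # And via the regular DNS chain; the delegation should end up in the same place
  dig repl.cluster01.example.com A +short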
I did notice something on the target (secondary) cluster though. Viewing the Local Targets, I saw that the SyncIQ coordinator IP was off: it showed an IP address in the management pool instead of the replication pool.
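On the CLI this shows up under the sync targets; something along these lines (policy name hypothetical):

  # On the target cluster: list the policies replicating to this cluster,
  # then look at the coordinator IP of a specific one
  isi sync target list
  isi sync target view insightiq-db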
force_interface and Source Based Routing
Digging through the Dell EMC KB, I found a CLI command that forces SyncIQ to only use the interfaces in the pool for communication between the source and target clusters. This is disabled by default, so it looked like a good place to start. The command is: isi_classic sync config --force_interface=on
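For completeness, this is what I ran; as far as I know, isi_classic sync config without arguments prints the current settings, so you can confirm the change (double-check this on your OneFS version):

  # Run on both the source and the target cluster
  isi_classic sync config --force_interface=on
  # Show the current SyncIQ configuration to confirm
  isi_classic sync config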
After enabling this on both clusters and recreating the SyncIQ policy, the coordinator IP address was displayed correctly on the target cluster. However, traffic was still ending up on the wrong interfaces.
Next in the troubleshooting chain was Source Based Routing. Enabling this on an Isilon cluster basically ensures that traffic leaves the Isilon on the same interface it came in on. It changes some of the routing and gateway decisions, which is particularly useful in complex networking setups. Read more on SBR over here. Our setup didn’t qualify as complex, but it doesn’t really hurt to enable it on a new cluster anyway.
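On OneFS 8.x this is a single setting on the external network configuration; verify the exact flag against the documentation for your release:

  # Enable Source Based Routing cluster-wide
  isi network external modify --sbr=true
  # Confirm the setting
  isi network external view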
At that point, traffic was leaving the source cluster on the right interface. It was, however, still entering the target cluster over the 1GbE mgmt-1 interfaces. Having reached the end of my troubleshooting knowledge, I asked Dell EMC for some help.
SyncIQ skip_lookup flag
After a three-hour Webex session, the engineer pointed to KB 334043. This article explains the skip_lookup SyncIQ setting, which is useful in situations where you NAT the Isilon IP addresses or have a split-horizon DNS. The SyncIQ skip_lookup flag prevents local name resolution of the target nodes. With --skip_lookup disabled (the default), the coordinator constructs a short per-node hostname for the target cluster (cluster01-1, cluster01-2, and so on in this case) and does a hostname-to-IP lookup against it. If that lookup returns an IP, it will use that instead. Enabling the skip_lookup flag skips this whole lookup, and SyncIQ simply uses the IP addresses the target cluster returned in the initial handshake.
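You can see the mechanism by doing that same short-name lookup yourself from a node on the source cluster (node name hypothetical). If this returns one of the management addresses, that is the IP SyncIQ will latch onto:

  # What the coordinator effectively does with --skip_lookup off:
  # resolve the short per-node name of the target cluster
  nslookup cluster01-2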
The downside is that the --skip_lookup flag needs to be enabled for every individual SyncIQ policy. If someone forgets it, that SyncIQ policy will replicate over the wrong interfaces, potentially causing issues. And the nagging voice in my head (don’t tell my doc!) kept complaining that this was overly complex for an otherwise very simple and straightforward Isilon deployment.
Lookups… DNS… management IP addresses… waaaiiit!
Yep. After removing the A-records for the individual management IP addresses of the nodes and restarting the replication, it suddenly used the 10GbE ports. It turns out that creating A-records which use the Isilon internal node names (cluster01-1, cluster01-2, cluster01-3, etc.) is a big NO-NO. So be careful with the DNS records you create. Using a slightly different name, e.g. cluster01-node1, will most likely be perfectly fine. We’ll create those the next time we’re on-site, as it makes life a bit easier than remembering a bunch of IP addresses.
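After cleaning up the records it’s worth confirming that the bare node names no longer resolve, while a friendlier alias still can (zone and names are again just an example):

  # Should return nothing anymore: the internal node name must stay unresolvable
  dig +short cluster01-2.example.com A
  # An alternative management alias can safely point at the node's mgmt IP
  dig +short cluster01-node2.example.com A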
I’m a teeny tiny bit disappointed that the support engineer didn’t catch this. He was okay with enabling a flag on each and every current and future SyncIQ policy, further pushing the Isilon away from a “standard” install. I did update the SR afterwards to let him know about the root cause; hopefully it ends up in a KB article. Good confirmation for myself that I shouldn’t fall asleep during a Webex session though!