I recently expanded two 3-node Isilon X210 clusters with one additional X210 node each. The clusters were originally installed with OneFS 7.x and upgraded to OneFS 8.1.x sometime in late 2018. A local team racked and cabled the new Isilon nodes, after which I added them to the clusters remotely via the GUI. Talk about teamwork!
A short while later the new node showed up in the output of the isi status command. As you can see in the picture to the right, something was off: the SSD storage didn’t show up as Isilon L3 cache. A quick check showed that the hardware configuration matched that of the existing nodes. The SmartPools settings/default policy was also set up correctly, with SSDs employed as L3 cache. Weird…
Isilon L3 cache settings
Isilon nodes are grouped in node pools. If you have multiple slightly different nodes, you can create a compatibility rule which, for example, allows X200 nodes to join the X210 node pool. This cluster only has X210 nodes with identical configurations in it, so the default node pool settings applied.
As you can see, the Isilon shows 4 nodes, with the SSDs configured as L3 cache. However, L3 cache isn’t being enabled on the 4th node.
You can also get this overview via the CLI, using the isi storagepool nodepools list -v command:
With all settings correct, I resorted to support.emc.com. I ended up with KB article 524088, with the catchy title: SSD Can’t Be Converted to L3 and Provisioned in Node Pool Due To 2:1 HDD to SSD ratio restriction in OneFS 8.1
KB 524088: Isilon HDD to SSD ratio
A quick isi devices drive firmware list command later, and… yep! 6 SSDs and 6 HDDs.
The KB article also shows why this suddenly happened: a new feature in the OneFS 8.1.x code now requires a node to have at least twice as many HDDs as SSDs before it will allow L3 cache to be enabled.
The article doesn’t explain why this has suddenly changed in the new OneFS version, so I can only speculate here. I expect it has something to do with efficient usage of the available SSD capacity, giving Isilon customers the most bang for buck. A system benefits from a certain percentage of cache to speed up IO. If you have too little cache for your dataset, performance is suboptimal. If you have too much cache, more than your working dataset can really use, performance doesn’t increase too much but your cost does. In those cases, it’s more efficient to use the SSDs for actual data storage, instead of caching.
The solution in the KB article is also simple: contact your Dell EMC account team and replace SSDs with HDDs until the node has at least twice as many HDDs as SSDs. As a bonus, this should also give you a bit more capacity.
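To put numbers on that, here is a minimal sketch of the ratio arithmetic in Python. It only encodes the rule as described above (at least twice as many HDDs as SSDs) and assumes each swap replaces one SSD with one HDD in the same slot. The function names are made up for illustration and are not part of OneFS or any Isilon tooling.

```python
def meets_l3_ratio(hdds: int, ssds: int) -> bool:
    """Return True if the node has at least twice as many HDDs as SSDs."""
    return hdds >= 2 * ssds


def ssds_to_swap(hdds: int, ssds: int) -> int:
    """Number of SSDs to replace with HDDs to reach the 2:1 ratio.

    Assumes one-for-one swaps, so the total drive count stays the same.
    """
    total = hdds + ssds
    # With a fixed drive count, at most one third of the drives may be SSDs.
    max_ssds = total // 3
    return max(0, ssds - max_ssds)


if __name__ == "__main__":
    hdds, ssds = 6, 6  # the drive counts found on the new node
    swaps = ssds_to_swap(hdds, ssds)
    print(f"Ratio OK now: {meets_l3_ratio(hdds, ssds)}")
    print(f"SSDs to swap for HDDs: {swaps}")
    print(f"Ratio OK after swap: {meets_l3_ratio(hdds + swaps, ssds - swaps)}")
```

For the 6 SSD / 6 HDD node in this post, that works out to swapping two SSDs, leaving 8 HDDs and 4 SSDs. Whether Dell EMC would propose exactly that layout is a separate question; this is just the arithmetic.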
There is of course also the other option of disabling Isilon L3 cache and using the SSDs for data storage. We have the SmartPools license, but this would deviate from the standard building block and introduce another layer of management: we would then have to decide which data deserves to be on the SSDs. It’s best to let the system handle that automatically with L3 cache, so for us that’s not an option!
My thoughts on this issue
The KB article perfectly explains what happened. Our system was deployed with OneFS 7.x, which allowed us to enable L3 cache. Later last year, the operations teams upgraded all Isilons to 8.1.x code. Promptly thereafter I came along and expanded two of their systems with an additional node, hitting this issue.
If my speculation is anything close to the mark, it makes perfect sense why this code improvement was implemented. I would have expected this to be pushed out to MyQuotes as well though, so pre-sales teams would get a big red flag when they quote a system with 1:1 HDD:SSD ratios. This probably slipped through the cracks.
Fortunately, there’s no real risk for production though. The new, fourth Isilon node has added capacity to the cluster. The system is stable while we wait for the response from Dell EMC on how to proceed. In the meantime, if you want to read a bit more about the different Isilon cache levels, check out this whitepaper.