troubleshooting

8 posts

Reassign Isilon node IP addresses; go OCD!

A while ago I installed two new Isilon H400 clusters. With any IT infrastructure, consistency and predictability are key to a trouble-free experience in the years to come. Cables should be neatly installed, labeled and predictable. When wiring the internal network cables, it helps if nodes 1 through 4 are connected to switch ports 1 through 4 in order, instead of 1, 4, 2, 3. While some might consider this OCD, it’s the attention to detail that makes later troubleshooting easier and faster. As a colleague said: “If someone pays enough attention to the little details, I can rest assured that he definitely pays attention to the big, important things!”

So I installed the cluster, configured it, then ran an isi status to verify everything. Imagine my delight when I saw this:

Isilon nodes before reassigning node IPs

Aaargh!

Continue reading

Isilon node loses network connectivity after reboot

In my previous post I described how to reformat an Isilon node if for some reason the cluster creation is defective. After we got our new Gen 6 clusters up and running, we ran into another peculiar issue: the Isilon nodes lose network connectivity after a reboot. If we then unplugged the network cable and moved it to a different port on the Isilon node, the network would come online again. Move the cable back to the original port: connectivity OK. Reboot the node: “no carrier” on the interface, and no connectivity.

Continue reading

Reformat an Isilon node and try again!

While installing a new Dell EMC Isilon H400 cluster, I noticed node 1 in the chassis was acting up a bit. It allowed me to go through the initial cluster creation wizard, but didn’t run through all the steps and scripts afterwards. I left the node in that state while I installed another cluster, but after two hours or so, nothing had changed. With no other options, I pressed Ctrl + C: the screen became responsive again and eventually the node rebooted. However, it would never finish that boot, instead halting at “/ifs not found”. Eventually, it would need a reformat before it would function properly again…

Continue reading

VNX Unified standby data mover interface suspended after Cisco NX-OS upgrade

We’re in the midst of a VCE vBlock 340 software upgrade. Part of this upgrade process is upgrading the Cisco Nexus 5K switches that connect the blades and storage to the customer network. After upgrading the first switch, we noticed that its interface to the VNX Unified standby data mover (server_3) was suspended with a “no LACP PDUs” error message. A quick check on the switch that wasn’t upgraded yet showed that interface to be online. So what’s up with that?

Continue reading

Users don’t care if an application is supported, as long as it works

Last week we migrated several Oracle databases to a new DBaaS platform. The company I’m working for is in the midst of a datacenter migration to a new cloud provider. Since the Oracle databases were located on old and very expensive Oracle machines, we looked for opportunities to optimize and reduce costs. After much debate, we decided to move all databases to a shared Oracle Exadata platform. Much faster, and much cheaper: the hardware is more expensive, but you win it back with lower licensing costs (fewer sockets used).

All the Oracle database migrations went pretty well: stop the app, export the database, transfer it to the new DC, import and start the database. The app teams updated their connection strings and tested the apps. Pretty painless! However, there were also some scripts working alongside the databases, mainly for data loads. Server names changed and some scripts had to be moved from the old database servers to the application servers.

Continue reading

Making life a whole lot easier with Tintri VM-aware storage

According to Tintri, the rise of server virtualization broke the traditional storage system. Initially we had relatively simple environments where one server talks to a number of LUNs on a storage system. Sometimes we’d have a small cluster of servers accessing those volumes. Still relatively simple.

Fast forward to now: large clusters of hypervisor hosts are the norm, collectively accessing an even larger number of volumes. Each hypervisor in turn hosts a large number of virtual machines. In case of performance problems, how are you ever going to figure out the root cause and which other systems are affected?

Continue reading

ScaleIO Architecture and failure units

I had the opportunity to play with an EMC product that was new to me last week: ScaleIO. It’s definitely not a new product (I troubleshot the 1.31 version and EMC released 2.0 at EMC World 2016), but I just hadn’t had the honor of working with one of those systems yet. ScaleIO is a software-defined storage solution that uses the local disks in your commodity servers and shares them out as block LUNs over Ethernet. This means the architecture can scale pretty well, both in capacity and performance, using hundreds (if not thousands) of servers and disks.

Continue reading

Deleting an Isilon folder – Operation not permitted

When deleting an Isilon folder, you might come across some peculiar behavior. When you browse to an SMB share with a file explorer and delete a folder, the operation apparently succeeds and the folder disappears. After refreshing the share, however, the folder is back. If you then resort to an SSH session to delete the folder, you get an Operation not permitted error and the rm/rmdir command fails.

Continue reading