The Linux and Unix Menagerie: Cluster Server Failover Testing On Linux And Unix

A fine how do you do :)

WARNING/GUARANTEE: Today's post is the product of a tired mind that just finished working and didn't have much time to think beyond the automatic. If you feel you may be entertained, please continue reading. If you want to learn some really useful tricks to test a two-node cluster's robustness, this article may be for you, too. If you're looking for information you can apply in the workplace without being escorted from the building by armed guards, proceed no further ;)

As today's post title obliquely suggests, I'm going to take another day to properly formulate my response to our "F" reading experiment (not to be confused with the anti-literacy initiative ;) that we began on Monday. I've received a number of very interesting comments on the subject of the article that got the whole idea rolling around in that butter churn between my ears. Although none of the responses have radically changed my feelings on the subject, they have augmented them and provided some fresh perspective. I still intend to throw a little signature "meta" style into the post (because if we all read in the F formation, my post is going to have to play off of that to whatever degree I can manage :), but I'm now reconsidering my original rough draft and, possibly, working some additional angles into it. I've got some emails out there (as I always request permission to use other folks' opinions when they're kind enough to share) and hope to hear back soon. Worst case, I'll post the original tomorrow and add the comments as their own entities (attached, of course) at a later date.

Also, as this post's title overtly suggests, I spent most of my day testing cluster failover scenarios at work. I won't mention any proprietary or freeware brand names, as this post isn't specific enough to warrant the reference. But after today's exercise (which, of course, I've had to do more than a couple of different ways at a couple of different places of employment), I decided to put together a small comprehensive list of two-node cluster disaster/failure/failover scenarios that one should never push a cluster into production without performing.

It goes without saying that the following is a joke. Which is, of course, why I "wrote" it with my lips sealed ;)

Comprehensive Two-Node Cluster Failover Testing Procedure - v0.00001alpha

Main assumption: You have a two-node cluster all set up in a single rack, all service groups and resources are set to critical, no service groups or resources are frozen and pretty much everything should cause flip-flop (technical term ;)

1. Take down one service within each cluster service group (SG), one at a time. Expected result: Each cluster service group should fault and fail over to the secondary node. The SG's should show as faulted in your cluster status output on the primary node, and online on the secondary.

2. Turn all the services, for each SG, back on, one by one, on the primary node. Expected result: All of the SG's should offline on the secondary node and come back up on the primary.

3. Do the same thing, but on the secondary. Expected result for the first test: Nothing happens, except the SG's show as faulted on the secondary node. Expected result for the second test: Nothing happens, except the SG's show as back offline on the secondary node.

4. Switch SG's from the primary to secondary node cleanly. Expected result: What did I just write?

5. Switch SG's from the secondary node back to the primary node cleanly. Expected result: Please don't make me repeat myself ;)

6. Unplug all heartbeat cables (serial, high priority Ethernet, low priority, disk, etc.) except one on the primary node. Expected result: Nothing happens except, if you're on the system console, you can't type anything anymore because the cluster is going freakin' nuts with all of its diagnostic messages!

7. Plug all those cables back in. Expected result: Everything calms down, nothing happens (no cluster failover) except you realize that you accidentally typed a few really harmful commands and may have hit enter while your screen was draped with garbage characters. The primary node may be making strange noises now ;)

8. Do the same thing on the secondary node. Expected result: No cluster failover, but the secondary node may now be making strange low beeping sounds and visibly shaking ;)

9. Pull the power cords out of the primary node. Expected result: Complete cluster failover to the secondary node.

10. Put the plugs back in. Expected result: Complete cluster failback to the primary node.

11. Do the same thing to the secondary node. Expected results for both actions: Absolutely nothing. But you knew this already. Are you just trying to waste the company's time? ;)

12. Lightly kick the rack containing the primary and secondary nodes. Expected result: Hopefully, the noises will stop now...

13. Grab a screwdriver and repeatedly stab the primary node. Expected result: If you're careful you won't miss and cut yourself on the razor-sharp rack mounts. Otherwise, everything should be okay.

14. Pull the fire alarm and run. Expected result: The guy you blame it on may have to spend the night in lock-up ;)

15. Tell everyone everything's fine and the cluster is working as expected. Expected result: General contentment in the ranks of the PMO.

16. Tell everyone something's gone horribly wrong and you have no idea what. Use the console terminal window on your desktop and export it via WebVNC so that everyone can see the output from it. Before exporting your display, start up a program you wrote (possibly using script, run with the "-t" option to more accurately reflect realistic timing, although a bit faster). Ensure that this program runs in a continuous loop. Expected result: General pandemonium. Emergency conference calls, 17 or 18 chat sessions asking for status every 5 seconds, and dubious reactions to your carefully pitched voice, which should speak in reassuring terms but tremble just slightly, like you're a hair's breadth away from a complete nervous breakdown.

17. Go out to lunch. Expected result: What do you care? Hopefully, you'll feel full afterward ;)
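
For anyone who wants the semi-serious half of the list (steps 1 through 5 and 9 through 11) in a form that can be checked without unplugging anything, here is a minimal sketch in Python. It is a toy model, not any vendor's cluster software: the ServiceGroup class, its fail/restore/switch methods and the run_post_steps helper are all made up for illustration. The automatic failback to the primary simply mirrors the expected results written above; real cluster software may or may not fail back on its own, depending on how it is configured.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class ServiceGroup:
    """Toy model of one service group (SG) on a two-node cluster."""
    active: Optional[str] = "primary"               # node where the SG is online
    faulted: Set[str] = field(default_factory=set)  # nodes where the SG is faulted

    @staticmethod
    def other(node: str) -> str:
        return "secondary" if node == "primary" else "primary"

    def fail(self, node: str) -> None:
        """A critical service dies, or the power cord is pulled, on `node`."""
        self.faulted.add(node)
        if self.active == node:
            peer = self.other(node)
            # Fail over if the peer is healthy; otherwise the SG is down everywhere.
            self.active = None if peer in self.faulted else peer

    def restore(self, node: str) -> None:
        """The service or the power comes back on `node`."""
        self.faulted.discard(node)
        # Assumption taken from the post: the SG always fails back to the primary.
        if node == "primary" or self.active is None:
            self.active = node

    def switch(self, target: str) -> None:
        """Clean, operator-requested switch (steps 4 and 5)."""
        if target not in self.faulted:
            self.active = target


def run_post_steps() -> List[Tuple[str, Optional[str]]]:
    """Replay the non-destructive steps and record where the SG ends up."""
    steps = [
        ("1", "fail", "primary"),
        ("2", "restore", "primary"),
        ("3a", "fail", "secondary"),
        ("3b", "restore", "secondary"),
        ("4", "switch", "secondary"),
        ("5", "switch", "primary"),
        ("9", "fail", "primary"),
        ("10", "restore", "primary"),
        ("11a", "fail", "secondary"),
        ("11b", "restore", "secondary"),
    ]
    sg = ServiceGroup()
    results = []
    for label, action, node in steps:
        getattr(sg, action)(node)   # call sg.fail / sg.restore / sg.switch
        results.append((label, sg.active))
    return results


if __name__ == "__main__":
    for label, active in run_post_steps():
        print(f"step {label}: SG online on {active}")
```

Steps 6 through 8 (the heartbeat cables) are left out on purpose, since their expected result is "no failover" and the model would have nothing to do. Steps 12 through 17 are left out for reasons best discussed with your manager ;)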

Cheers,

Mike