A happy Monday to everyone and here's hoping you had a halfway-decent weekend. There's nothing quite like 2 days of misery leading up to a Monday, which is why I like to do something fun every weekend that I can. This weekend I slept a lot. It was a blast. I'm awake now, but I'm getting right back up on that horse as soon as I'm done proofing this post ;)
If you work on Veritas Cluster Server a lot (VCS from here on out ;), or even if you just work on it a little (or sit next to someone who does), you probably know how incredibly annoying it can be to make fixes once you've got a cluster up and online. Clusters are, almost by definition, expected to provide high availability for whatever twisted, perverse software you run on them, and "high availability" almost always translates into "high accountability." Even though a perfectly constructed cluster (of however many individual nodes) is supposed to "make your life easier," I've found that this is hardly ever the case. People tend to go ape$h1t whenever a cluster node experiences an issue, even when it's something as mundane as a flip-flop between NICs on a single node that share a local floating IP, which is just one of many virtual IPs the entire cluster floats its own virtual IP on top of. If you work on clusters, you're statistically likely to be either incredibly apathetic about these sorts of issues, or a ring-tone away from a total nervous breakdown ;)
In the event that you do have to work on an active Veritas cluster (whether on Linux, Unix, or even Windows), there are a few ways you can keep those alarm bells (that you have no choice but to trigger in order to get your work done) from reaching their intended recipients. The optimal end result, of course, is that you get your job done in a timely and efficient manner and your co-workers (who aren't in your department and don't need to know - HUGE ASSUMPTION ;) don't lose any sleep worrying about the money the company is conceivably losing while nothing really "bad" is happening.
We'll run down this short list in order, from "best practice" to crude severing of email and other notification avenues and/or faking "upness." You may need the "last resort" items at the end of this list no matter how good your setup is. If you're working in an environment where outside agents (HP OpenView, etc.) report on your cluster's condition, turning off notification the proper way may not do you any good at all, anyway ;)
Prior to making any changes to your main.cf (assuming you're changing it with the intention of eventually restoring it to its exact same state), I would recommend either saving off your main.cf file with a simple copy:
host # cp main.cf main.cf.`hostname`.`date +%m%d%y%H%M%S`
or creating a command file from your cf file, so that you can just rebuild it again if anything terrible happens, like so:
host # hacf -cftocmd . <-- assuming you're in the main Veritas configuration directory. If you're not, and even if you are, you can pass the hacf command the full path to your configuration directory, if you prefer:
host # hacf -cftocmd /etc/VRTSvcs/conf/config <-- The same thing as the previous command, in most situations.
You can then use the .cmd file to recreate your .cf file using the "-cmdtocf" argument, with the rest of the hacf command generally remaining the same.
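Putting those two safety nets together, here's a quick sketch. The default config path is an assumption (it's the usual spot on Linux/Solaris); adjust CONF if yours lives elsewhere:

```shell
#!/bin/sh
# Sketch: back up main.cf, then round-trip it through hacf.
# CONF assumes the stock VCS location; change it to suit your install.
CONF=${CONF:-/etc/VRTSvcs/conf/config}
STAMP=$(hostname).$(date +%m%d%y%H%M%S)

if [ ! -f "$CONF/main.cf" ]; then
    echo "no main.cf under $CONF - nothing to back up" >&2
    exit 0
fi

cp "$CONF/main.cf" "$CONF/main.cf.$STAMP"   # timestamped safety copy
hacf -cftocmd "$CONF"                       # main.cf -> main.cmd
# If disaster strikes later, rebuild main.cf from the command file:
# hacf -cmdtocf "$CONF"                     # main.cmd -> main.cf
```

On a box without VCS installed (or without a main.cf) the sketch just prints a note and exits, so it's safe to dry-run.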
1. The first thing you would "normally" do (on the primary node in your cluster), would be to make sure that VCS's notification system redirects its error messages to an alternate source (like your email instead of everyone else's). So, assuming you have the NotifierMngr resource set up to mail errors to firstname.lastname@example.org, you can easily modify this without alarming anyone, like so:
host # haconf -makerw
host # hares -modify NotifierMngr SmtpRecipients "email@example.com" Error
host # haconf -dump -makero
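If you want to be able to put everything back exactly the way it was, capture the old recipient list before you clobber it. A minimal sketch, carrying over the hypothetical "NotifierMngr" resource name and addresses from above (the guard at the top just keeps it from doing anything on a box without VCS):

```shell
#!/bin/sh
# Sketch: point NotifierMngr's mail at yourself, then restore the original.
# Resource name and addresses are examples, not gospel.
command -v hares >/dev/null 2>&1 || { echo "VCS tools not found; dry run"; exit 0; }

# Capture the current recipient list so the restore is exact
OLD=$(hares -value NotifierMngr SmtpRecipients)

haconf -makerw
hares -modify NotifierMngr SmtpRecipients "email@example.com" Error
haconf -dump -makero

# ...do the disruptive work, then put it back
# ($OLD is deliberately unquoted so the address/severity pairs split back out):
haconf -makerw
hares -modify NotifierMngr SmtpRecipients $OLD
haconf -dump -makero
```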
Of course, your setup may be more, or less, complicated, but that should give you the gist of it.
2. The next thing you should consider (still going by the VCS playbook) is that some of your resources may have specific "owners" assigned. These people will be notified of issues with certain resources, as well, and should be neutralized whether or not it actually makes a difference ;) You can do this in approximately the exact same way, like this:
host # haconf -makerw
host # hares -modify SOME_RESOURCE ResourceOwner "firstname.lastname@example.org"
host # haconf -dump -makero
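Since more than one resource may have an owner set, it can be worth sweeping the whole cluster rather than trusting your memory. A rough sketch (the loop and filtering are my own; hares and haconf are the real VCS commands):

```shell
#!/bin/sh
# Sketch: list every resource that has a ResourceOwner set, so none get missed.
command -v hares >/dev/null 2>&1 || { echo "VCS tools not found; dry run"; exit 0; }

# hares -list prints "resource system" pairs; dedupe to one line per resource
for res in $(hares -list | awk '{print $1}' | sort -u); do
    owner=$(hares -value "$res" ResourceOwner 2>/dev/null)
    [ -n "$owner" ] && echo "$res -> $owner"
    # To point one at yourself (wrapped in haconf -makerw / -dump -makero):
    # hares -modify "$res" ResourceOwner "firstname.lastname@example.org"
done
```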
3. Those first two steps will only insulate you from VCS's notification services. You still have to worry about the systems you're running on. This step can be a big headache (especially when you see what's coming up) and won't always do you any good; I'm including it for completeness' sake. The next thing you could do would be to disable sendmail/mail on your systems. This can be done a variety of ways, depending on what OS and version you're running, but (generally) this simple method should do the trick:
a. Determine the PIDs of your active sendmail/mailer daemons and kill them. Alternately, play it safe and shut them down with their respective shutdown scripts (not as much fun ;)
b. Move your sendmail/mail configuration files from their regular location to a different one (or just rename them) so that they can't restart if they try to.
c. Make sure nothing is running on your system (in cron, perhaps - cfengine, for instance) that will proactively "fix" the problem you've created and let mail start flowing off the servers again.
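Pulled together, steps a through c might look something like the sketch below. Paths and service names vary wildly by OS (these are common Solaris/Linux defaults, nothing more), and since it's destructive it refuses to do anything unless you pass it "really":

```shell
#!/bin/sh
# Sketch: take the mailer out of play (run as root, at your own risk).
# Refuses to act without an explicit "really" argument - safe to dry-run.
[ "$1" = "really" ] || { echo "dry run; pass 'really' to act"; exit 0; }

# a. Stop the daemon the polite way if a control script exists...
if [ -x /etc/init.d/sendmail ]; then
    /etc/init.d/sendmail stop
else
    # ...otherwise find and kill its PIDs (xargs -r is GNU; omit on Solaris)
    ps -ef | grep '[s]endmail' | awk '{print $2}' | xargs -r kill
fi

# b. Rename the config so a stray restart attempt goes nowhere
[ -f /etc/mail/sendmail.cf ] && mv /etc/mail/sendmail.cf /etc/mail/sendmail.cf.off

# c. Eyeball cron for anything (cfengine, etc.) that might undo a and b
crontab -l 2>/dev/null | grep -iE 'cfengine|sendmail' || true
```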
4. As noted, step 3 can be a hassle and may be a huge waste of time. Now you have to consider that outside (and inside) systems may also be monitoring your cluster's health. Programs like HP OpenView may have agents installed on your local systems, and other machines may be set up outside of your cluster to test its availability from alternate locations. Any combination of these may still trigger errors and alerts once you start faulting a VCS resource on purpose.
At this point you can go with what might be the best option of them all (combined with steps 1 and 2): Stop your cluster dead, properly, so that none of the resources show as faulted or are affected in any way. You can do this very simply by running the following on your cluster's primary node:
host # hastop -all -force
This will bring down your cluster so that you can work on the configuration file, etc., but will not take down any of the resources it manages (which should buy you some time, since none of them will report as "faulted"). There's still the outside chance that someone (or something) may be monitoring the "state" of your cluster externally. In that case, you're probably screwed and should let whoever's keeping tabs know what you intend to do, so that they'll feel more comfortable ignoring it ;) Keep in mind, of course, that when you bring your nodes back online, you should try to bring up the primary node first. This avoids any issues with your primary node doing a "remote build" of its configuration file from any secondary, tertiary, etc., nodes. Again, depending on the complexity of your setup, this may make very little difference.
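In command form, the stop/restart cycle looks roughly like this (hastart is run on the primary node first; the hastatus check is just a sanity step I'd add, not a requirement):

```shell
#!/bin/sh
# Sketch: stop the whole cluster without touching the apps, then restart
# with the primary first so it does a local (not remote) build of main.cf.
command -v hastop >/dev/null 2>&1 || { echo "VCS tools not found; dry run"; exit 0; }

hastop -all -force    # HAD exits on every node; managed resources keep running

# ...edit main.cf, do your maintenance, etc...

hastart               # run on the primary node first
hastatus -sum         # confirm its state before starting the remaining nodes
```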
5. (Bonus step - Not recommended) Some folks prefer to "freeze" the cluster when they want to do maintenance. This is perfectly acceptable practice, as it allows you to operate on resources within VCS service groups without risking taking down the cluster or causing undesired failover. It should be noted, however, that if you cause any of the basic functionality built into a resource to fail (which may happen if you're re-tooling a resource to any real degree), that resource will still show up in VCS as faulted until its basic functionality (as understood by VCS) is restored. That faulted state will most probably cause the NotifierMngr recipients and ResourceOwners to be notified of the error condition!
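For the record, a freeze/thaw cycle looks roughly like this ("mygroup" is a hypothetical service group name; -persistent keeps the freeze in place across HAD restarts):

```shell
#!/bin/sh
# Sketch: persistently freeze a service group before maintenance, then thaw it.
# "mygroup" is a made-up group name - substitute your own.
command -v hagrp >/dev/null 2>&1 || { echo "VCS tools not found; dry run"; exit 0; }

haconf -makerw
hagrp -freeze mygroup -persistent     # no failover while you work
haconf -dump -makero

# ...maintenance...

haconf -makerw
hagrp -unfreeze mygroup -persistent
haconf -dump -makero
```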
And, there you have it. I realize this was somewhat of a quick glossing over of the subject, but, as those of us who work on VCS have probably quite-painfully learned, trying to explain VCS (even at the most basic level) can become a huge task if you want to cover every possible aspect or angle. That's why so many 5000 page books are published every year on how to open the box and find the serial number ;)
Please note that this blog accepts comments via email only. See our Mission And Policy Statement for further details.