The Linux and Unix Menagerie: Correcting Auto-Disabled Service Groups In Veritas Cluster Server

Friday, February 13, 2009

Hey there,

Today's post is going to be fairly specific. I'm either tapped-out on creativity or I'm writing this post after working into the night, or both ;) Here's a little something I picked up today concerning VCS and service groups for either the Unix or Linux version.

If you've ever had a VCS service group problem and, every once in a while, gotten this error when you tried to correct it:

service group is auto-disabled in the cluster!

you may have had a nervous breakdown at some point in your life ;) Adding to the confusion, the default help output for the hagrp command seems to show only an autodisable option (either that, or I was really tired while I was troubleshooting).

Now, there are good reasons for your service group to go into this state. It doesn't just do it to piss you off ;) For instance, if the basic VCS service (or engine) isn't running on a particular node in your cluster, the group's AutoDisabled attribute gets set. That way, if the node comes back online unexpectedly (or someone unexpectedly brings it online while you're working on something else in the VCS config), it won't start contending for resources and cause a possibly more confusing situation. VCS will also do this if none of the resources in a service group have been probed (also generally indicative of a "bad bad failure") and if you lose all of your high priority heartbeats. Technically, I think it only does this when you only have disk heartbeats left, but I believe it also does it if you get down to having just one low priority link active and don't use disk heartbeats. (I could test this out, but then I might have to keep working until tomorrow morning. I'm at the point where senseless voodoo-thinking has taken over ;)

Luckily, it's a really simple situation to fix, if you know the service group shouldn't be disabled and/or you need to bring an autodisabled service group up. The command line that does that is:

host # hagrp -autoenable YOUR_SERVICE_GROUP_NAME -sys THE_NODE_YOU_WANT_TO_AUTOENABLE_THE_SG_ON
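If you'd rather not fire that off blindly, here's a minimal sketch of a wrapper that first asks VCS whether the group really is autodisabled on that node. The function names are made up for illustration, and the parsing assumes that "hagrp -display GROUP -attribute AutoDisabled -sys NODE" prints a header line followed by a "GROUP AutoDisabled NODE VALUE" line, so check that against the output on your own VCS version before trusting it:

```shell
#!/bin/sh
# Sketch only: is_autodisabled and autoenable_if_needed are hypothetical
# helper names, not part of VCS.

is_autodisabled() {
    # Assumes the value is the last column of the AutoDisabled line.
    # Succeeds (exit 0) only when that value is 1.
    hagrp -display "$1" -attribute AutoDisabled -sys "$2" |
        awk '$2 == "AutoDisabled" { v = $NF } END { exit (v == 1 ? 0 : 1) }'
}

autoenable_if_needed() {
    group=$1
    node=$2
    if is_autodisabled "$group" "$node"; then
        # Say exactly what is about to be run before running it.
        echo "Running: hagrp -autoenable $group -sys $node"
        hagrp -autoenable "$group" -sys "$node"
    else
        echo "$group is not autodisabled on $node; nothing to do"
    fi
}
```

Once those functions are loaded into your shell, you'd call it as "autoenable_if_needed YOUR_SERVICE_GROUP_NAME THE_NODE_YOU_WANT_TO_AUTOENABLE_THE_SG_ON".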

Another way to get around this is to manually probe all of the resources in that service group. It seems like a waste of time to me, but I'm glad the option exists, just in case the above command line fails to do the trick for one reason or another.

host # hares -probe RESOURCE_NAME -sys NODE_YOU_WANT_TO_PROBE_RESOURCE_ON

I find this can be made somewhat simpler by piping the output of "hares -display" to grep and awk and looping over the results. Be careful, though: if you don't add extra echo statements that list exactly which resources are being probed, any errors can end up making your life even more miserable!
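A rough sketch of that kind of loop might look like the following. The function name is made up, and the awk filter assumes that "hares -display -attribute Group" prints one line per resource in the form "RESOURCE Group global SERVICE_GROUP", which is worth confirming on your own cluster first:

```shell
#!/bin/sh
# Sketch only: probe_group_resources is a hypothetical helper name.

probe_group_resources() {
    group=$1
    node=$2
    # Assumes column 1 is the resource name and the last column is the
    # service group that the resource belongs to.
    hares -display -attribute Group |
        awk -v g="$group" '$2 == "Group" && $NF == g { print $1 }' |
        while read -r res; do
            # Echo each resource so any error can be tied back to it.
            echo "Probing resource $res on $node"
            hares -probe "$res" -sys "$node" </dev/null
        done
}
```

You'd run it as "probe_group_resources YOUR_SERVICE_GROUP_NAME NODE_YOU_WANT_TO_PROBE_RESOURCE_ON", and the echo lines give you a running record of which resource each probe (and each error) belongs to.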

And, as my exit nears, another way to do this is to simply use your /etc/VRTSvcs/conf/config/main.cf (assuming it has service groups in it that aren't autodisabled) and get the command line (same as above) by using hacf:

host # hacf -cftocmd /etc/VRTSvcs/conf/config/

This will produce a file named "main.cmd", and you can then just "grep -i able" out of it. You should have at least one line in there that prints out the exact command you need to run (sometimes with your specific service group and system name) to unset that flag (or attribute).
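If main.cmd is big, that grep can return a fair amount of noise. Here's a small sketch that narrows it down to lines that also mention one service group as a whole word. The function name is invented for this example, and it makes no assumption about what your main.cmd actually contains beyond it being a plain text file of commands:

```shell
#!/bin/sh
# Sketch only: find_able_lines is a hypothetical helper name.

find_able_lines() {
    cmdfile=$1
    group=$2
    # Same idea as "grep -i able", but keep a line only if one of its
    # whitespace-separated fields is exactly the service group name.
    awk -v g="$group" '
        tolower($0) ~ /able/ {
            for (i = 1; i <= NF; i++)
                if ($i == g) { print; break }
        }' "$cmdfile"
}
```

You'd run it as "find_able_lines /etc/VRTSvcs/conf/config/main.cmd YOUR_SERVICE_GROUP_NAME" and then read through whatever comes back before running any of it.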

Here's to a happy Valentine's day, tomorrow, for almost all of you ;)

Cheers,

Mike