Monday, November 19, 2007

Recovering Faulted VCS Resources In An Indefinitely Hung State!

If you've ever had a Veritas Cluster Server (VCS) resource/service group reporting faulted, you may have run into the specific situation described here. Today we're going to focus on one condition where the most common wisdom I've seen on the web is to do an "hastop -local -force" and follow up with an "hastart" to solve the problem. Common sense dictates that this is the last thing you'll want to do, unless you're in a testing environment, as you'll be taking VCS down on that node (and giving up cluster control of everything running there) in the process.
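
For the record, that common advice usually amounts to the two commands below, run on the node reporting the fault. I'm only sketching it here so it's clear what we're trying to avoid; on a production box this is a much bigger hammer than the problem calls for:

hastop -local -force --> stops the VCS engine on this node; the "-force" leaves the applications it manages running.
hastart --> restarts the engine, which then re-probes resources and rebuilds its view of their states.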

The reason the stop/start answer is the most common is that the resource/service group's "resource" is waiting to go OFFLINE while the resource/service "group" is trying to start (go ONLINE), so simply attempting to "online" or "offline" either one only results in an indefinite hang. Each entity's successful action depends on the other entity's failure, and VCS won't fail an operation if it can wait instead.

You'll know you're looking at this sort of error if you see the following (this can generally be found by running "hasum" or "hastatus -summary"; there's a one-liner for narrowing it down right after the list):

A. The resource/service "group" will be showing in a "STARTING|PARTIAL" state.
B. An individual "resource" within that resource/service "group" will be showing in the "W_OFFLINE" state.
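
If you're staring at a busy cluster summary and don't yet know which group is involved, a quick way to pull out both of those symptoms at once is something along the lines of:

hastatus -summary | egrep 'PARTIAL|W_OFFLINE'

Any group and resource that show up together, on the same system, are candidates for the procedure below.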

The following steps to resolution are certainly not the only way to get everything back to an "up" state, and they also assume that there is nothing really "wrong" with the individual resource or the resource/service group. That sort of troubleshooting is outside the scope of this blog-sized how-to.

So, again, our assumptions here are:
1. An Oracle database resource/service group, named "oracledb_sg," has faulted on the server "vcsclusterserver1.xyz.com."
2. An individual resource, a member of the resource/service group "oracledb_sg," named "oracledb_oracle_u11192007," is really the only thing that's failed, or shows as failing in "oracledb_sg."
3. There is actually nothing wrong with the resource or the resource/service group. Somehow, while flipping service groups back and forth between machines, somebody made an error in execution, or VCS ran into a state problem of its own making ("split brain" or some similar condition).
4. Note that we've reached these assumptions based partly on the fact that the resource is waiting to go OFFLINE, and the resource/service group is waiting to go ONLINE (stuck in STARTING|PARTIAL), on the same server! (There's a quick way to confirm the group side of this right after the list.)
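
As a quick sanity check on that group side of things, you can ask VCS for the group's state directly. This is just a sketch using our example names; on a group stuck like this, it should come back showing the STARTING and PARTIAL flags:

hagrp -state oracledb_sg -sys vcsclusterserver1.xyz.com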

And the following are the steps we could take to resolve this issue and get on with our lives:

1. First, get a summary of the resource/service group that you've been alerted has failed or faulted, like this:

hastatus -summary|grep oracledb_sg|grep vcsclusterserver1 (or some variation thereof, depending on how much information you want to get back)

B oracledb_sg vcsclusterserver1.xyz.com Y N STARTING|PARTIAL
C oracledb_sg Oracle oracledb_oracle_u11192007 vcsclusterserver1.xyz.com W_OFFLINE


2. If you want to be doubly sure, check that you get the same resource ID (oracledb_oracle_u11192007) from the "hares" command, and that it reports the same status:

hares -display|grep oracledb|grep "waiting to" --> to get the resource id a second way.
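
If you'd rather not grep through the entire resource listing, "hares -display" will also take the resource name (and system) directly. A narrower version of the same check, using our example names, would look something like:

hares -display oracledb_oracle_u11192007 -sys vcsclusterserver1.xyz.com | grep -i state --> the "waiting to" condition shows up in the resource's internal state.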

3. At this point you still can't "online" anything until you "clear" the stuck resource (of its W_OFFLINE state flag); once it's cleared, you can bring the group back online:

hares -clear oracledb_oracle_u11192007 -sys vcsclusterserver1.xyz.com
hagrp -online oracledb_sg -sys vcsclusterserver1.xyz.com
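
To confirm everything took, you can re-run the same summary check from step 1 (again, using our example names):

hastatus -summary|grep oracledb_sg|grep vcsclusterserver1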


Now your resource/service group should be back in the straight ONLINE state, and you shouldn't see any messages (in "hasum" or "hastatus -summary" output) regarding the individual resource. Time to relax :)

, Mike