Friday, November 14, 2008

Basic Veritas Cluster Server Troubleshooting

Hey There,

For the end of the week, we're going to continue with the theme of sparse-but-hopefully-useful information. Quick little "crib sheets" (preceded by paragraphs and paragraphs of stilted ramblings by the lunatic who pens this blog's content ;) For this Friday, we're going to come back around and take a look at Veritas Cluster Server (VCS) troubleshooting. If you're interested in more specific examples of problems, solutions and suggestions with regard to VCS, check out all the VCS related posts from the past year or so. Hopefully you'll be able to find something useful in our archives, as well. These simple suggestions should work equally well on Unix and Linux, if you choose to go the VCS route rather than some less costly one :)

And, here we go again; quick, pointed bullets of info. Bite-sized bits of troubleshooting advice that focus on solving the problem, rather than understanding it. That sounds awful, I know, but, sometimes, you have to get things done and, let's face it, if it's the job or your arse, who cares about the why? Leave that for philosophers and academics. Plus, since you fix problems so fast, you'll have plenty of time to read up on the ramifications of your actions later ;)

The setup: Your site is down. It's a small cluster configuration with only two nodes and redundant NICs, attached network disk, etc. All you know is that the problem is with VCS (although it's probably indirectly due to a hardware issue). Something has gone wrong with VCS and it's, obviously, not responding correctly to whatever terrible accident of nature has occurred. You don't have much more to go on than that. The person you receive your briefing from thinks the entire clustered server setup (hardware, software, cabling, power, etc) is a bookmark in IE ;)

Now, one by one, in a fashion that zigs on purpose, but has a tendency to zag, here are a few things to look at right off the bat when assessing a situation like this one. Perhaps next week, we'll look into more advanced troubleshooting (and, of course, you can find lots of specific "weird VCS problem" solutions in our VCS archives).

1. Check if the cluster is working at all.

Log into one of the cluster nodes as root (or a user with equivalent privilege - who shouldn't exist ;) and run

host1 # hastatus -summary

or

host1 # hasum <-- both do the same thing, basically

Ex:

host1 # hastatus -summary

-- SYSTEM STATE
-- System State Frozen

A host1 RUNNING 0
A host2 RUNNING 0

-- GROUP STATE
-- Group System Probed AutoDisabled State

B ClusterService host1 Y N OFFLINE
B ClusterService host2 Y N ONLINE
B SG_NIC host1 Y N ONLINE
B SG_NIC host2 Y N OFFLINE
B SG_ONE host1 Y N ONLINE
B SG_ONE host2 Y N OFFLINE
B SG_TWO host1 Y N OFFLINE
B SG_TWO host2 Y N OFFLINE


Clearly, your situation is bad: A normal VCS status should indicate that all nodes in the cluster are “RUNNING” (which these are). However, it should also show all service groups as being ONLINE on at least one of the nodes, which isn't the case above with SG_TWO (Service Group 2).
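If you'd rather not scan that listing by eye at 3 a.m., the check described above can be sketched in a few lines of shell. This is only a rough sketch: groups_not_online is a made-up helper name (it is not a VCS command), and it assumes the group lines look exactly like the "B group system probed autodisabled state" lines in the example output. Anything other than a plain ONLINE state is treated as "not online".

```shell
#!/bin/sh
# groups_not_online is a hypothetical helper, not part of VCS.
# It reads "hastatus -summary" output on stdin and prints every service
# group that is not ONLINE on any system.
# Assumed line layout: B <group> <system> <probed> <autodisabled> <state>
groups_not_online() {
    awk '
        $1 == "B" {
            # Remember each group once, in the order it first appears.
            if (!($2 in seen)) { seen[$2] = 1; order[++n] = $2 }
            # Note the groups that are ONLINE on at least one system.
            if ($6 == "ONLINE") online[$2] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in online)) print order[i]
        }
    '
}

# Usage on a cluster node (as root):
#   hastatus -summary | groups_not_online
```

Fed the example output above, this should single out SG_TWO, since it is OFFLINE on both host1 and host2.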

2. Check for cluster communication problems. Here we want to determine if a service group is failing because of any heartbeat failure (The VCS cluster, that is, not another administrator ;)

Check on GAB first, by running:

host1 # gabconfig -a

Ex:

host1 # gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 3a1501 membership 01
Port h gen 3a1505 membership 01


This output is okay. You would know you had a problem at this point if any of the following conditions were true:

If no port "a" memberships were present (above, the "01" shows nodes 0 and 1 as members), this could indicate a problem with gab or llt (looked at next).

If no port "h" memberships were present (again, nodes 0 and 1 above), this could indicate a problem with had.

If starting llt causes it to stop immediately, check your heartbeat cabling and llt setup.
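The port "a" and port "h" checks above can be roughed out as a small filter. Treat this as a sketch under assumptions: check_gab_ports is a name invented for this post, and it expects membership lines shaped like the ones in the example ("Port a gen 3a1501 membership 01"). It only tells you whether a port has any membership at all, not whether every node is in it.

```shell
#!/bin/sh
# check_gab_ports is a hypothetical helper, not part of VCS.
# It reads "gabconfig -a" output on stdin, complains about a missing
# port a or port h membership, and returns non-zero if either is absent.
check_gab_ports() {
    awk '
        # Example line: Port a gen 3a1501 membership 01
        $1 == "Port" && $5 == "membership" && $6 != "" { have[$2] = 1 }
        END {
            status = 0
            if (!("a" in have)) { print "port a missing: check gab/llt"; status = 1 }
            if (!("h" in have)) { print "port h missing: check had"; status = 1 }
            exit status
        }
    '
}

# Usage on a cluster node (as root):
#   gabconfig -a | check_gab_ports
```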

Try starting gab, if it's down, with:

host1 # /etc/init.d/gab start

If you're running the command on a node that isn't operational, gab won't be seeded, which means you'll need to force it, like so:

host1 # /sbin/gabconfig -x

3. Check on LLT, now, since there may be something wrong there (even though it wasn't indicated above)

LLT will most obviously present as a crucial part of the problem if your "hastatus -summary" gives you a message that it "can't connect to the server." This will prompt you to check all cluster communication mechanisms (some of which we've already covered).

First, bang out a quick:

host1 # lltconfig

on the command line to see if llt is running at all.

If llt isn't running, be sure to check your console and system messages files (syslog, possibly messages, and any logs in /var/log/VRTSvcs/... - usually the "engine log" is worth a quick look). As a rule, I usually do

host1 # ls -tr

when I'm in the VCS log directory to see which log got written to last, and work backward from there. This puts the most recently updated file last in the listing. My assumption is that any pertinent errors got written to one of the fresher log files :) Look in these logs for any messages about bad llt configurations or files, such as /etc/llttab, /etc/llthosts and /etc/VRTSvcs/conf/sysname. Also, make sure those three files contain valid entries that "match" <-- This is very important. If you refer to the same facility by 3 different names, even though they all point back to the same IP, VCS can become addled and drop the ball.
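If you'd like the freshest log handed to you instead of reading the bottom of the listing, here's a minimal sketch of the same trick. newest_log is a made-up name, and the directory you give it is whatever your VCS log directory actually is. It uses "ls -t" (newest first) rather than "ls -tr", so the first line is the one you want.

```shell
#!/bin/sh
# newest_log is a hypothetical helper, not something VCS ships.
# It prints the name of the most recently modified file in a directory
# (ls -t sorts newest first, so the first line is the freshest file).
newest_log() {
    ls -t "$1" | head -n 1
}

# Usage, from a shell on the node:
#   newest_log /path/to/your/vcs/log/directory
```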

Examples of invalid entries in LLT config files would include "node numbers" outside the range of 0 to 31 and "cluster numbers" outside the range of 0 to 255.
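Those two range rules are easy to check mechanically. The sketch below assumes the file layouts shown in this post's examples: "number hostname" lines in the llthosts file, and a "set-cluster number" line in the llttab file. check_node_numbers and check_cluster_number are names invented here, and the limits (0 to 31 and 0 to 255) are simply the ones quoted above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of VCS. Both return non-zero on a bad entry.

# check_node_numbers FILE: every node number in an llthosts-style file
# ("<number> <hostname>" per line) must be an integer from 0 to 31.
check_node_numbers() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }   # skip blank lines and comments
        $1 !~ /^[0-9]+$/ || $1 + 0 > 31 { print "bad node number: " $0; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# check_cluster_number FILE: the "set-cluster <number>" line in an
# llttab-style file must carry an integer from 0 to 255.
check_cluster_number() {
    awk '
        $1 == "set-cluster" && ($2 !~ /^[0-9]+$/ || $2 + 0 > 255) {
            print "bad cluster number: " $2; bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage:
#   check_node_numbers /etc/llthosts && check_cluster_number /etc/llttab
```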

Now, if LLT "is" running, check its status, like so:

host1 # lltstat -vvn <-- This will let you know if llt on the separate nodes within the cluster can communicate with one another.

Of course, verify physical connections, as well. Also, see our previous post on dlpiping for more low-level-connection VCS troubleshooting tips.

Ex:

host1 # lltstat -vvn
LLT node information:
Node State Link Status Address
0 prsbn012 OPEN
ce0 DOWN
ce1 DOWN
HB172.1 UP 00:03:BA:9D:57:91
HB172.2 UP 00:03:BA:0E:F1:DE
HB173.1 UP 00:03:BA:9D:57:92
HB173.2 UP 00:03:BA:0E:D0:BE
1 prsbn015 OPEN
ce3 UP 00:03:BA:0E:CE:09
ce5 UP 00:03:BA:0E:F4:6B
HB172.1 UP 00:03:BA:9D:5C:69
HB172.2 UP 00:03:BA:0E:CE:08
HB173.1 UP 00:03:BA:0E:F4:6A
HB173.2 UP 00:03:BA:9D:5C:6A
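With more than a couple of nodes, that listing gets long, so here is a rough filter that prints only the links reported DOWN, along with the node they belong to. It's a sketch based on the layout shown above: a node line ("number name state", possibly preceded by a "*" marker, which is an assumption on my part) followed by "link status address" lines. llt_down_links is a made-up name.

```shell
#!/bin/sh
# llt_down_links is a hypothetical helper, not part of VCS.
# It reads "lltstat -vvn" output on stdin and prints "<node>: <link>"
# for every link whose status is DOWN.
llt_down_links() {
    awk '
        # Node line with a leading "*" marker (assumed): * <num> <name> <state>
        $1 == "*" { node = $3; next }
        # Plain node line: <num> <name> <state>
        $1 ~ /^[0-9]+$/ && NF >= 3 { node = $2; next }
        # Link line: <link> <status> [<address>]
        $2 == "DOWN" { print node ": " $1 }
    '
}

# Usage on a cluster node (as root):
#   lltstat -vvn | llt_down_links
```

Run against the example above, it should flag ce0 and ce1 on prsbn012 and nothing on prsbn015.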


host1 # cat /etc/llttab <-- pardon the lack of low-pri links. We had to build this cluster on the cheap ;)

set-node /etc/VRTSvcs/conf/sysname
set-cluster 100
link ce0 /dev/ce:0 - ether 0x1051 -
link ce1 /dev/ce:1 - ether 0x1052 -
exclude 7-31
host1 # cat /etc/llthosts
0 host1
1 host2
host1 # cat /etc/VRTSvcs/conf/sysname
host1
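Since the "matching names" business bites so often, here's a minimal sketch of one such cross-check: is the name in the sysname file actually listed in the llthosts file? check_sysname is a name invented for this post, and it assumes the layouts shown above (a single hostname in sysname, "number hostname" lines in llthosts). It does not look at llttab, which in this example only points back at the sysname file anyway.

```shell
#!/bin/sh
# check_sysname is a hypothetical helper, not part of VCS.
# Usage: check_sysname SYSNAME_FILE LLTHOSTS_FILE
# Succeeds if the name in SYSNAME_FILE appears as a hostname (second
# field) in LLTHOSTS_FILE; otherwise reports a mismatch and returns 1.
check_sysname() {
    # First line of the sysname file, with any stray whitespace removed.
    name=$(head -n 1 "$1" | tr -d '[:space:]')
    if awk -v name="$name" '$2 == name { found = 1 } END { exit (found ? 0 : 1) }' "$2"; then
        echo "ok: $name is listed in $2"
    else
        echo "mismatch: $name is not listed in $2"
        return 1
    fi
}

# Usage on a cluster node:
#   check_sysname /etc/VRTSvcs/conf/sysname /etc/llthosts
```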


If llt is down, or you think it might be the problem, either start it or restart it with:

host1 # /etc/init.d/llt.rc start

or

host1 # /etc/init.d/llt.rc stop
host1 # /etc/init.d/llt.rc start


And, that's where we'll end it today. There's still a lot more to cover (we haven't even given the logs more than their minimum due), but that's for next week.

Until then, have a pleasant and relaxing weekend :)

Cheers,

Mike



