
Friday, November 14, 2008

Basic Veritas Cluster Server Troubleshooting

Hey There,

For the end of the week, we're going to continue with the theme of sparse-but-hopefully-useful information. Quick little "crib sheets" (preceded by paragraphs and paragraphs of stilted ramblings by the lunatic who pens this blog's content ;) For this Friday, we're going to come back around and take a look at Veritas Cluster Server (VCS) troubleshooting. If you're interested in more specific examples of problems, solutions and suggestions with regard to VCS, check out all the VCS-related posts from the past year or so. Hopefully you'll be able to find something useful in our archives, as well. These simple suggestions should work equally well for Unix as well as Linux, if you choose to go the VCS route rather than some less costly one :)

And, here we go again; quick, pointed bullets of info. Bite-sized bits of troubleshooting advice that focus on solving the problem, rather than understanding it. That sounds awful, I know, but, sometimes, you have to get things done and, let's face it, if it's the job or your arse, who cares about the why? Leave that for philosophers and academics. Plus, since you fix problems so fast, you'll have plenty of time to read up on the ramifications of your actions later ;)

The setup: Your site is down. It's a small cluster configuration with only two nodes and redundant NICs, attached network disk, etc. All you know is that the problem is with VCS (although it's probably indirectly due to a hardware issue). Something has gone wrong with VCS and it's, obviously, not responding correctly to whatever terrible accident of nature has occurred. You don't have much more to go on than that. The person you receive your briefing from thinks the entire clustered server setup (hardware, software, cabling, power, etc) is a bookmark in IE ;)

Now, one by one, in a fashion that zigs on purpose, but has a tendency to zag, here are a few things to look at right off the bat when assessing a situation like this one. Perhaps next week, we'll look into more advanced troubleshooting (and, of course, you can find lots of specific "weird VCS problem" solutions in our VCS archives).

1. Check if the cluster is working at all.

Log into one of the cluster nodes as root (or a user with equivalent privilege - who shouldn't exist ;) and run

host1 # hastatus -summary

or

host1 # hasum <-- both do the same thing, basically

Ex:

host1 # hastatus -summary

-- SYSTEM STATE
-- System               State         Frozen

A  host1                RUNNING       0
A  host2                RUNNING       0

-- GROUP STATE
-- Group            System    Probed    AutoDisabled    State

B  ClusterService   host1     Y         N               OFFLINE
B  ClusterService   host2     Y         N               ONLINE
B  SG_NIC           host1     Y         N               ONLINE
B  SG_NIC           host2     Y         N               OFFLINE
B  SG_ONE           host1     Y         N               ONLINE
B  SG_ONE           host2     Y         N               OFFLINE
B  SG_TWO           host1     Y         N               OFFLINE
B  SG_TWO           host2     Y         N               OFFLINE


Clearly, your situation is bad: A normal VCS status should indicate that all nodes in the cluster are “RUNNING” (which these are). However, it should also show all service groups as being ONLINE on at least one of the nodes, which isn't the case above with SG_TWO (Service Group 2).
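
If you want a little more detail on that dead service group before you start poking at it, the hagrp command will usually tell you whether VCS thinks it's FAULTED somewhere or just plain OFFLINE. A quick, hedged example (the group and system names are just the ones from the output above, and you'd only clear and online the group once you've actually fixed whatever faulted it):

host1 # hagrp -state SG_TWO
host1 # hagrp -display SG_TWO | grep -i fault
host1 # hagrp -clear SG_TWO -sys host1
host1 # hagrp -online SG_TWO -sys host1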

2. Check for cluster communication problems. Here we want to determine if a service group is failing because of any heartbeat failure (The VCS cluster, that is, not another administrator ;)

Check on GAB first, by running:

host1 # gabconfig -a

Ex:

host1 # gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 3a1501 membership 01
Port h gen 3a1505 membership 01


This output is okay. You would know you had a problem at this point if any of the following conditions were true:

If no port "a" memberships were present (the 0 and 1 above), this could indicate a problem with GAB or LLT (which we'll look at next).

If no port "h" memberships were present (again, the 0 and 1 above), this could indicate a problem with had (there's a sample of what that can look like just below).

If LLT starts but then stops again almost immediately, check your heartbeat cabling and your LLT setup.
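
For contrast with the healthy output above, here's roughly what a sick node might show (a hedged, made-up example; the gen number means nothing, but note the missing port "h" line and the lonely membership of 0 - this node can't see its partner and had isn't registered):

host1 # gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 3a1501 membership 0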

Try starting gab, if it's down, with:

host1 # /etc/init.d/gab start

If not all of the cluster nodes are up and communicating over their heartbeats, GAB won't be able to seed automatically, which means you'll need to force it, like so:

host1 # /sbin/gabconfig -x
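
You can sanity-check the seeding with "gabconfig -a" before and after forcing it. When GAB isn't seeded you'll typically just get the header back with no port lines at all; after the forced seed, port "a" should show up with this node in the membership (a hedged sketch, again with a meaningless gen number):

host1 # gabconfig -a
GAB Port Memberships
===============================================================
host1 # /sbin/gabconfig -x
host1 # gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 3a1501 membership 0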

3. Check on LLT, now, since there may be something wrong there (even though it wasn't indicated above)

LLT will most obviously present as a crucial part of the problem if your "hastatus -summary" gives you a message that it "can't connect to the server." This will prompt you to check all cluster communication mechanisms (some of which we've already covered).

First, bang out a quick:

host1 # lltconfig

on the command line to see if llt is running at all.
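
If it is running, you'll just get a short confirmation back (the exact wording can vary a little between versions, so treat this as a hedged example):

host1 # lltconfig
LLT is running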

If llt isn't running, be sure to check your console, your system messages file (syslog, and possibly messages) and any logs in /var/log/VRTSvcs/... (usually the "engine log" is worth a quick look). As a rule, I usually do

host1 # ls -tr

when I'm in the VCS log directory to see which log got written to last, and work backward from there. This puts the most recently updated file last in the listing. My assumption is that any pertinent errors got written to one of the fresher log files :) Look in these logs for any messages about bad llt configurations or files, such as /etc/llttab, /etc/llthosts and /etc/VRTSvcs/conf/sysname. Also, make sure those three files contain valid entries that "match" <-- This is very important. If you refer to the same host by 3 different names, even though they all point back to the same IP, VCS can become addled and drop the ball.
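
In practice, that log check usually looks something like this (a hedged sketch; engine_A.log is the usual name for the engine log, but your exact log directory layout may differ a bit depending on version):

host1 # cd /var/log/VRTSvcs
host1 # ls -tr
host1 # grep -i -e llt -e gab engine_A.log | tail -20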

Examples of invalid entries in LLT config files would include "node numbers" outside the range of 0 to 31 and "cluster numbers" outside the range of 0 to 255.

Now, if LLT "is" running, check its status, like so:

host1 # lltstat -vvn <-- This will let you know whether llt on the separate nodes within the cluster can communicate with one another.

Of course, verify physical connections, as well. Also, see our previous post on dlpiping for more low-level-connection VCS troubleshooting tips.

Ex:

host1 # lltstat -vvn
LLT node information:
    Node         State    Link      Status    Address
     0 prsbn012  OPEN
                          ce0       DOWN
                          ce1       DOWN
                          HB172.1   UP        00:03:BA:9D:57:91
                          HB172.2   UP        00:03:BA:0E:F1:DE
                          HB173.1   UP        00:03:BA:9D:57:92
                          HB173.2   UP        00:03:BA:0E:D0:BE
     1 prsbn015  OPEN
                          ce3       UP        00:03:BA:0E:CE:09
                          ce5       UP        00:03:BA:0E:F4:6B
                          HB172.1   UP        00:03:BA:9D:5C:69
                          HB172.2   UP        00:03:BA:0E:CE:08
                          HB173.1   UP        00:03:BA:0E:F4:6A
                          HB173.2   UP        00:03:BA:9D:5C:6A


host1 # cat /etc/llttab <-- pardon the lack of low-pri links. We had to build this cluster on the cheap ;)

set-node /etc/VRTSvcs/conf/sysname
set-cluster 100
link ce0 /dev/ce:0 - ether 0x1051 -
link ce1 /dev/ce:1 - ether 0x1052 -
exclude 7-31
host1 # cat /etc/llthosts
0 host1
1 host2
host1 # cat /etc/VRTSvcs/conf/sysname
host1


If llt is down, or you think it might be the problem, either start it or restart it with:

host1 # /etc/init.d/llt.rc start

or

host1 # /etc/init.d/llt.rc stop
host1 # /etc/init.d/llt.rc start
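
One caveat: llt generally won't shut down cleanly while gab and had are still sitting on top of it, so if you end up having to bounce it, the safest order is roughly top-down and then back up again. Here's a hedged sketch, assuming you can afford to stop VCS on that node ("hastop -local -force" stops had but leaves the applications themselves running):

host1 # hastop -local -force
host1 # /etc/init.d/gab stop
host1 # /etc/init.d/llt.rc stop
host1 # /etc/init.d/llt.rc start
host1 # /etc/init.d/gab start
host1 # hastart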


And, that's where we'll end it today. There's still a lot more to cover (we haven't even given the logs more than their minimum due), but that's for next week.

Until then, have a pleasant and relaxing weekend :)

Cheers,

Mike





Friday, May 30, 2008

Troubleshooting Veritas Cluster Server LLT Issues On Linux and Unix

Hey There,

Today's post is going to steer away from the Linux and/or Unix Operating Systems just slightly, and look at a problem a lot of folks run into, but have problems diagnosing, when they first set up a Veritas cluster.

Our only assumptions for this post are that Veritas Cluster Server is installed correctly on a two-node farm, everything is set up to failover and switch correctly in the software and no useful information can be obtained via the standard Veritas status commands (or, in other words, the software thinks everything's fine, yet it's reporting that it's not working correctly ;)

Generally, with issues like this one (the software being unable to diagnose its own condition), the best place to start is at the lowest level. So, we'll add the fact that the physical network cabling and connections have been checked to our list of assumptions.

Our next step would be to take a look at the next layer up on the protocol stack, which would be the LLT (low latency transport protocol) layer (which, coincidentally, shares the same level as the MAC, so you may see it referred to, elsewhere, as MAC/LLT, or just MAC, when LLT is actually meant!) This is the base layer at which Veritas controls how it sends its heartbeat signals.

The layer-2 LLT protocol is most commonly associated with the DLPI (all these initials... man. These stand for the Data Link Provider Interface). Which brings us around to the point of this post ;)

Veritas Cluster Server comes with a utility called "dlpiping" that will specifically test device-to-device (basically NIC-to-NIC or MAC-to-MAC) communication at the LLT layer. Note that if you can't find the dlpiping command, it comes standard as a component in the VRTSllt package and is generally placed in /opt/VRTSllt/ by default. If you want to use it without having to type the entire command, you can just add that directory to your PATH environment variable by typing:

host # PATH=$PATH:/opt/VRTSllt;export PATH
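
If you're not even sure the package is on the box, a quick check along these lines will tell you (hedged; pkginfo is the Solaris flavor, rpm the Linux one):

host # pkginfo -l VRTSllt <--- Solaris
host # rpm -q VRTSllt <--- Linux
host # ls /opt/VRTSllt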

In order to use dlpiping to troubleshoot this issue, you'll need to set up a dlpiping server on at least one node in the cluster. Since we only have two nodes in our imaginary cluster, having it on only one node should be perfect.

To set up the dlpiping server on either node, type the following at the command prompt (unless otherwise noted, all of these Veritas-specific commands are in /opt/VRTSllt and all system information returned, by way of example here, is intentionally bogus):

host # getmac /dev/ce:0 <--- This will give us the MAC address of the NIC we want to set the server up on (ce0, in this instance). For this command, even if your device is actually named ce0, eth0, etc, you need to specify it as "device:instance"
/dev/ce:0 00:00:00:FF:FF:FF

Next, you just have to start it up and configure it slightly, like so (Easy peasy; you're done :)

host # dlpiping -s /dev/ce:0

This command runs in the foreground by default. You can background it if you like, but once you start it running on whichever node you start it on, you're better off leaving that system alone so that anything else you do on it can't possibly affect the outcome of your tests. Since our pretend machine's cluster setup is completely down right now anyway, we'll just let it run in the foreground. You can stop the server, at any time, by simply typing a ctl-C:

^C
host #


Now, on every other server in the cluster, you'll need to run the dlpiping client. We only have one other server in our cluster, but you would, theoretically, repeat this process as many times as necessary; once for each client. Note, also, that for the dlpiping server and client setups, you should repeat the setup-and-test process for at least one NIC on every node in the cluster that forms a distinct heartbeat-chain. You can determine which NICs these are by looking in the /etc/llttab file (see the quick grep below).
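
A quick way to pull those NICs out of the LLT config, rather than eyeballing the whole file (a hedged example; the link lines shown are just made up for illustration, yours will name your own devices):

host # grep '^link' /etc/llttab
link ce0 /dev/ce:0 - ether - -
link ce1 /dev/ce:1 - ether - -

With that list in hand, the client invocation itself looks like this: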

host # dlpiping -c /dev/ce:0 00:00:00:FF:FF:FF <--- This is the exact output from the getmac command we issued on the dlpiping server host.

If everything is okay with that connection, you'll see a response akin to a Solaris ping reply:

00:00:00:FF:FF:FF is alive

If something is wrong, the output is equally simple to decipher:

no response from 00:00:00:FF:FF:FF

Assuming everything is okay, and you still have problems, you should check out the support site for Veritas Cluster Server and see what they recommend you try next (most likely testing the IP layer functionality - ping! ;)

If things don't work out, and you get the error, that's great (assuming you're a glass-half-full kind of person ;) Getting an error at this layer of the stack greatly reduces the possible-root-cause pool and leaves you with only a few options that are worth looking into. And, since we've already verified physical cabling connectivity (no loose or poorly fitted ethernet cabling in any NIC) and traced the cable (so we know NICA-1 is going to NICB-1, as it should), you can be almost certain that the issue is with the quality or type of your ethernet cabling.

For instance, your cable may be physically damaged or improperly pinned-out (assuming you make your own cables and accidentally made a bad one - mass manufacturers make mistakes, too, though). Also, you may be using a standard ethernet cable, where a crossover (or, in some instances, rollover) cable is required. Of course, whenever you run into a seeming dead-end like this, double check your Veritas Cluster main.cf file to make sure that it's not in any way related to a slight error that you may have missed earlier on in the process.
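
On the main.cf front, you can at least have VCS syntax-check the configuration for you, which quickly rules out the "fat-fingered the config" class of problems (a hedged example, assuming the default configuration directory):

host # hacf -verify /etc/VRTSvcs/conf/config <--- silence generally means the syntax is clean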

In any event, you are now very close to your solution. You can opt to leave your dlpiping server running for as long as you want. To my knowledge it doesn't cause any latency issues that are noticeable (at least in clusters with a small number of nodes). Once you've done your testing, however, it's also completely useless unless you enjoy running that command a lot ;)

Cheers,

Mike