Friday, May 30, 2008

Troubleshooting Veritas Cluster Server LLT Issues On Linux and Unix

Hey There,

Today's post is going to steer away from the Linux and/or Unix Operating Systems just slightly, and look at a problem that a lot of folks run into, but have trouble diagnosing, when they first set up a Veritas cluster.

Our only assumptions for this post are that Veritas Cluster Server is installed correctly on a two-node farm, that everything is set up in the software to fail over and switch correctly, and that no useful information can be obtained via the standard Veritas status commands (or, in other words, the software thinks everything's fine, even though the cluster plainly isn't working correctly ;)

Generally, with issues like this one (the software being unable to diagnose its own condition), the best place to start is at the lowest level. So, we'll add the fact that the physical network cabling and connections have been checked to our list of assumptions.

Our next step would be to take a look at the next layer up on the protocol stack, which would be the LLT (Low Latency Transport) layer. Coincidentally, LLT shares the same level as the MAC, so you may see it referred to, elsewhere, as MAC/LLT, or just MAC, when LLT is actually meant! This is the base layer at which Veritas controls how it sends its heartbeat signals.

The layer-2 LLT protocol is most commonly associated with DLPI (all these initials... man. This one stands for the Data Link Provider Interface), which brings us around to the point of this post ;)

Veritas Cluster Server comes with a utility called "dlpiping" that will specifically test device-to-device (basically NIC-to-NIC or MAC-to-MAC) communication at the LLT layer. Note that if you can't find the dlpiping command, it comes standard as a component in the VRTSllt package and is generally placed in /opt/VRTSllt/ by default. If you want to use it without having to type the entire command, you can just add that directory to your PATH environment variable by typing:

host # PATH=$PATH:/opt/VRTSllt;export PATH
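If you run that PATH line more than once (say, from a profile script), the directory gets appended every time. Here's a minimal sketch of a safer version, assuming the default /opt/VRTSllt location mentioned above. The add_to_path helper is just a name made up for this example:

```shell
# Add the LLT tools directory to PATH only if it is not already there.
LLT_DIR=/opt/VRTSllt   # default location per the post; adjust if yours differs

# Hypothetical helper: append a directory to PATH unless it is already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, do nothing
        *) PATH="$PATH:$1" ;;
    esac
}

add_to_path "$LLT_DIR"
export PATH

# Report whether dlpiping is now reachable.
if command -v dlpiping >/dev/null 2>&1; then
    echo "dlpiping found at: $(command -v dlpiping)"
else
    echo "dlpiping not found; check that the VRTSllt package is installed"
fi
```

Running it a second time leaves PATH unchanged, which keeps things tidy if you source it from more than one place.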

In order to use dlpiping to troubleshoot this issue, you'll need to set up a dlpiping server on at least one node in the cluster. Since we only have two nodes in our imaginary cluster, having it on only one node should be perfect.

To set up the dlpiping server on either node, type the following at the command prompt (unless otherwise noted, all of these Veritas-specific commands are in /opt/VRTSllt and all system information returned, by way of example here, is intentionally bogus):

host # getmac /dev/ce:0 <--- This will give us the MAC address of the NIC we want to set the server up on (ce0, in this instance). For this command, even if your device is actually named ce0, eth0, etc, you need to specify it as "device:instance"
/dev/ce:0 00:00:00:FF:FF:FF
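If you'd rather not copy the MAC address by hand, you can peel it off the getmac output. This is only a sketch, and it assumes the output is a single line in the "device MAC" layout shown above. The mac_from_getmac helper is a made-up name, and the sample line stands in for a real getmac call so the snippet runs anywhere:

```shell
# Hypothetical helper: pull the MAC address (second field) out of a getmac output line.
mac_from_getmac() {
    printf '%s\n' "$1" | awk '{ print $2 }'
}

# Sample line in the format shown above; on a real node you would use:
#   line=$(getmac /dev/ce:0)
line="/dev/ce:0 00:00:00:FF:FF:FF"

SERVER_MAC=$(mac_from_getmac "$line")
echo "$SERVER_MAC"
```

You can then hand $SERVER_MAC to the dlpiping client on the other node instead of retyping it.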

Next, you just have to start the server up on that device, like so (Easy peasy; you're done :)

host # dlpiping -s /dev/ce:0

This command runs in the foreground by default. You can background it if you like, but once it's running, you're better off leaving that system alone so that nothing else you do on it can possibly affect the outcome of your tests. Since our pretend machine's cluster setup is completely down right now anyway, we'll just let it run in the foreground. You can stop the server, at any time, by simply typing a Ctrl-C:

^C
host #


Now, on every other server in the cluster, you'll need to run the dlpiping client. We only have one other server in our cluster, but you would, theoretically, repeat this process as many times as necessary; once for each client. Note, also, that for the dlpiping server and client setups, you should repeat the setup-and-test process for at least one NIC on every node in the cluster that forms a distinct heartbeat-chain. You can determine which NICs these are by looking in the /etc/llttab file.
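To get the heartbeat NICs out of /etc/llttab without eyeballing it, something like the following may help. It's a sketch that assumes the heartbeat links are declared on lines beginning with "link", with the device in the third column. The sample file contents and the list_llt_devices name are made up for illustration, so check them against your own llttab before trusting the output:

```shell
# Hypothetical helper: list the device of every "link" line in an llttab-style file.
list_llt_devices() {
    awk '$1 == "link" { print $3 }' "$1"
}

# Made-up sample file for illustration; on a real node pass /etc/llttab instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
set-node host1
set-cluster 1
link ce0 /dev/ce:0 - ether - -
link ce1 /dev/ce:1 - ether - -
EOF

list_llt_devices "$sample"
rm -f "$sample"
```

Each device it prints is a candidate for the getmac and dlpiping commands in this post.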

host # dlpiping -c /dev/ce:0 00:00:00:FF:FF:FF <--- This is the exact MAC address reported by the getmac command we issued on the dlpiping server host.

If everything is okay with that connection, you'll see a response akin to a Solaris ping reply:

00:00:00:FF:FF:FF is alive

If something is wrong, the output is equally simple to decipher:

no response from 00:00:00:FF:FF:FF
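If you end up testing several links, it can be handy to let a script read the result for you. Here's a minimal sketch that only looks for the two messages shown above. The check_dlpiping_output name is made up, and the exact wording of the output on your version is an assumption, so anything unrecognized gets its own return code:

```shell
# Hypothetical helper: classify one line of dlpiping client output.
# Returns 0 if the peer answered, 1 if it did not, 2 for anything unexpected.
check_dlpiping_output() {
    case "$1" in
        *"is alive"*)    return 0 ;;
        *"no response"*) return 1 ;;
        *)               return 2 ;;
    esac
}

# On a real client node you would capture the output like this:
#   out=$(dlpiping -c /dev/ce:0 00:00:00:FF:FF:FF 2>&1)
out="no response from 00:00:00:FF:FF:FF"   # sample text from the post

if check_dlpiping_output "$out"; then
    echo "LLT link OK"
else
    echo "LLT link problem: $out"
fi
```

Wrap the capture line in a loop over your heartbeat devices and you have a quick pass/fail report for the whole cluster.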

Assuming everything is okay, and you still have problems, you should check out the support site for Veritas Cluster Server and see what they recommend you try next (most likely testing the IP layer functionality - ping! ;)

If things don't work out, and you get the error, that's great (assuming you're a glass-half-full kind of person ;) Getting an error at this layer of the stack greatly reduces the pool of possible root causes and leaves you with only a few options that are worth looking into. And, since we've already verified physical cabling connectivity (no loose or poorly fitted Ethernet cabling in any NIC) and traced the cable (so we know NICA-1 is going to NICB-1, as it should), you can be almost certain that the issue is with the quality or type of your Ethernet cabling.

For instance, your cable may be physically damaged or improperly pinned-out (assuming you make your own cables and accidentally made a bad one - mass manufacturers make mistakes, too, though). Also, you may be using a standard Ethernet cable, where a crossover (or, in some instances, rollover) cable is required. Of course, whenever you run into a seeming dead-end like this, double check your Veritas Cluster main.cf file (and the /etc/llttab file, since that's where the LLT links themselves are defined) to make sure the problem isn't down to a slight error that you may have missed earlier on in the process.

In any event, you are now very close to your solution. You can opt to leave your dlpiping server running for as long as you want. To my knowledge it doesn't cause any latency issues that are noticeable (at least in clusters with a small number of nodes). Once you've done your testing, however, it's also completely useless unless you enjoy running that command a lot ;)

Cheers,

Mike