
Tuesday, November 25, 2008

Quick And Easy Local Filesystem Troubleshooting For SUSE Linux

Hey There,

Today we're going to take a look at some quick and easy ways to determine whether you have a problem with your local filesystem on SUSE Linux (tested on 8.x and 9.x). Of course, we're assuming that you have some sort of an i/o wait issue and the users are blaming it on the local disk. While that's not always the cause (i/o wait can also stem from CPU, memory and even network latency), it never hurts to be able to put out a fire when you need to. And, when the mob's pounding on your door with lit torches, that analogy is never more appropriate ;)

Just as in previous "quick troubleshooting" posts, like the week before last's on basic VCS troubleshooting, we'll be running through this with quick bullets. This won't be too in-depth, but it should cover the basics.

1. Figure out where you are and what OS you're on:

Generally, something as simple as:

host # uname -a

will get you the info you need. For instance, with SUSE Linux (and most others), you'll get output like:

Linux "hostname" kernel-version blah..blah Date Architecture... yada yada

The kernel version in that string is your best indicator. Generally, a kernel-version starting with 2.4.x will be for SUSE 8.x and 2.6.x will be for SUSE 9.x. Of course, also avail yourselves of the, possibly available, /etc/SuSE-release, /etc/issue, /etc/motd, /etc/issue.net files and others like them. It's important that you know what you're working with when you get started. Even if it doesn't matter now, it might later :)
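If you want to go one step past eyeballing "uname -a", a quick sketch like this pulls the kernel version and dumps whichever of those release files actually exist on the box (the file list is just a reasonable guess; adjust it for your site):

host # uname -r        # 2.4.x generally means SUSE 8.x, 2.6.x generally means SUSE 9.x
host # for f in /etc/SuSE-release /etc/issue /etc/motd /etc/issue.net; do
>   [ -f "$f" ] && echo "== $f ==" && cat "$f"
> done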

2. Figure out how many local disks and/or volume groups you have active and running on your system:

Determine your server model, number of disks and volume groups. Since you're on SUSE, you may as well use the "hwinfo" command. I never know how much I'm going to need to know about the system when I first tackle a problem, so I'll generally dump it all into a file and then extract it from there as needed. See our really old post for a script that lists out hardware information on SUSE Linux in a more pleasant format:

host # hwinfo >/var/tmp/hwinfo.out
host # grep system.product /var/tmp/hwinfo.out
system.product = 'ProLiant DL380 G4'


Now, I know what I'm working with. If a grep that specific doesn't work for you, try "grep -i product" - you'll get a lot more information than you need, but your machine's model and number will be in there and much easier to find than if you looked through the entire output file.

Then, go ahead and check out /proc/partitions. This will give you the layout of your disk:

host # cat /proc/partitions
major minor #blocks name

104 0 35561280 cciss/c0d0
104 1 265041 cciss/c0d0p1
104 2 35294805 cciss/c0d0p2
104 16 35561280 cciss/c0d1
104 17 35559846 cciss/c0d1p1
253 0 6291456 dm-0
253 1 6291456 dm-1
253 2 2097152 dm-2
253 3 6291456 dm-3
253 4 10485760 dm-4
253 5 3145728 dm-5
253 6 2097152 dm-6



"cciss/c0d0" and "cciss/c0d1" show you that you have two disks (most probably mirrored, which we can infer from the dm-x output). Depending upon how your local disk is managed, you may see lines that indicate, clearly, that LVM is being used to manage the disk (because the lines contain hints like "lvma," "lvmb" and so forth ;)

58 0 6291456 lvma 0 0 0 0 0 0 0 0 0 0 0
58 1 6291456 lvmb 0 0 0 0 0 0 0 0 0 0 0
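If you don't feel like eyeballing the whole table, a rough awk filter along these lines (tailored to the cciss and dm-x naming shown above, so treat it as a sketch rather than a general-purpose parser) prints just the whole disks and their sizes in blocks:

host # awk '$1 ~ /^[0-9]+$/ && $4 !~ /p[0-9]+$/ && $4 !~ /^dm-/ {print $4, $3}' /proc/partitions
cciss/c0d0 35561280
cciss/c0d1 35561280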


3. Check out your local filesystems and fix anything you find that's broken:

Although it's boring, and manual, it's a good idea to take the output of:

host # df -l

and compare that with the contents of your /etc/fstab. This will clear up any obvious errors, like mounts that are supposed to be up but aren't, or mounts that aren't supposed to be up but are, etc... You can whittle down your output from /etc/fstab to show (mostly) only local filesystems by doing a reverse grep on the colon character (:) - it's generally found in remote mounts and almost never found in local filesystem listings.

host # grep -v ":" /etc/fstab
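And, if you'd rather have the machine do the comparison for you, a rough sketch like the one below walks the local entries in /etc/fstab and complains about any mount point that isn't actually mounted (it skips comments, remote mounts and anything without an absolute path for a mount point, like swap):

host # grep -v ":" /etc/fstab | awk '$1 !~ /^#/ && $2 ~ /^\// {print $2}' | while read mp; do
>   mount | awk '{print $3}' | grep -qx "$mp" || echo "$mp is in fstab but not mounted"
> done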

4. Keep hammering away at the obvious:

Check the Use% column in the output of your "df -l" command. If any filesystems are at 100%, some cleanup is in order. It may seem silly, but usually the simplest problems get missed when one too many managers begin breathing down your neck ;) Also, check inode usage (with "df -il") and ensure that those aren't all being used up either.
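If you'd like to script those checks, too, a couple of hedged one-liners along these lines will flag anything that's completely out of space or inodes (the "-P" just keeps df from wrapping long device names onto a second line):

host # df -Pl | awk '0+$5 >= 100 {print $6, "is full:", $5, "used"}'
host # df -Pli | awk '0+$5 >= 100 {print $6, "is out of inodes:", $5, "used"}'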

Mount any filesystems that are supposed to be mounted but aren't, and unmount any filesystems that are mounted but (according to /etc/fstab) shouldn't be. Someone will complain about the latter at some point (almost guaranteed), which will put you in a perfect position to request that it either be put in the /etc/fstab file or not mounted at all.

You're most likely to have an issue here with mounting the unmounted filesystem that's supposed to be mounted. If you try to mount and get an error indicating that the mountpoint can't be found in /etc/fstab or /etc/mtab, the mount probably isn't listed in /etc/fstab, or there's an issue with the syntax of that particular line (it could even be a "ghost" control character). You should also check to make sure the mount point being referenced actually exists, although you should get an entirely different (and very self-explanatory) error message in the event that you have that problem.
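Two quick checks that catch most of those fstab gotchas (the /data01 mount point here is just a made-up example):

host # grep data01 /etc/fstab | cat -A     # cat -A exposes any "ghost" control characters hiding in the line
host # ls -ld /data01                      # make sure the mount point directory really exists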

If you still can't mount after correcting any of these errors (of course, you could always skip the previous step and mount from the command line using logical device names instead of the entries in /etc/fstab, but it's always nice to know that what you fix will probably stay fixed for a while ;), you may need to "fix" the disk. This will range in complexity from the very simple to the moderately un-simple ;) The simple (Note: If you're running ReiserFS, use reiserfsck instead of plain fsck for all the following examples. I'm just trying to save myself some typing):

host # umount /uselessFileSystem
host # fsck -y /uselessFileSystem
....
host # mount /uselessFileSystem


which, you may note, would be impossible to do (or, I should say, I'd highly recommend you DON'T do) on used-and-mounted filesystems or any special filesystems, like root "/" - In cases like that, if you need to fsck the filesystem, you should optimally do it when booted up off of a cdrom or, at the very least, in single user mode (although you still run a risk if you run fsck against a mounted root filesystem).

For the moderately un-simple, we'll assume a "managed file system," like one under LVM control. In this case you could check a volume that refuses to mount (assuming you tried most of the other stuff above and it didn't do you any good) by first scanning all of them (just in case):

host # vgscan
Reading all physical volumes. This may take a while...
Found volume group "usvol" using metadata type lvm2
Found volume group "themvol" using metadata type lvm2


If "usvol" (or any of them) is showing up as inactive, or is completely missing from your output, you can try the following:

host # vgchange -a y

to use the brute-force method of trying to activate all volume groups that are either missing or inactive. If this command gives you errors, or it doesn't and vgscan still gives you errors, you most likely have a hardware related problem. Time to walk over to the server room and check out the situation more closely. Look for amber lights on most servers. I've yet to work on one where "green" meant trouble ;)
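Before you take that walk, though, it can be worth asking LVM which physical volumes it can actually see; a missing or unknown PV usually points straight at the disk or controller (the volume group name below is just our running example):

host # pvscan                 # lists every physical volume LVM can find, and which volume group it belongs to
host # vgdisplay -v usvol     # look for missing PVs or metadata errors in the verbose output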

If doing the above sorts you out and fixes you up, you just need to scan for logical volumes within the volume group, like so:

host # lvscan
ACTIVE '/dev/usvol/usfs02' [32.00 GB] inherit
....


And (is this starting to sound familiar or am I just repeating myself ;), if this gives you errors, try:

host # lvchange -a y

If the logical volume throws you into an error loop, or it doesn't complain but a repeated run of "lvscan" fails, you've got a problem outside the scope of this post. But, at least you know pretty much where it is!

If you manage to make it through the logical volume scan, and everything seems okay, you just need to remount the filesystem as you normally would. Of course, that could also fail... (Does the misery never end? ;)

At that point, give fsck (or reiserfsck) another shot and, if it doesn't do any good, you'll have to dig deeper and look at possible filesystem corruption so awful you may as well restore the server from a backup or server image (ghost).
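For reference, that whole recovery pass boils down to something like this sketch (the volume group, logical volume and mount point are just the examples used above; substitute your own):

host # vgchange -a y usvol                   # activate the volume group
host # lvchange -a y /dev/usvol/usfs02       # activate the logical volume
host # mount /dev/usvol/usfs02 /usfs02       # "/usfs02" is a hypothetical mount point
host # fsck -y /dev/usvol/usfs02             # only if the mount above fails; use reiserfsck for ReiserFS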

And, that's that! Hopefully it wasn't too much or too little, and helps you out in one way or another :)

Cheers,

, Mike





Friday, May 30, 2008

Troubleshooting Veritas Cluster Server LLT Issues On Linux and Unix

Hey There,

Today's post is going to steer away from the Linux and/or Unix Operating Systems just slightly, and look at a problem a lot of folks run into, but have problems diagnosing, when they first set up a Veritas cluster.

Our only assumptions for this post are that Veritas Cluster Server is installed correctly on a two-node farm, everything is set up to failover and switch correctly in the software, and no useful information can be obtained via the standard Veritas status commands (or, in other words, the software thinks everything's fine, yet the cluster clearly isn't working correctly ;)

Generally, with issues like this one (the software being unable to diagnose its own condition), the best place to start is at the lowest level. So, we'll add the fact that the physical network cabling and connections have been checked to our list of assumptions.

Our next step would be to take a look at the next layer up on the protocol stack, which would be the LLT (Low Latency Transport) layer (which, coincidentally, shares the same level as the MAC, so you may see it referred to, elsewhere, as MAC/LLT, or just MAC, when LLT is actually meant!) This is the base layer at which Veritas controls how it sends its heartbeat signals.

The layer-2 LLT protocol is most commonly associated with the DLPI (all these initials... man. These stand for the Data Link Provider Interface). Which brings us around to the point of this post ;)

Veritas Cluster Server comes with a utility called "dlpiping" that will specifically test device-to-device (basically NIC-to-NIC or MAC-to-MAC) communication at the LLT layer. Note that if you can't find the dlpiping command, it comes standard as a component in the VRTSllt package and is generally placed in /opt/VRTSllt/ by default. If you want to use it without having to type the entire command, you can just add that directory to your PATH environment variable by typing:

host # PATH=$PATH:/opt/VRTSllt;export PATH

In order to use dlpiping to troubleshoot this issue, you'll need to set up a dlpiping server on at least one node in the cluster. Since we only have two nodes in our imaginary cluster, having it on only one node should be perfect.

To set up the dlpiping server on either node, type the following at the command prompt (unless otherwise noted, all of these Veritas-specific commands are in /opt/VRTSllt and all system information returned, by way of example here, is intentionally bogus):

host # getmac /dev/ce:0 <--- This will give us the MAC address of the NIC we want to set the server up on (ce0, in this instance). For this command, even if your device is actually named ce0, eth0, etc, you need to specify it as "device:instance"
/dev/ce:0 00:00:00:FF:FF:FF

Next, you just have to start it up and configure it slightly, like so (Easy peasy; you're done :)

host # dlpiping -s /dev/ce:0

This command runs in the foreground by default. You can background it if you like, but once you start it running on whichever node you start it on, you're better off leaving that system alone so that anything else you do on it can't possibly affect the outcome of your tests. Since our pretend machine's cluster setup is completely down right now anyway, we'll just let it run in the foreground. You can stop the server, at any time, by simply typing a Ctrl-C:

^C
host #
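If you'd rather background the server after all, something like this works and keeps it running if you get logged out (the log file location is just a suggestion):

host # nohup /opt/VRTSllt/dlpiping -s /dev/ce:0 > /var/tmp/dlpiping.log 2>&1 &
host # kill %1     # from the same shell, once you're done testing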


Now, on every other server in the cluster, you'll need to run the dlpiping client. We only have one other server in our cluster, but you would, theoretically, repeat this process as many times as necessary; once for each client. Note, also, that for the dlpiping server and client setups, you should repeat the setup-and-test process for at least one NIC on every node in the cluster that forms a distinct heartbeat-chain. You can determine which NIC's these are by looking in the /etc/llttab file.

host # dlpiping -c /dev/ce:0 00:00:00:FF:FF:FF <--- This MAC address is taken straight from the getmac command we issued on the dlpiping server host.

If everything is okay with that connection, you'll see a response akin to a Solaris ping reply:

00:00:00:FF:FF:FF is alive

If something is wrong, the output is equally simple to decipher:

no response from 00:00:00:FF:FF:FF
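If you have more than one heartbeat link to test, a rough loop like this one saves some typing. It assumes the third field of each "link" line in /etc/llttab holds the device path (as it does on our pretend system), so sanity-check your own llttab format before trusting it:

host # SERVER_MAC=00:00:00:FF:FF:FF     # the MAC reported by getmac on the dlpiping server node
host # awk '$1 == "link" {print $3}' /etc/llttab | while read dev; do
>   echo "testing $dev"
>   /opt/VRTSllt/dlpiping -c "$dev" "$SERVER_MAC"
> done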

Assuming everything is okay, and you still have problems, you should check out the support site for Veritas Cluster Server and see what they recommend you try next (most likely testing the IP layer functionality - ping! ;)

If things don't work out, and you get the error, that's great (assuming you're a glass-half-full kind of person ;) Getting an error at this layer of the stack greatly reduces the possible-root-cause pool and leaves you with only a few options that are worth looking into. And, since we've already verified physical cabling connectivity (no loose or poorly fitted ethernet cabling in any NIC) and traced the cable (so we know NICA-1 is going to NICB-1, as it should), you can be almost certain that the issue is with the quality or type of your ethernet cabling.

For instance, your cable may be physically damaged or improperly pinned-out (assuming you make your own cables and accidentally made a bad one - mass manufacturers make mistakes, too, though). Also, you may be using a standard ethernet cable, where a crossover (or, in some instances, rollover) cable is required. Of course, whenever you run into a seeming dead-end like this, double check your Veritas Cluster main.cf file to make sure that it's not in any way related to a slight error that you may have missed earlier on in the process.

In any event, you are now very close to your solution. You can opt to leave your dlpiping server running for as long as you want. To my knowledge it doesn't cause any latency issues that are noticeable (at least in clusters with a small number of nodes). Once you've done your testing, however, it's also completely useless unless you enjoy running that command a lot ;)

Cheers,

, Mike