Tuesday, November 25, 2008

Quick And Easy Local Filesystem Troubleshooting For SUSE Linux

Hey There,

Today we're going to take a look at some quick and easy ways to determine if you have a problem with your local filesystem on SUSE Linux (tested on 8.x and 9.x). Of course, we're assuming that you have some sort of I/O wait issue and the users are blaming it on the local disk. While that's not always the case (I/O wait can occur because of CPU, memory and even network latency), it never hurts to be able to put out a fire when you need to. And, when the mob's pounding on your door with lit torches, that analogy is never more appropriate ;)

Just as in previous "quick troubleshooting" posts, like the week before last's on basic VCS troubleshooting, we'll be running through this with quick bullets. This won't be too in-depth, but it should cover the basics.

1. Figure out where you are and what OS you're on:

Generally, something as simple as:

host # uname -a

will get you the info you need. For instance, with SUSE Linux (and most others), you'll get output like:

Linux “hostname” kernel-version blah..blah Date Architecture... yada yada

The kernel version in that string is your best indicator. Generally, a kernel-version starting with 2.4.x will be for SUSE 8.x and 2.6.x will be for SUSE 9.x. Of course, also avail yourselves of the, possibly available, /etc/SuSE-release, /etc/issue, /etc/motd, /etc/issue.net files and others like them. It's important that you know what you're working with when you get started. Even if it doesn't matter now, it might later :)
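
If you'd rather not eyeball it, here's a minimal sketch of that rule of thumb as a shell function. The function name "guess_suse_release" is made up for this post, and the mapping is only the rough 2.4/2.6 guideline above, not an official lookup table:

```shell
# Hypothetical helper: guess the SUSE major release from a kernel release
# string, using only the 2.4.x => 8.x / 2.6.x => 9.x rule of thumb.
guess_suse_release() {
    case "$1" in
        2.4.*) echo "probably SUSE 8.x" ;;
        2.6.*) echo "probably SUSE 9.x" ;;
        *)     echo "unknown - check /etc/issue and friends" ;;
    esac
}

# Feed it the kernel release of the box you're sitting on.
guess_suse_release "$(uname -r)"
```

Treat the answer as a hint and confirm it against the release files whenever they exist.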

2. Figure out how many local disks and/or volume groups you have active and running on your system:

Determine your server model, number of disks and volume groups. Since you're on SUSE, you may as well use the "hwinfo" command. I never know how much I'm going to need to know about the system when I first tackle a problem, so I'll generally dump it all into a file and then extract it from there as needed (see our really old post for a script that lists out hardware information on SUSE Linux in a more pleasant format):

host # hwinfo >/var/tmp/hwinfo.out
host # grep system.product /var/tmp/hwinfo.out
system.product = 'ProLiant DL380 G4'

Now, I know what I'm working with. If a grep that specific doesn't work for you, try "grep -i product" - you'll get a lot more information than you need, but your machine's model and number will be in there and much easier to find than if you looked through the entire output file.
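
The "specific grep first, noisy grep second" approach can be wrapped up in a couple of lines. This is just a sketch; "show_product" is a made-up name and it assumes you've already saved the hwinfo output to a file as shown above:

```shell
# Hypothetical helper: print the model line(s) from a saved hwinfo dump.
# Tries the specific "system.product" key first, then falls back to a
# case-insensitive search for "product" (noisier, but rarely empty).
show_product() {
    grep 'system\.product' "$1" || grep -i 'product' "$1"
}

# Assumed usage, with the dump file created earlier:
#   show_product /var/tmp/hwinfo.out
```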

Then, go ahead and check out /proc/partitions. This will give you the layout of your disk:

host # cat /proc/partitions
major minor #blocks name

104 0 35561280 cciss/c0d0
104 1 265041 cciss/c0d0p1
104 2 35294805 cciss/c0d0p2
104 16 35561280 cciss/c0d1
104 17 35559846 cciss/c0d1p1
253 0 6291456 dm-0
253 1 6291456 dm-1
253 2 2097152 dm-2
253 3 6291456 dm-3
253 4 10485760 dm-4
253 5 3145728 dm-5
253 6 2097152 dm-6

"cciss/c0d0" and "cciss/c0d1" show you that you have two disks. The dm-x entries are device-mapper devices, which usually means that a volume manager (most probably LVM2) is sitting on top of those disks. Depending upon how your local disk is managed, you may instead see lines that indicate, clearly, that LVM is being used to manage the disk (because the lines contain hints like "lvma," "lvmb" and so forth ;)

58 0 6291456 lvma 0 0 0 0 0 0 0 0 0 0 0
58 1 6291456 lvmb 0 0 0 0 0 0 0 0 0 0 0
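
If the listing is long, a little awk can sort the names into buckets for you. This is a rough sketch only: "classify_partitions" is a made-up name, and the name patterns (cciss/cXdY, sdX, hdX, dm-N) are assumptions that cover the examples in this post, not every device type you might run into:

```shell
# Hypothetical helper: label each entry of /proc/partitions-style input as
# a whole disk, a partition, a device-mapper node, or "other".
# Reads the named file(s), or stdin if none are given.
classify_partitions() {
    awk '
        NR == 1 || NF < 4 { next }     # skip the header and blank lines
        $4 ~ "^dm-[0-9]+$" { print "dm", $4; next }
        $4 ~ "^cciss/c[0-9]+d[0-9]+$" || $4 ~ "^[hs]d[a-z]+$" { print "disk", $4; next }
        $4 ~ "^cciss/c[0-9]+d[0-9]+p[0-9]+$" || $4 ~ "^[hs]d[a-z]+[0-9]+$" { print "part", $4; next }
        { print "other", $4 }          # anything the patterns above miss
    ' "$@"
}

# Assumed usage:
#   classify_partitions /proc/partitions
```

Counting the "disk" lines (for instance with "grep -c '^disk'") gives you the number of disks without any squinting.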

3. Check out your local filesystems and fix anything you find that's broken:

Although it's boring, and manual, it's a good idea to take the output of:

host # df -l

and compare that with the contents of your /etc/fstab. This will clear up any obvious errors, like mounts that are supposed to be up but aren't, or mounts that aren't supposed to be up but are, etc... You can whittle down your output from /etc/fstab to show (mostly) only local filesystems by doing a reverse grep on the colon character (:) - This is generally found in remote mounts and almost never found in local filesystem listings.

host # grep -v ":" /etc/fstab
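
If the manual comparison gets old, here's a minimal sketch that does the "supposed to be mounted but isn't" half of it. The name "missing_mounts" is made up, and it assumes standard fstab columns (device, mount point, type, options) and a non-empty mounts file such as /proc/mounts:

```shell
# Hypothetical helper: list local mount points that the fstab-style file $1
# says should be mounted but that are missing from the mounts-style file $2.
missing_mounts() {
    awk '
        NR == FNR { mounted[$2] = 1; next }      # first file read: what is mounted now
        /^[ \t]*#/ || NF < 4 { next }            # skip comments and blank lines
        $1 ~ /:/ { next }                        # skip remote (host:/path) entries
        $3 == "swap" || $4 ~ /noauto/ { next }   # swap and noauto entries are not expected
        $2 ~ /^\// && !($2 in mounted) { print $2 }
    ' "$2" "$1"
}

# Assumed usage:
#   missing_mounts /etc/fstab /proc/mounts
```

No output means everything fstab expects is mounted. It won't catch the reverse case (mounted but not in fstab), so you still get to do some reading.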

4. Keep hammering away at the obvious:

Check the Use% column in the output of your "df -l" command. If any filesystems are at 100%, some cleanup is in order. It may seem silly, but usually the simplest problems get missed when one too many managers begin breathing down your neck ;) Also, run "df -li" and check the IUse% column to ensure that the inodes aren't all being used up either.
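
To let the machine do the squinting, something like the following sketch works for both checks. "over_threshold" is a made-up name, and it assumes the POSIX-style output of "df -P", where the percentage is the fifth column and the mount point is the sixth (mount points containing spaces will confuse it):

```shell
# Hypothetical helper: read "df -lP" or "df -liP" output on stdin and print
# every mount point whose percentage column is at or above the limit in $1.
over_threshold() {
    awk -v limit="$1" '
        NR == 1 { next }                 # skip the header line
        {
            pct = $5
            sub(/%$/, "", pct)           # "100%" becomes "100"
            # Ignore non-numeric values such as "-" (no inode data available)
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6, $5
        }
    '
}

# Assumed usage:
#   df -lP  | over_threshold 95      # disk space
#   df -liP | over_threshold 95      # inodes
```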

Mount any filesystems that are supposed to be mounted but aren't, and unmount any filesystems that are mounted but (according to /etc/fstab) shouldn't be. Someone will complain about the latter at some point (almost guaranteed), which will put you in a perfect position to request that it either be put in the /etc/fstab file or not mounted at all.

You're most likely to have an issue here with mounting the unmounted filesystem that's supposed to be mounted. If you try to mount and get an error that indicates the mountpoint can't be found in /etc/fstab or /etc/mtab, the mount probably isn't listed in /etc/fstab or there is an issue with the syntax of that particular line (could even be a "ghost" control character). You should also check to make sure the mount point being referenced actually exists, although you should get an entirely different (and very self-explanatory) error message in the event that you have that problem.

If you still can't mount after correcting any of these errors, you may need to "fix" the disk. (Of course, you could always avoid the previous step and mount from the command line using logical device names instead of paths from /etc/fstab, but it's always nice to know that what you fix will probably stay fixed for a while ;) This will range in complexity from the very simple to the moderately un-simple ;) The simple (Note: If you're running ReiserFS, use reiserfsck instead of plain fsck for all the following examples. I'm just trying to save myself some typing):

host # umount /uselessFileSystem
host # fsck -y /uselessFileSystem
host # mount /uselessFileSystem

which, you may note, would be impossible to do (or, I should say, I'd highly recommend you DON'T do) on used-and-mounted filesystems or any special filesystems, like root "/" - In cases like that, if you need to fsck the filesystem, you should optimally do it when booted up off of a cdrom or, at the very least, in single user mode (although you still run a risk if you run fsck against a mounted root filesystem).
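
Since the whole point is not to fsck something that's still mounted, a quick check beforehand doesn't hurt. Here's a minimal sketch; "is_mounted" is a made-up name, and it assumes a mounts-style file (it defaults to /proc/mounts) with the mount point in the second column:

```shell
# Hypothetical helper: exit 0 only if mount point $1 is listed in the
# mounts-style file $2 (defaults to /proc/mounts when $2 is omitted).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Assumed usage:
#   if is_mounted /uselessFileSystem; then
#       echo "still mounted - umount it before running fsck"
#   fi
```

Note that a non-zero exit also happens if the mounts file can't be read, so don't treat "not mounted" as gospel on a box where /proc is missing.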

For the moderately un-simple, we'll assume a "managed file system," like one under LVM control. In this case you could check a volume that refuses to mount (assuming you tried most of the other stuff above and it didn't do you any good) by first scanning all of them (just in case):

host # vgscan
Reading all physical volumes. This may take a while...
Found volume group "usvol" using metadata type lvm2
Found volume group "themvol" using metadata type lvm2

If "usvol" (or any of them) is showing up as inactive, or is completely missing from your output, you can try the following:

host # vgchange -a y

to use the brute-force method of trying to activate all volume groups that are either missing or inactive. If this command gives you errors, or it doesn't and vgscan still gives you errors, you most likely have a hardware related problem. Time to walk over to the server room and check out the situation more closely. Look for amber lights on most servers. I've yet to work on one where "green" meant trouble ;)

If doing the above sorts you out and fixes you up, you just need to scan for logical volumes within the volume group, like so:

host # lvscan
ACTIVE '/dev/usvol/usfs02' [32.00 GB] inherit

And (is this starting to sound familiar or am I just repeating myself ;), if this gives you errors, try:

host # lvchange -a y /dev/usvol/usfs02

If the logical volume throws you into an error loop, or it doesn't complain but a repeated run of "lvscan" fails, you've got a problem outside the scope of this post. But, at least you know pretty much where it is!
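
With more than a handful of logical volumes, picking out the ones that aren't active gets tedious. The following is only a sketch: "inactive_lvs" is a made-up name, and it assumes lvscan prints the word "inactive" at the start of the line with the device path in single quotes, as in the sample output above:

```shell
# Hypothetical helper: read lvscan output on stdin and print the device
# path of every logical volume reported as inactive.
inactive_lvs() {
    # Split on single quotes: $1 is the status column, $2 is the device path
    awk -F"'" '$1 ~ /^[ \t]*inactive/ { print $2 }'
}

# Assumed usage:
#   lvscan | inactive_lvs
```

Each path it prints is a candidate for "lvchange -a y".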

If you manage to make it through the logical volume scan, and everything seems okay, you just need to remount the filesystem as you normally would. Of course, that could also fail... (Does the misery never end? ;)

At that point, give fsck (or reiserfsck) another shot and, if it doesn't do any good, you'll have to dig deeper and look at possible filesystem corruption so awful you may as well restore the server from a backup or server image (ghost).

And, that's that! Hopefully it wasn't too much or too little, and helps you out in one way or another :)


, Mike
