The Linux and Unix Menagerie: disk


Tuesday, November 25, 2008

Quick And Easy Local Filesystem Troubleshooting For SUSE Linux

Hey There,

Today we're going to take a look at some quick and easy ways to determine if you have a problem with your local filesystem on SUSE Linux (tested on 8.x and 9.x). Of course, we're assuming that you have some sort of an i/o wait issue and the users are blaming it on the local disk. While it's not always the case (i/o wait can occur because of CPU, memory and even network latency), it never hurts to be able to put out a fire when you need to. And, when the mob's pounding on your door with lit torches, that analogy is never more appropriate ;)

Just as in previous "quick troubleshooting" posts, like the week before last's on basic VCS troubleshooting, we'll be running through this with quick bullets. This won't be too in-depth, but it should cover the basics.

1. Figure out where you are and what OS you're on:

Generally, something as simple as:

host # uname -a

will get you the info you need. For instance, with SUSE Linux (and most others), you'll get output like:

Linux “hostname” kernel-version blah..blah Date Architecture... yada yada

The kernel version in that string is your best indicator. Generally, a kernel-version starting with 2.4.x will be for SUSE 8.x and 2.6.x will be for SUSE 9.x. Of course, also avail yourselves of the, possibly available, /etc/SuSE-release, /etc/issue, /etc/motd, /etc/issue.net files and others like them. It's important that you know what you're working with when you get started. Even if it doesn't matter now, it might later :)
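If you want that rule of thumb in a form you can paste into a script, here's a minimal sketch. The function name guess_suse_release is made up for illustration, and it only encodes the 2.4/2.6 guideline above, so treat its answer as a hint rather than a verdict:

```shell
# guess_suse_release KERNEL_VERSION
# Hypothetical helper: applies the rule of thumb above
# (2.4.x -> SUSE 8.x, 2.6.x -> SUSE 9.x). Anything else is "unknown".
guess_suse_release() {
    case "$1" in
        2.4.*) echo "probably SUSE 8.x" ;;
        2.6.*) echo "probably SUSE 9.x" ;;
        *)     echo "unknown - check /etc/issue and friends" ;;
    esac
}

# Typical use on a live box:
#   guess_suse_release "$(uname -r)"
```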

2. Figure out how many local disks and/or volume groups you have active and running on your system:

Determine your server model, number of disks and volume groups. Since you're on SUSE, you may as well use the "hwinfo" command. I never know how much I'm going to need to know about the system when I first tackle a problem, so I'll generally dump it all into a file and then extract it from there as needed. See our really old post for a script that Lists out hardware information on SUSE Linux in a more pleasant format:

host # hwinfo >/var/tmp/hwinfo.out
host # grep system.product /var/tmp/hwinfo.out
system.product = 'ProLiant DL380 G4'

Now, I know what I'm working with. If this specific of a grep doesn't work for you, try "grep -i product" - you'll get a lot more information than you need, but your machine's model and number will be in there and much easier to find than if you looked through the entire output file.

Then, go ahead and check out /proc/partitions. This will give you the layout of your disk:

host # /proc # cat /proc/partitions
major minor  #blocks  name

 104     0   35561280 cciss/c0d0
 104     1     265041 cciss/c0d0p1
 104     2   35294805 cciss/c0d0p2
 104    16   35561280 cciss/c0d1
 104    17   35559846 cciss/c0d1p1
 253     0    6291456 dm-0
 253     1    6291456 dm-1
 253     2    2097152 dm-2
 253     3    6291456 dm-3
 253     4   10485760 dm-4
 253     5    3145728 dm-5
 253     6    2097152 dm-6

"cciss/c0d0" and "cciss/c0d1" show you that you have two disks (most probably mirrored, which we can infer from the dm-x output). Depending upon how your local disk is managed, you may see lines that indicate, clearly, that LVM is being used to manage the disk (because the lines contain hints like "lvma," "lvmb" and so forth ;)

58 0 6291456 lvma 0 0 0 0 0 0 0 0 0 0 0
58 1 6291456 lvmb 0 0 0 0 0 0 0 0 0 0 0
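If eyeballing the list gets old, a quick tally can help. This is only a sketch, assuming the naming seen in the output above ("dm-N" for device-mapper devices, a trailing "pN" for cciss partitions); the function name count_block_devices is invented, and other disk naming schemes (sda1, hda2 and so on) would need their own rules:

```shell
# count_block_devices FILE
# Hypothetical helper: FILE is a copy of /proc/partitions in the four-column
# layout shown above. Lines with extra statistics columns (like the lvma
# example) are ignored because they don't have exactly four fields.
count_block_devices() {
    awk '
        NR > 1 && NF == 4 {
            if ($4 ~ /^dm-/)           dm++      # device-mapper device
            else if ($4 ~ /p[0-9]+$/)  part++    # cciss-style partition
            else                       disk++    # whole disk
        }
        END { printf "disks=%d partitions=%d dm=%d\n", disk, part, dm }
    ' "$1"
}

# Typical use:
#   count_block_devices /proc/partitions
```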

3. Check out your local filesystems and fix anything you find that's broken:

Although it's boring, and manual, it's a good idea to take the output of:

host # df -l

and compare that with the contents of your /etc/fstab. This will clear up any obvious errors like mounts that are supposed to be up but aren't or mounts that aren't supposed to be up that are, etc... You can whittle down your output from /etc/fstab to show (mostly) only local filesystems by doing a reverse grep on the colon character (:) - This is generally found in remote mounts and almost never found in local filesystem listings.

host # grep -v ":" /etc/fstab
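The manual comparison can be roughed out in a few lines of shell. This is a sketch, not a finished tool: fstab_missing is a made-up name, it assumes you've saved the currently mounted mount points one per line in a file, and it only looks at fstab entries whose device field starts with "/", "LABEL=" or "UUID=" (so proc, tmpfs and swap are skipped):

```shell
# fstab_missing FSTAB MOUNTED
# Hypothetical helper: prints local fstab mount points that do not appear
# in MOUNTED (a file with one mounted mount point per line).
fstab_missing() {
    grep -v ':' "$1" |
    awk '$1 ~ /^(\/|LABEL=|UUID=)/ && $2 ~ /^\// && $4 !~ /noauto/ { print $2 }' |
    while read -r mp; do
        grep -qxF -- "$mp" "$2" || echo "$mp"
    done
}

# Typical use:
#   df -lP | awk 'NR > 1 { print $6 }' > /var/tmp/mounted.out
#   fstab_missing /etc/fstab /var/tmp/mounted.out
```

Anything it prints is listed in /etc/fstab but isn't currently mounted, which is exactly the kind of thing worth a second look.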

4. Keep hammering away at the obvious:

Check the Use% column in the output of your "df -l" command. If any filesystems are at 100%, some cleanup is in order. It may seem silly, but usually the simplest problems get missed when one too many managers begin breathing down your neck ;) Also, run "df -li" to see the inode columns (plain "df -l" doesn't show them) and ensure that those aren't all being used up either.
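If you'd rather not squint at percentages, here's a small sketch that does the squinting for you. The name over_threshold is hypothetical; it assumes POSIX-style df output (the -P flag), where the percentage is in column 5 and the mount point in column 6:

```shell
# over_threshold PERCENT
# Reads "df -lP" (or "df -liP") style output on stdin and prints every
# mount point whose percentage in column 5 is at or above PERCENT.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)                       # "97%" -> "97"
            if (pct + 0 >= limit + 0) print $6, $5
        }
    '
}

# Typical use:
#   df -lP  | over_threshold 100    # completely full filesystems
#   df -liP | over_threshold 90     # filesystems running low on inodes
```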

Mount any filesystems that are supposed to be mounted but aren't, and unmount any filesystems that are mounted but (according to /etc/fstab) shouldn't be. Someone will complain about the latter at some point (almost guaranteed), which will put you in a perfect position to request that it either be put in the /etc/fstab file or not mounted at all.

You're most likely to have an issue here with mounting the unmounted filesystem that's supposed to be mounted. If you try to mount and get an error that indicates the mountpoint can't be found in /etc/fstab or /etc/mtab, the mount probably isn't listed in /etc/fstab or there is an issue with the syntax of that particular line (could even be a "ghost" control character). You should also check to make sure the mount point being referenced actually exists, although you should get an entirely different (and very self-explanatory) error message in the event that you have that problem.

If you still can't mount, after correcting any of these errors (of course, you could always avoid the previous step and mount from the command line using logical device names instead of paths from /etc/fstab, but it's always nice to know that what you fix will probably stay fixed for a while ;), you may need to "fix" the disk. This will range in complexity from the very simple to the moderately un-simple ;) The simple (Note: If you're running ReiserFS, use reiserfsck instead of plain fsck for all the following examples. I'm just trying to save myself some typing):

host # umount /uselessFileSystem
host # fsck -y /uselessFileSystem
....
host # mount /uselessFileSystem

which, you may note, would be impossible to do (or, I should say, I'd highly recommend you DON'T do) on used-and-mounted filesystems or any special filesystems, like root "/" - In cases like that, if you need to fsck the filesystem, you should optimally do it when booted up off of a cdrom or, at the very least, in single user mode (although you still run a risk if you run fsck against a mounted root filesystem).
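Since running fsck against a mounted filesystem is the thing to avoid, a tiny guard doesn't hurt. This is a sketch under stated assumptions: is_mounted and safe_fsck are invented names, and the check simply looks for the mount point in the second column of /proc/mounts (or whatever file you hand it):

```shell
# is_mounted MOUNTPOINT [MTAB]
# Succeeds if MOUNTPOINT appears in the second column of MTAB
# (defaults to /proc/mounts).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# safe_fsck MOUNTPOINT
# Hypothetical wrapper: refuses to run fsck while the filesystem is mounted.
safe_fsck() {
    if is_mounted "$1"; then
        echo "$1 is still mounted - not running fsck" >&2
        return 1
    fi
    fsck -y "$1"
}
```

On ReiserFS you'd swap reiserfsck in for fsck, same as in the examples above.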

For the moderately un-simple, we'll assume a "managed file system," like one under LVM control. In this case you could check a volume that refuses to mount (assuming you tried most of the other stuff above and it didn't do you any good) by first scanning all of them (just in case):

host # vgscan
Reading all physical volumes. This may take a while...
Found volume group "usvol" using metadata type lvm2
Found volume group "themvol" using metadata type lvm2

If "usvol" (or any of them) is showing up as inactive, or is completely missing from your output, you can try the following:

host # vgchange -a y

to use the brute-force method of trying to activate all volume groups that are either missing or inactive. If this command gives you errors, or it doesn't and vgscan still gives you errors, you most likely have a hardware related problem. Time to walk over to the server room and check out the situation more closely. Look for amber lights on most servers. I've yet to work on one where "green" meant trouble ;)

If doing the above sorts you out and fixes you up, you just need to scan for logical volumes within the volume group, like so:

host # lvscan
ACTIVE '/dev/usvol/usfs02' [32.00 GB] inherit
....

And (is this starting to sound familiar or am I just repeating myself ;), if this gives you errors, try:

host # lvchange -a y /dev/usvol/usfs02

If the logical volume throws you into an error loop, or it doesn't complain but a repeated run of "lvscan" fails, you've got a problem outside the scope of this post. But, at least you know pretty much where it is!
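To save some reading when lvscan prints a long list, something like the following can pick out the volumes that aren't active. It's only a sketch: inactive_lvs is a made-up name, and it assumes lvscan output shaped like the sample above (state in the first column, quoted device path in the second):

```shell
# inactive_lvs
# Reads "lvscan" output on stdin and prints the path of every logical
# volume whose first column is something other than ACTIVE.
inactive_lvs() {
    awk -v q="'" '
        NF >= 2 && $2 ~ /^.\/dev\// && $1 != "ACTIVE" {
            path = $2
            gsub(q, "", path)      # strip the single quotes lvscan adds
            print path
        }
    '
}

# Typical use:
#   lvscan | inactive_lvs | while read -r lv; do lvchange -a y "$lv"; done
```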

If you manage to make it through the logical volume scan, and everything seems okay, you just need to remount the filesystem as you normally would. Of course, that could also fail... (Does the misery never end? ;)

At that point, give fsck (or reiserfsck) another shot and, if it doesn't do any good, you'll have to dig deeper and look at possible filesystem corruption so awful you may as well restore the server from a backup or server image (ghost).

And, that's that! Hopefully it wasn't too much or too little, and helps you out in one way or another :)

Cheers,

, Mike

Please note that this blog accepts comments via email only. See our Mission And Policy Statement for further details.

Thursday, October 30, 2008

LVM's Roots - Mirroring Your Boot Disk On HP-UX 10 Unix

Hey There,

If you read this blog every once in a while (or if you just happen to have ever searched for - or queried the tag named - LVM in our growing library of questionably-valuable articles ;) you've probably noted that, although mentioned in passing, none of them has ever dealt directly with HP-UX. AIX and, of course, Linux have received their fair share of attention. Even Solaris Volume Manager (or, if you still prefer it, Solstice Disk Suite) and Veritas Volume Manager have been covered in some detail with reference to their similarity to LVM. It's about time that HP-UX (arguably one of the mothers of LVM (or LVM2) as we know it today) should get some sort of treatment. This blog is, after all, dedicated to Linux and Unix (both terms being purposefully generic so we can write about whatever *nix machines we can get our hands on :)

Today's entry is a bit of a quick introduction to HP's Logical Volume Manager and was written specifically for an HP-UX 10.x box. We haven't specifically tested this against 11.x or 11i, but, from our experience working with both, this script should work with little-or-no modification on 11.x. Now that we've got a few HP servers to have fun with (I mean... work really hard on ;), we'll give HP-UX its due and run through the essentials of LVM. We'll try to make it as short and sweet as possible, while not skimming over the basics, so that the posts themselves can serve as a decent reference for a straight-up HP-UX user. Actually, if you're an HP-UX user (Experience here ranges from 9.x through 11i - Old 800 series K class towers (with matching WYSE terminals) to SuperDomes and some of the newer 9000 series), you're also well familiar with the huge differences in the basic functionality between versions of the OS and the ISL and GSP/BCH underpinnings (which you could, somewhat, liken to a difference between the Domain Console/System Controller setup on the big Sun 3800 through 25k servers and the newer XCP/XSCF setup on the Mx000 series).

Until that day, here's a little script to help mirror your root disks on HP-UX 10. This was actually tested and used on a K100 Server (Refurbished, of course, but smokin' fast with 4 100 MHz CPU's. Actually, pretty decent once it boots up okay :)

Cheers,

This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!/bin/sh

#
# hpmirrordisk.sh - Double check this before you run it.  Seriously :)
#
# 2008 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

echo "You have installed the MirrorDisk-HPUX Product already, right?"
echo "If you still need to, hit ctl-C and quit running this script!"
read oblig
echo
echo "This should show only one bootable disk to start out..."
echo
lvlnboot -v
echo
echo "New bootable physical volume name [c?t?d?]"
echo
read disk
pvcreate -B /dev/rdsk/$disk
echo
echo "Volume group you'll be mirroring [vg??]"
echo
read vgroup
vgextend /dev/$vgroup /dev/dsk/$disk
echo
echo "Making Mirror Disk Bootable..."
echo
mkboot /dev/rdsk/$disk
mkboot -a "hpux -lq (;0)/stand/vmunix" /dev/rdsk/$disk
echo
echo "This result output should give you back \"hpux -lq (;0)/stand/vmunix\""
echo
lifcp /dev/rdsk/${disk}:AUTO -
echo
echo "Return to continue"
echo
read oblig
echo
echo "Extending all logical volumes..."
echo
for x in 1 2 3 4 5 6 7 8 9
do
 /usr/sbin/lvextend -m 1 /dev/${vgroup}/lvol$x /dev/dsk/$disk
done
echo
echo "This should show two bootable disks now..."
echo
lvlnboot -v
echo
echo "Okay? Hit return"
echo
read oblig
echo
echo "For these tests, the Physical Extents should map to the"
echo "Logical Extents for each logical volume."
echo
for x in 1 2 3 4 5 6 7 8 9
do
echo
 echo "lvol$x"
 echo
 lvdisplay -v /dev/${vgroup}/lvol$x |sed -n '/Distribution/,/Logical/p'|sed '$d'
 echo
 echo "Okay? Hit return"
 echo
 read oblig
done

echo
echo "All you need to do now is reboot and be sure to halt the"
echo "system at the boot menu.  Make sure the Primary Boot Path"
echo "is set to the mirrored disk.  If not, you will have to set"
echo "the correct path in the COnfiguration menu.  Also, under the"
echo "COnfiguration menu, set Auto Search equal to ON."
echo "Finally, be sure to set the Alternate Boot Path to the"
echo "original disk"
echo
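
The script above assumes the logical volumes are named lvol1 through lvol9. If yours aren't, one way around it is to pull the names out of "vgdisplay -v" instead. This is just a sketch (lv_names is an invented helper, and the loop in the comment hasn't been run on a real HP-UX box), so double check it the same way you'd double check the script itself:

```shell
# lv_names
# Reads "vgdisplay -v /dev/vgXX" output on stdin and prints each logical
# volume path found on an "LV Name" line.
lv_names() {
    awk '$1 == "LV" && $2 == "Name" { print $3 }'
}

# Possible replacement for the fixed 1..9 lvextend loop:
#   vgdisplay -v /dev/$vgroup | lv_names | while read lv
#   do
#       /usr/sbin/lvextend -m 1 $lv /dev/dsk/$disk
#   done
```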

, Mike


Wednesday, October 8, 2008

Puppy Linux Live Trumps LinuxDefender In More Ways Than One

Hey There,

More than a few people wrote in to let me know about other interesting "live" distros of Linux after our post on using LinuxDefender Live CD to Fix NTFS problems ran. I've been a bit busy with my son, now that he's started school and isn't completely exhausted at the end of the day (like I always am ;), but I will post each and every comment I received (at least, with the consent of the commenters) as soon as I can.

The one thing that blew my mind is that the download links for LinuxDefender Live don't even work any more (One thing you should understand about me is that I'm a pack-rat. If I like something, I make 15 copies of it and hide it in several different counties. My LinuxDefender Live CD was like gold). I knew they'd been swallowed up by BitDefender (or am I getting foggy and not remembering them always being under that umbrella? ;), but I had no idea that the entire project had been trashed. This is why I haven't included any links to the distro in this post, so far. For your amusement, here's the still-live page that has all the download links (On BitDefender's site, no less) you'll ever need. Unfortunately, none of them link to any available content. Check it out. Try and download LinuxDefender Live, I dare ya. Why that page hasn't been scrapped, I can't even begin to waste my time trying to understand ;)

In going over all the alternatives (including a very nice and mature Knoppix that now fully supports NTFS read/write), I settled on Puppy Linux. It seemed to have the best frugal-to-usable ratio out there. And, no surprise here, it's very easy to use. You can download your own copy from the Live-Puppy Download Site. It's completely free of charge (although I'm sure they wouldn't sneeze at a donation ;).

Even more impressive are some of the little things that differ from my now-seemingly-ridiculous LinuxDefender CD. The Live Puppy OS has three main features that make me like it a lot (although these don't encompass them all):

1. It includes support for NTFS = I can fix my kid's computer and not have to reboot Windows for 17 hours just to reload the corrupted VGA driver.

2. It makes excellent use of RAM-loading the OS after boot from the CD = I don't feel like I'm using a read-only OS.

3. It can save to disk without destroying your resident OS = I can save my personalizations, which makes it feel even more natural. Of course, I can't go so far as to download and install packages (unless I've partitioned my disk in preparation, which defeats the purpose ;), but it's still nice to be able to save my little tweaks.

Considering that we've got the same issue to deal with that we did in our LinuxDefender Live post (like the SYSTEM file, in C:\WINNT\ is corrupted and you just need to be able to copy it off and replace it with SYSTEM.BAK), these are the same steps (but fewer) that we'd take to fix up our NTFS Windows box and make everything better:

1. Pop open the CD tray while the system is still powered on. If that doesn't work, power it down and use the pinhole-method (sticking a pin in the hole in the front of the CD-ROM drive to manually eject it). Place the Live-Puppy CD in there and close it back up. Then power up or restart your machine as your situation dictates.

2. However your system allows you to, push the correct button (f1 or maybe f10/f12) when you power up the machine so that you can get to the system settings and make sure that your CD-ROM drive is listed as a Boot Device and is in the Boot Sequence (preferably first) so that our CD will be able to boot the system from the CD-ROM drive.

3. Power on the machine and kick back. Live Puppy is pretty cool to watch if you've never seen it before. It should work without issue on your box. I've heard reports that it even works on a lot of funky custom AlienWare computers!

4. Once you're finished booting up and have either your desktop GUI or the CLI up and running, just mount the Windows hard drive like you'd mount any Linux hard drive, on a temporary mount point. If you prefer to use the GUI, you can mount the disk just like in Windows; no issues!

5. Skipping about 5 steps from the LinuxDefender Live fiasco, you can mount your Windows drive and access it like any regular Linux drive. Again, be sure to pass the options to mount (man mount) to indicate that you want to mount the disk read/write as NTFS. I'm incredibly paranoid, so I just cd directly into the WINNT directory (in this instance), copy off the bad SYSTEM file, copy the SYSTEM.BAK file to SYSTEM, cd back to where I was and umount. Actually, if I was really bad, I'd just use absolute path names ;)

6. Now, you just exit or reboot and remove the Live-Puppy CD (or vice versa). Windows should come right up and run as poorly as it always has ;)
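
For the command line crowd, the file shuffle in step 5 could be sketched like this. The helper name swap_system_file is made up, and the device name, mount point and the ntfs-3g filesystem type in the comments are assumptions about your particular setup, so adjust to taste:

```shell
# swap_system_file DIR
# Hypothetical helper: inside DIR, keeps the bad SYSTEM file as SYSTEM.bad
# and copies SYSTEM.BAK over SYSTEM. Stops at the first failure.
swap_system_file() {
    [ -f "$1/SYSTEM.BAK" ] || { echo "no SYSTEM.BAK in $1" >&2; return 1; }
    cp -p "$1/SYSTEM" "$1/SYSTEM.bad" &&
    cp -p "$1/SYSTEM.BAK" "$1/SYSTEM"
}

# On the live CD this might look like (device and mount point are assumptions):
#   mkdir -p /mnt/windows
#   mount -t ntfs-3g /dev/hda1 /mnt/windows
#   swap_system_file /mnt/windows/WINNT
#   umount /mnt/windows
```

Keeping the bad file around as SYSTEM.bad costs nothing and gives you a way back if the backup turns out to be worse.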

I can't say enough about this distro. Check out Live-Puppy CD for free!. Even if you can't stand "real" live puppies, you're gonna love this one :)

Next up in our "avoiding Windows support" series: How to disappear in the jungles of uncharted Africa ;)

Cheers,

, Mike


Tuesday, September 30, 2008

How To Resolve Veritas Disk Group Cluster Volume Management Problems On Linux or Unix

Hey There,

Today we're going to look at an issue that, while it doesn't happen all that often, happens just enough to make it post-worthy. I've only seen it a few times in my "career," but I don't always have access to the fancy software, so this problem may be more widespread than I've been led to believe ;) The issue we'll deal with today is: What do you do when disk groups, within a cluster, conflict with one another? Or, more correctly, what do you do when disk groups within a cluster conflict with one another even though all the disk is being shared by every node in the cluster? If that still doesn't make sense (and I'm not judging "you," it just doesn't sound right to me, yet ;) what do you do in a situation where every node in a cluster shares a common disk group and, for some bizarre reason, this creates a conflict between nodes in the cluster and some of them refuse to use the disk even though it's supposed to be accessible through every single node? Enough questions... ;)

Check out these links for a smattering of other posts we've done on dealing with Veritas Volume Manager and fussing with Veritas Cluster Server. Some of the material covered may be useful if you have problems with any of the concepts glossed over in the problem resolution at the end.

Like I mentioned, this "does" happen from time to time, and not for the reasons you might generally suspect (like one node having a lock on the disk group and refusing to share, etc). In fact, the reason this happens sometimes (in this very particular case) is quite interesting. Even quite disturbing, since you'd expect that this shouldn't be able to happen.

Here's the setup, and another reason this problem seems kind of confusing. A disk group (we'll call it DiskGroupDG1 because we're all about creativity over here ;) is being shared between 2 nodes in a 2 node cluster. Both nodes have Veritas Cluster Server (VCS) set up correctly and no other problems with Veritas exist. If the DiskGroupDG1 disk group is imported on Node1, using the Cluster Volume Manager (CVM), it can be mounted and accessed by Node2 without any issues. However, if DiskGroupDG1 is imported on Node2, using CVM, it cannot be mounted and/or accessed by Node1.

All things being equal, this doesn't readily make much sense. There are no disparities between the nodes (insofar as the Veritas Cluster and Volume Management setup are concerned) and things should be just peachy going one way or the other. So, what's the deal, then?

The problem, actually, has very little to do with VCS and/or CVM (Although they're totally relevant and deserve to be in the title of the post -- standard disclaimer ;). The actual issue has to do, mostly, with minor disk numbering on the Node1 and Node2 servers. What???

Here's what happens:
In the first scenario (where everything's hunky and most everything's dorey) the DiskGroupDG1 disk group is imported by CVM on Node1 and Node1 notices that the "minor numbers" of the disks in the disk group are exactly the same as the "minor numbers" on disk it already has mounted locally. You can always tell a disk's (or any other device's) minor number by using the ls command on Linux or Unix, like so:

host # /dev/dsk # ls -ls c0t0d0s0
   2 lrwxrwxrwx   1 root     root          41 May 11  2001 c0t0d0s0 -> ../../devices/pci@1f,4000/scsi@3/sd@0,0:a
host # /dev/dsk # ls -ls ../../devices/pci@1f,4000/scsi@3/sd@0,0:a
   0 brw-r-----   1 root     sys       32,  0 May 11  2001 ../../devices/pci@1f,4000/scsi@3/sd@0,0:a

<-- In this instance, the device's "major number" is 32 and the device's "minor number" is 0. Generally, with virtual disks, etc, you won't see numbers that low.

Now Node1, since it recognizes this conflict on import, does what Veritas VM naturally does to avoid conflict; it renumbers the imported volumes ("minor number" only) so that the imported volumes won't conflict with volumes in another disk group that's already resident on the system it's managing. Therefore, when Node2 attempts to mount, with CVM, the command is successful.

In the second scenario (where things are a little bit hunky, but not at all dorey), Node2 imports the DiskGroupDG1 disk group and none of the minor numbers in that disk group's volumes conflict with any of its local (or already mounted) disk. The disk group volumes are imported with no error, but, the "minor numbers" are not temporarily changed, either. You see where this is going. It's a freakin' train wreck waiting to happen ;)

Now, when Node1 attempts to mount, it determines there's a conflict, but can't renumber the "minor numbers" on the disk group's volumes (since they're already imported and mounted on Node2) and, therefore, takes the only other course of action it can think of and bails completely.

So, how do you get around this once and for all? Well, I'm not sure it's entirely possible to anticipate this problem with a variable number of nodes in a cluster, all with independent disk groups and, also, sharing disk groups between nodes, although you could take simple measures to prevent it most of the time (like running ls against every volume in every disk group in a cluster every now and again and making sure no conflicts existed. The script should be pretty easy to whip up).
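
Here's roughly what that "easy to whip up" check might look like. It's a sketch under a few assumptions: minor_conflicts is an invented name, both input files hold "ls -lL" output for device nodes (without the -s block-count column), and the major and minor numbers land in fields 5 and 6 the way they do in the listing above:

```shell
# minor_conflicts LISTING_A LISTING_B
# Each file holds "ls -lL" output for block/character devices. Prints the
# major,minor pair plus both device names whenever a pair from LISTING_B
# also appears in LISTING_A.
minor_conflicts() {
    awk '
        !/^[bc]/            { next }                       # device nodes only
        FILENAME == ARGV[1] { seen[$5 $6] = $NF; next }    # remember listing A
        ($5 $6) in seen     { print $5 $6, seen[$5 $6], $NF }
    ' "$1" "$2"
}

# Typical use (disk group names as in this post):
#   ls -lL /dev/vx/dsk/DiskGroupDG1/* > /var/tmp/shared.out
#   ls -lL /dev/vx/dsk/LocalDG1/*     > /var/tmp/local.out
#   minor_conflicts /var/tmp/shared.out /var/tmp/local.out
```

No output means no overlap between the two listings, at least on the node where you ran it.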
Basically, in this instance (and any like it), the solution involves doing what Veritas VM did in the first scenario; except doing it all-the-way. No temporary-changing of "minor numbers." For our purposes, we'd like to change them permanently, so that they never conflict again! It can be done in a few simple steps.

1. Stop VCS on the problem node first.

2. Stop any applications using the local disk group whose "minor numbers" conflict with the "minor numbers" of the volumes in DiskGroupDG1.

3. Unmount (umount) the filesystems and deport the affected disk group.

4. Now, pick a new "minor number" that won't conflict with the DiskGroupDG1 "minor numbers." Higher is generally better, but I'd check the minor numbers on all the devices in my device tree just to be sure.

5. Run the following command against your local disk group (named, aptly, LocalDG1 ;) :

host # vxdg reminor LocalDG1 3900 <-- Note that this number is the base, so every volume, past the initial, within the disk group will have a "minor number" one integer higher than the last (3900, 3901, etc)

6. Reimport the LocalDG1 disk group

7. Remount your filesystems, restart your applications and restart VCS on the affected node.

8. You don't have to, but I'd do the same thing on all the nodes, if I had a window in which to do it.

And, that would be that. Problem solved.

You may never ever see this issue in your lifetime. But, if you do, hopefully, this page (or one like it) will still be cyber-flotsam on the info-sea ;)

Cheers,

, Mike


Wednesday, August 20, 2008

How To Manage Your Disk By UUID On Linux

Hey There,

Today's post is about something I think is pretty cool on Linux (since about kernel 2.1.x, when /proc/partitions was introduced, or made standard). It has to do with disk mounting (both on the command line or through the fstab) by UUID (Universally Unique Identifier). UUID notation, when used as a means to access disk, is just one more way that Linux has moved ahead of the pack to (depending on your way of thinking ;) either make disk management more accessible or make the fstab and disk identification even more confusing ;) NOTE: Skip the next paragraph if you don't care about Open Solaris' slight support for this functionality, to date. Skip to the numbered list if you just want to check out the commands and have had your fair share of my opinion ;)

For those of you who got here by catching the Solaris and Unix tags on this post, I want to address your concerns immediately, since you may no longer be concerned with this text after the next few sentences ;) Although Solaris does "understand" UUID addressing, the level on which Solaris addresses the issue (with regards to disk management) isn't user-friendly enough to fit in the scope of this post. Basically, and this is putting it very generically, the getting and setting of object UUID's on Solaris is still only resident at the code-base layer. I'm glad to see that Sun is addressing the issue with C functions like wsreg_set_id() and wsreg_get_id(), but, since the functionality provided by this layer of access hasn't been implemented in any relevant user tools, we won't be looking at Solaris' implementation of it for the remainder of this dialogue. Ok, I'll give Solaris 10 points for having expanded greatly upon the previous version's acceptance of the standard by implementing a lot of new C routines, a "makeuuid" binary and support for UUID's of zones, but, again, since we're going to be looking at mounting disk using the UUID (without re-writing the OS), Solaris (Open and Regular) is out for now (8/20/2008 just in case the future makes me incorrect, which it has a nasty habit of doing ;)

While Linux boasts most of the same C routines and headers as Solaris (which it must, of course, since the OS supports UUID identification), they're named slightly differently and - the biggest plus - Linux (RedHat and Ubuntu, at least) comes with plenty of programs to work with disk UUID's and plenty of hooks to allow other programs to make use of the disk UUID's as well!

The most basic program (that Solaris has picked up on) is called "uuidgen." This program will generate a UUID for you based on the output from a decent randomness-generator (like /dev/random) or resort to time-and-MAC-based randomization ( Generally, the only random factor used is time, unless you have the privilege to view your ethernet adapter's MAC address). The program can be forced to use one or the other, if you have a specific preference (with the "-r" and "-t" flags, respectively). This seemingly extraneous program does have one very important area of application, which we'll look at below.

Where you really see the benefit with Linux is in how they've worked it into their basic hard disk management facilities. They've made it very simple for you to keep track of your disks by UUID using any number of methods. I'll be listing several different means to some "ends" you may want to achieve, as every command may not be available in your Linux distro, but at least one probably is.

One, often unmentioned (but highly valuable), benefit of using UUID's to deal with your disks is that you don't have to worry about system naming conventions and the hassles inherent with using them. For instance, if you have a disk with a specific UUID and a block device name of /dev/sda3, if you do all your work (and system/application customization) with that disk, as the name /dev/sda3, you might be in for a big headache if you have a system problem (or just install some new hardware and reconfigure) and Linux decides to rename /dev/sda3 as /dev/sdb3 (or "anything" else). If you're using UUID's, the rename doesn't hurt: the UUID is stored in the filesystem itself, so it follows the filesystem to whatever device name it ends up with, and anything that refers to the UUID keeps working without causing any issues with your Linux OS. And if you ever need to set a specific UUID on a filesystem (after restoring onto a new partition, for instance), the "tune2fs" command (shown below) will do it :)

1. If you don't know the UUID of your disk, you can find it by using one of the several commands below:

host # vol_id /dev/sda3
...
ID_FS_UUID=a1331d73-d640-4bac-97b4-cf33a375ae5b
...

or:

host # blkid /dev/sda3 <-- Leave blank to show all disks
/dev/sda3: LABEL="/" UUID="a1331d73-d640-4bac-97b4-cf33a375ae5b" SEC_TYPE="ext3" TYPE="ext2"

also:

host # ls -l /dev/disk/by-uuid|grep sda3
lrwxrwxrwx 1 root root 10 11. Okt 18:02 a1331d73-d640-4bac-97b4-cf33a375ae5b -> ../../sda3

2. If you prefer to generate your own UUID's (see above), you can use the uuidgen command and couple it with tune2fs to change the default UUID assigned to your disk by the system, like this:

host # uuidgen
1d721189-7b71-4315-95a7-1c3abc90d379
host # tune2fs -U 1d721189-7b71-4315-95a7-1c3abc90d379 /dev/sda3

3. Then again, if you already know the UUID, you might want to find out what disk it's associated with. You can generally get this information with the "findfs" command, like so:

host # findfs UUID=a1331d73-d640-4bac-97b4-cf33a375ae5b
/dev/sda3

Of course, using some of the commands above and grepping out part of the UUID will also get you your answer, like:

host # ls -l /dev/disk/by-uuid|grep a1331d73-d640-4bac-97b4-cf33a375ae5b
lrwxrwxrwx 1 root root 10 11. Okt 18:02 a1331d73-d640-4bac-97b4-cf33a375ae5b -> ../../sda3

or

host # blkid|grep a1331d73-d640-4bac-97b4-cf33a375ae5b <-- remember that blkid with no arguments returns all of the system disk
/dev/sda3: LABEL="/" UUID="a1331d73-d640-4bac-97b4-cf33a375ae5b" SEC_TYPE="ext3" TYPE="ext2"

4. And, lastly (for this post, at least ;), you can mount your disks using the UUID, and even incorporate that automated UUID mounting into your /etc/fstab. To mount directly from the command line, you can do something like this:

host # mount -U a1331d73-d640-4bac-97b4-cf33a375ae5b /directory/you/mount/this/disk/on

and you could instruct your system to mount this partition by UUID from within the fstab, as well. It works basically the same way that the LABEL keyword does:

host # cat /etc/fstab
...
UUID=a1331d73-d640-4bac-97b4-cf33a375ae5b /directory/you/mount/this/disk/on ext3 defaults 1 1
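As a rough sketch (not part of the original post), a small shell function like the one below could list which mount points an fstab mounts by UUID. The function name fstab_uuid_entries is made up for illustration. It only reads the file you hand it and changes nothing:

```shell
# fstab_uuid_entries FILE
# Hypothetical helper: print "UUID mountpoint" for every entry in FILE
# whose first field uses the UUID= keyword. Commented-out entries are
# skipped automatically, because their first field starts with "#".
fstab_uuid_entries() {
    awk '$1 ~ /^UUID=/ {
        sub(/^UUID=/, "", $1)   # strip the keyword, keep the UUID itself
        print $1, $2
    }' "$1"
}
```

You could then run it against the live file and feed each UUID to findfs to confirm it still resolves to a device:

host # fstab_uuid_entries /etc/fstab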

And, at this point, you should be able to find your way around using UUIDs to manipulate your disks on Linux with no problem. Enjoy, and please "be careful" :)

Cheers,

, Mike

Please note that this blog accepts comments via email only. See our Mission And Policy Statement for further details.

A comment from Curt, who despises UUID:

I despise UUID with a passion!
An example:
One of my systems has two 200GB hard drives.
Each has over 16 partitions, multiple operating systems and
various data partitions. Now add an external USB hard drive
for backup and restore of partitions.
Imagine trying to figure out how to mount hd1,12 and back it up
to sde4.

Nightmare!

LABEL on the other hand is usable and understandable by humans.
LABEL allows a USB flash drive to always mount the same, solving
that problem.
I also dislike using SCSI for PATA drives, limiting partitions to 15,
but that is another story.

Eliminate UUID's and the Microsoft's that create them,
Curt

Wednesday, June 4, 2008

Shell Script To Monitor Disk Usage On Linux and Unix

Hey There,

Today, we're going to take a look at a simple shell script to monitor disk space usage. It's been quite a while since we've touched on that, going back to a post from last November regarding finding space hogs on overlay mounts. The script has been kept simple (basically checking every partition for one fixed percentage full) to highlight other features.

The main intent here was to set up a monitor that would be able to handle a variety of Linux and Unix Operating Systems (all dependent on the "uname -s" output from that system) and focus on that area distinctly. We've limited our initial list to HP-UX, Solaris, SCO and OpenBSD.

In this case we're using a simple case statement to enumerate through the four *nix's we have listed here. Obviously, we could easily add more operating systems, and their variations of the "df" command, to our list and, if it ever got too big, either roll them into an array or simplify the script so that more OS's would fall under the same umbrella.
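To make the idea concrete, here's a minimal sketch of that dispatch pulled out into its own function. The name df_command_for is hypothetical; the three branches simply mirror the ones in the script further down, so adding another OS would mean adding one more pattern:

```shell
# df_command_for OSNAME
# Hypothetical helper: map "uname -s" output to the disk-usage command
# the script uses for that OS. Anything unlisted falls back to "df -v".
df_command_for() {
    case "$1" in
        SunOS | OpenBSD ) echo "df -k" ;;
        HP-UX )           echo "bdf" ;;
        * )               echo "df -v" ;;
    esac
}
```

Calling it with the local OS name would look like this:

host # df_command_for `uname -s`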

Also, notice that we're stepping through parsing of the df output more tediously than is actually necessary. For instance, the creation of the df output could be parsed with sed all in one fell swoop. Again, although it is generally considered best practice to compact your script/code, our hope here was that this would be easy to follow for as many people as possible. Some folks learn better by tackling the tough-stuff and working back to basics and some of us learn better by starting with the basics and putting them all together to create the tough-stuff. It's a long and convoluted statement of philosophy, to be sure, but fairly descriptive of what we actually mean ;)

If you prefer, on the line where we parse the TABLE file and trim it with sed, you can take out this part (or add to it) as it was placed in there as an example of how to ignore a specific partition (/usr):

-e '/usr$/d'

So, the line:

sed -e '1d' -e '/usr$/d' ${BASEDIR}/TABLE >> ${BASEDIR}/TABLE2

could be changed to:

sed -e '1d' ${BASEDIR}/TABLE >> ${BASEDIR}/TABLE2

which could then be simplified even further (since you don't need to use -e, even though it's okay to, if you only have one instruction to pass to sed) to become this:

sed '1d' ${BASEDIR}/TABLE >> ${BASEDIR}/TABLE2

And the entire script could be thusly compacted and streamlined, etc.
In any event, I hope this can be of some help (or, at least, an inspiration to reach higher ;) to you!
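For anyone curious what the compacted version might look like, here's a minimal sketch that drops the header, strips the percent sign and checks the threshold in a single awk pass, with no temporary files. It assumes the six-column "df -k" layout seen on Solaris (capacity in column five, mount point in column six), and the function name over_limit is made up for this example:

```shell
# over_limit LIMIT
# Hypothetical helper: read "df -k"-style output on stdin and print a
# warning line for every partition whose capacity exceeds LIMIT percent.
over_limit() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)               # "95%" becomes "95"
        if (pct + 0 > limit + 0)
            print $6 " partition at " pct "% capacity"
    }'
}
```

On a Solaris box, that could be used like so:

host # df -k | over_limit 90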

Best wishes,

This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!/bin/ksh

#
# dfvk.sh - Check partition % full
# across multiple OS'
# 2008 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

trap 'rm -f ${BASEDIR}/TABLE ${BASEDIR}/TABLE2 ${BASEDIR}/MAILER;exit 1' 1 2 3 15

# Use an absolute path so the script behaves the same from any working directory
BASEDIR="/tmp"
LOGHOST=`hostname`
MAILEES="you@yourhost.com"
LIMIT=90
OSVERSION=`uname -s`

case $OSVERSION in

        SunOS | OpenBSD )
                df -k >> ${BASEDIR}/TABLE
                ;;
        HP-UX )
                bdf >> ${BASEDIR}/TABLE
                ;;
        * )
                df -v >> ${BASEDIR}/TABLE
                ;;

esac

echo "$LOGHOST is getting ready to bother you!" >> ${BASEDIR}/MAILER
echo >> ${BASEDIR}/MAILER

sed -e '1d' -e '/usr$/d' ${BASEDIR}/TABLE >> ${BASEDIR}/TABLE2
sed 's/%//g' ${BASEDIR}/TABLE2 > ${BASEDIR}/TABLE
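while read ONE TWO THREE FOUR FIVE SIX
do

        # Note: no "break" in the branches below. Inside this loop, "break"
        # would stop the scan after the first partition over the limit.
        case $OSVERSION in

                HP-UX | SunOS | OpenBSD )
                                # Here column five holds the percentage, column six the name to report
                                if [ "$FIVE" -gt "$LIMIT" ]
                                then
                                        print "$SIX partition at ${FIVE}% capacity" >> ${BASEDIR}/MAILER
                                fi
                        ;;
                * )
                                # Here column six holds the percentage, column one the name to report
                                if [ "$SIX" -gt "$LIMIT" ]
                                then
                                        print "$ONE partition at ${SIX}% capacity" >> ${BASEDIR}/MAILER
                                fi
                        ;;

        esac

done < ${BASEDIR}/TABLE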

MAILCOUNT=`cat ${BASEDIR}/MAILER |wc -l`
if [ $MAILCOUNT -gt 2 ]
then
        cat ${BASEDIR}/MAILER |mailx -s "$LOGHOST : Potential Paging Threat!" $MAILEES
fi

rm -f ${BASEDIR}/TABLE ${BASEDIR}/TABLE2 ${BASEDIR}/MAILER

, Mike

Sunday, May 25, 2008

Safely Patching Your Veritas Root Mirror Disk On Linux Or Unix

Hey there,

It's been a long time since we've taken a look at anything "Veritas" (almost a few months now since we published a few posts regarding disk groups and volume groups in Veritas Volume Manager for Linux and/or Unix). Given the relatively broad nature of this blog, I sometimes wonder how we can ever stay entirely focused on any "one" thing for too long ;)

But, enough about us... For this "Lazy Sunday" post we're going to take a look at patching a root (or boot) mirror disk in VxVM safely. And by safely, we mean that you'll be able to fail-back to your root mirror disk as if nothing ever happened. That is, if something awful actually does happen.

The basic concept is simple, and applies to all brands and methods of root disk mirroring. When you're faced with having to apply patches to your OS (which invariably involves changes to your root disk), you always want to make sure that your root mirror is "golden" before you begin. You also want to make sure that it's taken out of the equation for the initial patch run, so you'll have a perfect failback device (less sweat, no accounting for tears ;)

The first thing you'll want to do, as per above, is to validate your root disk's mirror disk. For Veritas Volume Manager, every volume associated with the root disk must (well, technically, "should") have, at least, a single subdisk for each and every plex on the root disk and the root mirror disk.

For our example today, we'll consider that our root disk is c0t0d0s2 and its mirror is c1t0d0s2. They both belong to the default Veritas disk group: rootdg. Please also note that a lot of this output is "mocked up" to a certain degree since I'm not in a position to actually disassociate volumes on the computers I'm using for the sake of this post :)

You can check the state of your volumes with the "vxprint" command, like so (we'll use the ellipses (...) to indicate output that I've trimmed to keep this post under 50,000 words ;) :

host # vxprint -htqg rootdg <--- This output has been truncated as well, to highlight the mostly one-to-one relationship between subdisks (sd) and plexes (pl). As you can see, each of our two volumes on our rootdisk has at least one subdisk associated with each plex. We're going to ignore root_disk-B0 for this post (or not go into it too much) as this isn't really a "volume" but a way Veritas gets around the fact that it uses the part of the disk that most operating systems reserve (the bootblock - This, again, is enough material for another post entirely)

Disk group: rootdg 

dg rootdg       default ...
dm root_disk    c0t0d0s2 ...
dm root_mirror  c1t0d0s2 ...

sd root_diskPriv        - ...

v root_volume       - ...
pl root_volume-01   root_volume ...
sd root_disk-B0 root_volume-01 ...
sd root_disk-02 root_volume-01 ...
pl root_volume-02   root_volume ...
sd root_mirror-01       root_volume-02 ...

v swap_volume       - ...
pl swap_volume-01   swap_volume ...
sd root_disk-01 swap_volume-01 ...
pl swap_volume-02   swap_volume ...
sd root_mirror-02       swap_volume-02 ...
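If you'd rather not eyeball that listing, a quick count of plexes per volume can be sketched in awk. This is only an illustration: the function name plex_counts is made up, and it assumes the "vxprint -ht" layout shown above, where every plex line starts with "pl" and carries its volume name in the third column:

```shell
# plex_counts
# Hypothetical helper: read "vxprint -ht"-style output on stdin and
# print "volume plex_count" for every volume that owns at least one plex.
plex_counts() {
    awk '$1 == "pl" { n[$3]++ }
         END { for (v in n) print v, n[v] }' | sort
}
```

Any volume that reports fewer than 2 plexes isn't mirrored and deserves a closer look before you go any further:

host # vxprint -htqg rootdg | plex_counts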

Now that we know we're good, even though it may have already been done, I find it's always good practice to install a new bootblock on the root mirror disk from the main root disk. The worst case scenario (assuming no typos ;) would be that you updated an existing bootblock with one that should, theoretically, be an exact match for your primary root disk (which is what we want) :

host # /usr/lib/vxvm/bin/vxbootsetup -g rootdg root_mirror

If you have other partitions on your root disk, that aren't listed in your vxprint output of the rootdg above, you can define them with the vxmksdpart command. You might have your /opt partition on the root disk, but not in the rootdg. Sometimes you'll see /home or even /var on the rootdisk but not associated with the rootdg. While it's considered "best practice" by Veritas to add these partitions to the rootdg before separating the disks, I've found that it's never actually been "necessary." The idea is that you associate the partitions, just so you can disassociate them a few minutes later (???)

Next, we'll disassociate (see what I mean ;) the root mirror disk plexes from the root disk, like so (you can verify that, for instance, swap_volume-02 is associated with the mirror disk in the vxprint output above):

host # vxplex -g rootdg dis root_volume-02
host # vxplex -g rootdg dis swap_volume-02

Now, we'll simply mount the root filesystem from the disassociated mirror disk on a temporary directory on the root disk and make a few quick file backups and edits, like so (Note that, for most Linux flavours, /etc/system noted below is actually /etc/sysctl.conf and /etc/vfstab is /etc/fstab):

host # mkdir /vxtmp
host # mount /dev/dsk/c1t0d0s0 /vxtmp
host # cp /vxtmp/etc/system /vxtmp/etc/system.old
host # cp /vxtmp/etc/vfstab /vxtmp/etc/vfstab.old
host # cp /vxtmp/etc/vfstab.prevm /vxtmp/etc/vfstab
host # touch /vxtmp/etc/vx/reconfig.d/state.d/install-db

Now, in the /vxtmp/etc/system file, we'll comment out the following two lines (remember that in the /etc/system file the "*" is the comment character. You probably already know that, but I feel responsible ;) -- Edit the following two lines so that they are now commented:

* rootdev ...
* set vxio ...
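If you'd like to script that edit instead of doing it by hand, a sed filter along these lines could do it. Treat it as a sketch under assumptions: the function name comment_vx_lines is made up, and it assumes both lines start in the first column exactly as "rootdev" and "set vxio":

```shell
# comment_vx_lines
# Hypothetical helper: read an /etc/system-style file on stdin and
# prefix "* " (the comment marker) to the rootdev and "set vxio" lines.
# Every other line passes through untouched.
comment_vx_lines() {
    sed -e 's/^rootdev/* &/' -e 's/^set vxio/* &/'
}
```

Since we already saved a copy as system.old, that copy can serve as the input:

host # comment_vx_lines < /vxtmp/etc/system.old > /vxtmp/etc/system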

Then we'll unmount the root mirror disk on /vxtmp:

host # umount /vxtmp

and we're ready to patch! Assuming that everything goes swimmingly, all we need to do is reattach the root mirror disk plexes to the root disk, like this:

host # vxplex -g rootdg att root_volume root_volume-02
host # vxplex -g rootdg att swap_volume swap_volume-02

The root disk should sync itself up so that the root mirror disk gets updated (which you can monitor with "vxtask"). And, you're all set :)

Now... If things go bad... The official explanation is so long and ridiculous (and differs for versions up to 3.5 and newer versions), that I'll refer you to an actual official document from Veritas online support that will show you a neat trick to get around having to jump through 15 or 16 hoops to get this all over with ;) Another glorious example of the system raging against itself :)

Cheers,

, Mike

Thursday, May 15, 2008

Finding An "Invisible" Proc's Working Directory Without lsof On Linux Or Unix

Ahoy there,

Today, we're going to take a look at something that gets taken for granted a lot these days. lsof (a fine program, to be sure. No debate here) has become a very common staple for finding out information about processes, and where they're hanging out, on most Linux and Unix systems today. Much like the command "top," it provides a simple and robust frontend to having to do a lot of grunt-work to achieve the same results.

I find that, for the most part, lsof is used to find out where a process is, or what filesystems, etc, it's using, in order to troubleshoot issues. One of the most common is the "mysteriously full, yet empty, disk" phenomenon. Every once in a while that will turn out to be an issue where all of the inodes in a partition have been used before all of the blocks have, which produces confusing output in df, leading to the mistaken assumption that there is plenty of space left on a device even when there isn't.

However, many times, that empty-yet-full disk is the victim of a process that met an untimely demise and never cleaned up a lot of temporary space in memory (or virtual disk, to split hairs). Another issue that lsof is used for is to find out which dag-nabbed process is holding onto a mount-point that claims it's in use when no one is logged on and no user processes are running that would access it (for instance, a really specific, user-defined, mountpoint like /whereILikeToPutMyStuff - Hopefully the OS isn't depending on this to be around ;) Both problems are, essentially, the same.

However, should you find yourself in a situation where lsof either doesn't come with your Operating System, and/or hasn't been installed, you can still break down these two (and I'm just limiting the post to these two particulars so I don't end up writing an embellished manpage ;) separate issues into one, and find the solution to your problems using the commonly available "pwdx" utility.

At its simplest, pwdx prints out the working directory of any given process (using the process ID as input). But this is enough to get you to the answer you need.

For instance, we'll take this common scenario: /tmp is reporting 100% full, but df -k shows that /tmp is only at 1% capacity (99% of it is unused). My thinking here almost immediately gravitates toward vi, or some other program that opened up a buffer in memory (using /tmp or /var/tmp), got clipped unexpectedly and never let the system know that it was done with the space it allocated for itself. This would normally not be an issue but, since your Linux or Unix machine "thinks" /tmp is full, whether or not it actually is makes no difference. It won't let you use the free space :(

This command line could be used to figure out what process was using that space in /tmp or /var/tmp:

host # ps -ef|awk '{print $2}'|xargs pwdx 2>&1|grep -iv cannot|grep /tmp
2969: /tmp

Taking it a step further (assuming we trust our own output), we could just skip right to the process in question by adding a bit more to the pipe-chain:

host # ps -ef|awk '{print $2}'|xargs pwdx 2>&1|grep -iv cannot|grep /tmp|sed 's/^\([^:]*\).*$/\1/'|xargs -n1 ps -fp
     UID   PID  PPID   C    STIME TTY         TIME CMD
    root  2969  2966   0   Mar 21 ?           0:05 /bin/vi /home/george/myHumungousFile

Since it's May already, we can fairly assume that this PID is pointing to a dead process (especially since it has no TTY associated with it), and (double-checking, just to be sure) we can probably solve our problem by killing that PID. See our previous post on killing zombie processes if it won't seem to go away and "ps -el" shows it in a Z state.

Yes, that example was pretty simplistic, but the same methodology can be used to find other programs using up other filesystems. Just like "lsof -d," you'll be able to find out what processes are using what filesystems and narrow down your list of suspects, if you don't nail the correct one right away. Since pwdx comes with your Linux or Unix OS, it's actually statistically more likely than lsof to be correct about what process is using what filesystem :)
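One caveat with the pipelines above: "grep /tmp" also matches directories like /var/tmp or /tmpfiles. If you need a stricter match, a small awk filter can do it. This is just a sketch: the function name pids_under is made up, and it assumes pwdx prints "PID:" followed by whitespace and the directory, as in the output above:

```shell
# pids_under DIR
# Hypothetical helper: read pwdx-style lines ("PID: /some/dir") on stdin
# and print the PIDs whose working directory is DIR or lies below it.
pids_under() {
    awk -F':[ \t]+' -v dir="$1" '
        $2 == dir || index($2, dir "/") == 1 { print $1 }'
}
```

It drops into the same pipe-chain in place of the final grep:

host # ps -ef|awk '{print $2}'|xargs pwdx 2>&1|grep -iv cannot|pids_under /tmp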

Cheers,

, Mike

linux unix internet technology

Tuesday, May 6, 2008

ZFS Command Sheet For Solaris Unix 10 - Pool And File System Creation

Hey There,

Today, we're going back to the Solaris 10 Unix well and slapping together a few useful commands (or, at least, a few commands that you'll probably use a lot ;). We've already covered ZFS, and Solaris 10 zones, in our previous posts on creating storage pools for ZFS and patching Solaris 10 Unix zones, but those were more specific, while this post is meant to be a little quick-stop command repository (and only part one, today). This series also is going to focus more on ZFS and less on the "zone" aspect of the Solaris 10 OS.

Apologies if the explanations aren't as long as my normal posts are. ...Then again, some of you may be thanking me for the very same thing ;)

So, without further ado, some Solaris 10-specific commands that will hopefully help you in a pinch :) Note that for all commands where I specify a virtual device or storage pool, you can get a full listing of all available devices/pools by "not specifying" any storage pool. I'm just trying to keep the output to the point so this doesn't get out of hand.

Today we're going to take storage pools and ZFS file systems and look at creation-based commands, tomorrow we'll look at maintenance/usage commands, and then we'll dig on destructive commands and cleaning up the mess :)

1. To create virtual devices (vdevs), which can, technically, be virtual (disk made from a part, or parts, of real disk) or "real" disk if you have it available to you, you can do this:

host # mkfile 1g vdev1 vdev2 vdev3
host # ls -l vdev[123]
-rw------T   1 root     root     1073741824 May  5 09:47 vdev1
-rw------T   1 root     root     1073741824 May  5 09:47 vdev2
-rw------T   1 root     root     1073741824 May  5 09:48 vdev3

2. To create a storage pool, and check it out, you can do the following:

host # zpool create zvdevs /vdev1 /vdev2 /vdev3
host # zpool list zvdevs <--- Don't specify the name of the pool if you want to get a listing of all storage pools!
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
zvdevs 2.98G 90K 2.98G 0% ONLINE -

3. If you want to create a mirror of two vdevs of different sizes, this can be done, but you'll be stuck with the smallest possible mirror (as it would be physically impossible to put more information on one disk than it can contain. That seems like common sense ;)

host # zpool create -f zvdevs mirror /vdev1 /smaller_vdev <--- The mirrored storage pool will be the size of the "smaller_vdev"

4. If you want to create a mirror, with all the disks (or vdevs) the same size (like they should be :), you can do it like this:

host # zpool create zvdevs mirror /vdev1 /vdev2 /vdev3 /vdevn... <--- I haven't hit the max yet, but I know you can create a "lot" of mirrors in the same set. Of course, you'd be wasting a lot of disk and it would probably make data access slower...

host # zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
myzfs 95.5M 112K 95.4M 0% ONLINE -
host # zpool status -v zvdevs
pool: zvdevs
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
zvdevs ONLINE 0 0 0
mirror ONLINE 0 0 0
/vdev1 ONLINE 0 0 0
/vdev2 ONLINE 0 0 0
/vdev3 ONLINE 0 0 0

errors: No known data errors

5. You can create new directories, add file systems on them and mount them in your storage pool very easily. All you need to do is "create" them with the "zfs" command. Three tasks in one! (as easy as creating a pool with the zpool command):

host # zfs create zvdevs/vusers
host # df -h zvdevs/vusers
Filesystem size used avail capacity Mounted on
zvdevs/vusers 984M 24K 984M 1% /zvdevs/vusers

6. If you need to create additional ZFS file systems, the command is the same, just lather rinse and repeat ;)

host # zfs create zvdevs/vusers2
host # zfs create zvdevs/vusers3
host # zfs list |grep zvdevs
zvdevs 182K 984M 27.5K /zvdevs
zvdevs/vusers 24.5K 984M 24.5K /zvdevs/vusers
zvdevs/vusers2 24.5K 984M 24.5K /zvdevs/vusers2
zvdevs/vusers3 24.5K 984M 24.5K /zvdevs/vusers3
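Once your pools exist, you'll probably want a quick way to spot one that isn't healthy. Here's a minimal sketch that filters "zpool list" output; it assumes the column layout shown in step 2 (HEALTH in the sixth column), and the function name unhealthy_pools is made up for this example:

```shell
# unhealthy_pools
# Hypothetical helper: read "zpool list"-style output on stdin (header
# line included) and print "pool state" for every pool whose HEALTH
# column is anything other than ONLINE.
unhealthy_pools() {
    awk 'NR > 1 && $6 != "ONLINE" { print $1, $6 }'
}
```

No output means every pool reported ONLINE:

host # zpool list | unhealthy_pools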

See you tomorrow, for more fun with Solaris 10 ZFS/Storage Pool maintenance/usage commands :)

Cheers,

, Mike


Sunday, April 6, 2008

Troubleshooting To Find The Bottleneck On Unix and Linux

Hey there,

Today, we're going to follow up on yesterday's post regarding the definitions of swapping and paging on Linux and Unix, as well as our humble follow up post on clarification of the definitions of paging and swapping, with a tutorial on basic troubleshooting. Today we'll get our primary examples from Solaris Unix and point out the differences, where they exist, in extracting the same information from Linux. Our only really unique bent is that we'll be coming at the issue by considering paging and swapping activity and going from there, assuming no knowledge of what could possibly be causing the problem. All we know is we have a server that's "really slow," which is "not good" ;)

The first thing we'll do is hop on the machine and take a look at vmstat. You'll note that, in each example, I'm zeroing out the values we don't need to look at to make the individual examples easier to read.

This is what we see:

host # vmstat 1 5 <--- We're running vmstat to give us output every 1 second 5 times. We've removed the first line from the output below, because it is always a "summary line" (averaging all recorded activity since the last reboot) and can sometimes be misleading.

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s6 sd sd --   in   sy   cs us sy id
 - - - - - s u m m a r y l i n e - r e m o v e d - on - p u r p o s e - - - - -
 67 0 0  0     0     0   0  0  0  0  0  0  0  0  0  0    0    0    0  0  0  3
 56 0 0  0     0     0   0  0  0  0  0  0  0  0  0  0    0    0    0  0  0  9
 57 0 0  0     0     0   0  0  0  0  0  0  0  0  0  0    0    0    0  0  0  12
 64 0 0  0     0     0   0  0  0  0  0  0  0  0  0  0    0    0    0  0  0  6

This condition would indicate that we've probably got a problem with our CPU (Remember all the zeroed out values are assumed to be "normal" for this server). The "run queue" is very high (the first column "r"), which indicates that there is an average of approximately 61 processes waiting for CPU execution time at any given second. We couple this with the fact that the CPU "idle time" (the last column "id") is very low, along with the fact that there's no indication of any paging or swapping activity at all (which will almost never happen, really), and it becomes fairly obvious that the bottleneck lies with the CPU.

No special options to vmstat are required to see this information on Linux, but the "id" column is generally second in from the right.
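If you'd rather let the machine do the arithmetic, the run queue and idle averages can be pulled out with a little awk. This is a sketch, not gospel: the function name vmstat_avg is made up, it assumes the Solaris layout above (run queue first, idle last), and it assumes you've already dropped the summary line. On Linux you'd have to point it at the right "id" column instead of the last one:

```shell
# vmstat_avg
# Hypothetical helper: read vmstat-style output on stdin and print the
# average of the first column (run queue) and the last column (idle).
# Header lines are skipped because their first field is not a number.
vmstat_avg() {
    awk '$1 ~ /^[0-9]+$/ { r += $1; id += $NF; n++ }
         END { if (n) printf "avg_r=%.1f avg_id=%.1f\n", r / n, id / n }'
}
```

On Solaris, where the summary line is the third line of output, that might look like:

host # vmstat 1 5 | sed '3d' | vmstat_avg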

Now, if we change this output slightly, the bottleneck most probably changes to our system's memory:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s6 sd sd --   in   sy   cs us sy id
 - - - - - s u m m a r y l i n e - r e m o v e d - on - p u r p o s e - - - - -
 67 0 0  0     0     12  0  0  0  0  0  13 0  0  0  0    0    0    0  0  0  77
 56 0 0  0     0     15  0  0  0  0  0  12 0  0  0  0    0    0    0  0  0  67
 57 0 0  0     0     10  0  0  0  0  0  14 0  0  0  0    0    0    0  0  0  78
 64 0 0  0     0     8   0  0  0  0  0  11 0  0  0  0    0    0    0  0  0  85

Notice that in this next example, pretty much everything is the same, but the "scan rate" (the "sr" column) and "page reclaim" (the "re" column) values have increased dramatically. Generally, numbers like the ones I'm posting here wouldn't make my pulse change, but, for the sake of argument, we'll assume that the "scan rate" and "page reclaim" rate have been flatlining at 0 ever since this server launched. An increase in the "scan rate" indicates that the system is paging more heavily; that is, it's spending a lot of CPU cycles trying to manage writing from memory to disk and from disk to memory. One might assume that this situation would indicate a problem with the CPU, but the "idle time" doesn't agree with that assumption. Also, it helps to keep in mind that paging occurs more frequently when the system runs out of real physical memory to read from, and write to, and has to revert to using "disk based" virtual memory, which it interacts much more slowly with. This is bolstered by the additional "page reclaim" activity that is going on. Adding physical memory to this server will probably fix the bottleneck.

On Linux, in order to grab comparative "page reclaim" and "scan rate" values, you'll need to take a look at /proc/vmstat, like so:

host # cat /proc/vmstat|grep pgscan
pgscan_kswapd_high <--- These statistics are for the generic "scan rate"
pgscan_kswapd_normal
pgscan_kswapd_dma32
pgscan_kswapd_dma
pgscan_direct_high <--- This is where the generic "page reclaim" statistics begin
pgscan_direct_normal
pgscan_direct_dma32
pgscan_direct_dma

You can also check, specifically, for pages reclaimed by "inode stealing."

host # cat /proc/vmstat|grep pginodesteal

There also may be other variations, depending on the flavor of Linux you're running. Catting /proc/vmstat and doing a:

host # cat /proc/vmstat|egrep 'scan|steal'

should get you most, if not all, of them.
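Since those counters are split up per memory zone, it can help to add them together before comparing against your baseline. Here's a minimal sketch that does so, assuming the usual "name value" format of /proc/vmstat; the function name sum_counters is made up for this example:

```shell
# sum_counters PREFIX
# Hypothetical helper: read /proc/vmstat-style "name value" lines on
# stdin and print the sum of every counter whose name begins with PREFIX.
sum_counters() {
    awk -v prefix="$1" '
        index($1, prefix) == 1 { total += $2 }
        END { printf "%.0f\n", total }'
}
```

Run it twice, a few seconds apart, and the difference gives you a rough rate:

host # sum_counters pgscan_kswapd < /proc/vmstat
host # sum_counters pgscan_direct < /proc/vmstat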

And, in our final permutation, the bottleneck becomes the disk:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s6 sd sd --   in   sy   cs us sy id
 - - - - - s u m m a r y l i n e - r e m o v e d - on - p u r p o s e - - - - -
 0 67 0  0     0     0   0  0  0  0  0  0  33 0  0  0    0    0    0  0  0  77
 0 56 0  0     0     0   0  0  0  0  0  0  98 0  0  0    0    0    0  0  0  67
 0 57 0  0     0     0   0  0  0  0  0  0  89 0  0  0    0    0    0  0  0  78
 0 64 0  0     0     0   0  0  0  0  0  0  94 0  0  0    0    0    0  0  0  85

Again, these numbers aren't crazy, but now we've got an entirely different situation on our hands. We can see, from this example, that our only disk (the "s6" column) is now starting to do some heavy writing (yes, I know the numbers aren't really that big. This is all relative ;) This might not be an issue, but the server's "slow." We couple this with a drastic increase in processes that are blocked waiting for I/O (the "b" column), along with a lack of significant swapping and/or paging activity, and the disk begins to look like the favorite. The "b" column can generally be considered a very vague indicator of the actual issue, since it reports on processes blocked waiting for resources no matter what they are (the CPU could be slow, the memory could be bad; even the network could be down). However, when we combine this with the fact that disk read/write activity has increased, the disk becomes our most probable bottleneck on the system.

On Linux, using either "vmstat -d" or "vmstat -p ARGUMENT" (with ARGUMENT being a specific partition) will get you the disk statistics.

These examples have been fairly stark. Always keep in mind that, outside of this vacuum, it's always good practice to keep regular tabs on your servers. For instance, you might run vmstat a few times a day (or a few times an hour) and record the results. When you get into habits like that, you're much better prepared when a problem does arise, as you'll have a good "baseline" to refer to when dealing with abnormal behaviour on any of your systems. Everyone will notice the biggest number in the vmstat output when they're frantically trying to figure out what's wrong, but (armed with your "baseline" knowledge) you'll know if that number has always been gigantic, in which case you can ignore it and go on to fix the actual problem :)
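A baseline is only useful if you can tell when each sample was taken, so here's a minimal sketch of a filter that stamps every line it reads. The function name stamp_lines and the log path in the usage line are both made up for illustration:

```shell
# stamp_lines STAMP
# Hypothetical helper: prefix every line read from stdin with STAMP,
# so samples collected at different times can share one log file.
stamp_lines() {
    stamp=$1
    while IFS= read -r line
    do
        printf '%s %s\n' "$stamp" "$line"
    done
}
```

Dropped into cron, something like this would append one timestamped sample per run:

host # vmstat 5 2 | tail -1 | stamp_lines "`date '+%Y-%m-%d %H:%M'`" >> /var/tmp/vmstat.baseline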

Hopefully this little exercise has been helpful to you, even if only in the tiniest way :)

Cheers,

, Mike


Saturday, April 5, 2008

Further Dissection Of Paging And Swapping On Linux And Unix

Hey again,

Believe it or not, I actually got a few emails about our previous post on paging and swapping in Linux or Unix because it wasn't specific enough ;) While I can certainly understand the frustration, I was hoping to explain the main differences as completely and concisely as possible. I seem to have faltered a bit on each front: I glossed over two specifics which, I agree, deserve some attention and I wrote yet another novel ;)

With that in mind (and with a prayer that my fingers won't type any more than they have to ;) I'd like to address, and/or clarify, the level of depth I didn't descend to in my last post on the difference between paging and swapping and write about the difference between paging and swapping (The redundancy was intentional and any resulting confusion is expected, given the topic at hand and my writing style ;)

As I mentioned previously, the terms paging and swapping are used almost interchangeably these days. Some industry manuals will actually talk about "swapping out pages" which seems to be contradictory and, theoretically, impossible if swapping and paging are two separate concepts with distinct and unique definitions. This is where language and implied meaning become a barrier to actual definition. And, all the more reason to clarify this one last bit of the puzzle.

And here they come. The extra clarifications...

1.  Difference in resident virtual memory management with paging and swapping.

When a system swaps a program, or process, it guarantees that it is resident (on disk or in memory) before it schedules it for execution, and will often hold onto the mapped resources reserved for that process from the time the process requests them until it notifies the scheduler that it is complete. When a system pages during the execution of a program, or process, there isn't any such direct correlation. You don't necessarily know (without specifically checking) how much of a process's virtual memory is resident or whether the process is entirely able to be scheduled for execution at the time paging begins. Pages can be selectively grabbed from a process (out of mapped physical memory) and never returned, unless re-requested by the process when it looks for the memory, can't find it and generates a page fault.

Phew... That's one down. Hopefully this isn't just becoming more confusing :)

2. More specific definition of paging and swapping with regard to page ins/outs and swap ins/outs.

In this instance, swapping specifically refers only to the transfer of memory pages from physical memory to dedicated swap devices or swap disk (on most systems this is now referred to as swapfs - or a unique swap filesystem) and vice versa. Paging, on the other hand, refers to the transfer of memory pages from physical memory to disk (regular disk or swap disk) and vice versa. So, really, the major difference is that swapping is limited to only transferring memory pages back and forth between the physical memory and a dedicated swap device or filesystem, while paging can transfer between physical memory and any sort of disk device.
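On Linux, you can see this split reflected in the kernel's own counters: /proc/vmstat keeps the page in/out totals (pgpgin, pgpgout) apart from the swap in/out totals (pswpin, pswpout). The sketch below just pulls those four lines out. The function name paging_counters is made up, and exactly what each counter includes can vary with your kernel version, so treat the numbers as a relative guide:

```shell
# paging_counters
# Hypothetical helper: read /proc/vmstat-style "name value" lines on
# stdin and print only the page in/out and swap in/out counters.
paging_counters() {
    awk '$1 == "pgpgin" || $1 == "pgpgout" ||
         $1 == "pswpin" || $1 == "pswpout" { print $1 "=" $2 }'
}
```

If the pswp values stay flat while the pgpg values climb, the activity isn't going through your swap device:

host # paging_counters < /proc/vmstat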

Hopefully, we've reached a sufficient amount of explanation at this point, and this thing won't turn into the monster I'd hoped it wouldn't become ;)

Thank you, everyone who wrote in, for your helpful input. As many folks have also noted, we don't have comments set up on this blog (because of issues with "comment spam" and not wanting to get shut down). If you ever want to leave a comment, or an objection, we welcome you to email us directly at our most often-checked email address or, if you have a lot to say (or want to attach video, etc), sign up (for free) on our sister Linux and Unix Menagerie Forum, and we'll get your remarks there, as well. We do our best to reply, personally, to everyone who takes the time to write us. So far, I think we're still batting a thousand in that regard ;)

Best wishes,

, Mike


Friday, April 4, 2008

Swapping Or Paging On Linux And Unix?

Howdy,

Here's a question that gets asked a lot, and has a relatively simple answer to go with it: On Unix and/or Linux, what's the difference between paging and swapping?

It's a relevant question, given that the terms are used almost interchangeably these days. Even in most Linux or Unix monitoring commands, the issue can become confused. Consider our previous posts on free memory graphing on Unix and graphing out paging statistics on Linux. They're both showing approximately the same thing, but one of them is using the terminology in a not-totally-correct sense.

The good news is you only need to understand one thing about each (which is also a common thread) in order to understand what the terms "really" mean. This can be a great help when you're trying to determine the cause of a system issue, like a big slow-down. Of course, since the terms are mixed up a lot, it's a good rule of thumb to assume that any problem with "paging" or "swapping" may be a problem with either. Depending upon who's asking, they could mean one thing or the other. As in public speaking, it's always a good idea to know your audience ;)

The main difference between paging and swapping (on both Linux and Unix; all flavors, as far as I know) is this:

1. Swapping: This occurs when an entire process (sometimes consisting of multiple parts like a read-only text segment, writable data segment and, more often nowadays, writable stack segment) gets transferred to disk from physical memory or is read back into physical memory from the disk.

2. Paging: This occurs when part of a process (a page, or a segment, of a process) gets transferred to disk from physical memory or is read back into physical memory from disk. Paging also requires an MMU (Memory Management Unit) and a CPU capable of handling requests from it. This is just a side note, and slightly outside the scope of the definition. It really doesn't even make a difference any more since I haven't seen an OS without paging capability in years, and most dedicated Unix/Linux servers have had the latent capability for even longer.
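On Linux, the kernel keeps separate cumulative counters for the two kinds of activity, which is a handy way to see the distinction for yourself. Here's a rough sketch, assuming a 2.6-or-later kernel that exposes pgpgin, pgpgout, pswpin and pswpout in /proc/vmstat (older kernels report this elsewhere); show_paging_counters is a hypothetical helper name:

```shell
#!/bin/bash
# Print the kernel's cumulative paging and swap counters.
# Assumption: "name value" lines as found in /proc/vmstat on 2.6+ kernels.
#   pgpgin / pgpgout : data paged in from / out to block devices in general
#   pswpin / pswpout : pages read from / written to swap space
show_paging_counters() {
    vmstat_file="${1:-/proc/vmstat}"
    awk '$1 ~ /^(pgpgin|pgpgout|pswpin|pswpout)$/ { print $1 "=" $2 }' "$vmstat_file"
}

# Only run against the live system if the file is actually there
if [ -r /proc/vmstat ]; then
    show_paging_counters
fi
```

Note that the Linux "swap" counters are counted in pages, not whole processes, so they only loosely match the classic definition of swapping given above.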

Tomorrow, we'll begin looking at real-life examples of determining a system issue highlighted by excessive paging (or is it swapping?). For today, we'll keep it abstract.

To wrap up, on today's systems (The year now being 2008 - Just dating this in case it gets read 2 years from now and I'm totally off-base by then ;) there's almost no such thing as swapping. Paging occurs normally and, if you do see actual heavy swapping, it's generally an indication of a problem with memory or disk (Except in situations where you have large applications - like an Oracle database, for instance - that hoard lots of Virtual Memory Address (VMA) space and cause the system to swap naturally). By contrast, if your system is paging heavily, but not swapping, your issue is most likely with CPU or memory. Memory is often mistakenly assumed to be the culprit in most situations because both swapping and paging involve writing to, and reading from, memory. However, you should always take into account what other component of the OS is doing the work to make that activity possible, or maybe even necessary.

One last thing to remember is that either of these situations ( excessive swapping, excessive paging or both ) could be indicators of either memory, CPU or disk issues. They could also point to a problem with your network subsystem or any number of things. The generic explanations/answers in the previous paragraph assume a relative norm. In reality, you have to look at the situation in the context of the problem you're facing on the system that's having the issue and work from there.
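Since kernel counters are cumulative from boot, "excessive" only means something when you compare two points in time. As a sketch of that idea, assuming you've saved two copies of /proc/vmstat-style output ("name value" per line) a known interval apart, the hypothetical vmstat_delta helper below prints how much each paging and swap counter grew:

```shell
#!/bin/bash
# Show the growth of the paging/swap counters between two saved snapshots.
# Arguments: $1 = earlier snapshot file, $2 = later snapshot file.
# Assumption: both files hold "name value" lines, as in /proc/vmstat.
vmstat_delta() {
    awk '
        $1 ~ /^(pgpgin|pgpgout|pswpin|pswpout)$/ {
            if (FNR == NR)
                before[$1] = $2                           # still reading the first file
            else
                printf "%s +%d\n", $1, $2 - before[$1]   # second file: print the growth
        }' "$1" "$2"
}

# Typical use on a live system (file names are only examples):
#   cat /proc/vmstat > /tmp/vm.before; sleep 10; cat /proc/vmstat > /tmp/vm.after
#   vmstat_delta /tmp/vm.before /tmp/vm.after
```

What counts as "too much" growth still depends on the system and the workload, so treat the numbers as a starting point rather than a verdict.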

We'll run down some quick and easy real-life troubleshooting starting tomorrow.

Until then, best wishes :)

, Mike

linux unix internet technology

Monday, March 10, 2008

Shell Script To Report Linux Server Hardware Information


Hey There,

Well, I guess it's about time we started putting some more shell scripts out there. The last 3 or 4 posts have all been how-to's (except the last one, which I suppose you could trim all the surrounding text and make a script out of ;) and it's high time to start hitting the shell again.

Today's offering is something we cooked up to walk the fine line between producing what a manager wants to see and what an administrator wants to see in a quick system profile. This has been tested on RedHat Linux and SUSE (only up to release 9.x). The only major difference is some extra output in the "SERVER - MEMORY" section (mostly when run on x86_64 architecture machines) that some of you may find useful.

If you're interested in something more basic, or generic, check out our previous posts on gathering system information on Solaris and gathering system information on RedHat Linux.

This is a pretty straightforward shell script offering that basically parses the output of the hwinfo command. We run it in "--short" mode for most options, but leave it long for parts where the shortening process removed vital information (Like the brand name of the server). It's formatted loosely, but is fairly easy to read. One of the things I like most about it (and the main reason I started writing it in the first place) is that it highlights the Manufacturer, Model and Serial number of the machine your Linux OS is running on. This generally isn't an issue when you're, say, running Solaris on your Sun box ;) Then, of course, I couldn't get away from putting in all the basic information about CPU, Memory, Disks, etc.

If you want to know more about your system than this little shell script will show you, the hwinfo command has a variety of options I chose not to include (Neither my manager nor I want to know about every little "debug" detail of the PCI controller unless we have to ;), but you can access just about any hardware related information using that command. Just run it as:

host # hwinfo --help

Assuming, of course, that you've run this script already as:

host # ./server_info.sh

and found it lacking.

If hwinfo isn't available on your machine (Oh, yes. Be sure you're "root" when you run this or you might not have the access required to pull some of the information hwinfo tries to get for you!), there are a number of other options available to you on SUSE, RedHat and other flavors of Linux. Off the top of my head, you can always give these commands a shot (assuming they exist ;) --> kudzu, lspci, lsusb, dmidecode and a great project (which even has a GUI now) called lshw. You should check that out if you or your manager dig this little shell script :)
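If you're not sure which of those tools a given box has, a quick check saves some guessing. This is just a sketch; available_hw_tools is a made-up function name, and adding /sbin and /usr/sbin to the search path is an assumption about where these tools usually live:

```shell
#!/bin/bash
# Report which hardware inventory tools are installed.
# Arguments: the command names to look for.
available_hw_tools() {
    (
        # Search the sbin directories too, since a non-root PATH often omits them.
        # The subshell keeps this PATH change from leaking into the caller.
        PATH="$PATH:/sbin:/usr/sbin"
        for tool in "$@"; do
            if command -v "$tool" >/dev/null 2>&1; then
                echo "$tool: found at $(command -v "$tool")"
            else
                echo "$tool: not found"
            fi
        done
    )
}

available_hw_tools hwinfo kudzu lspci lsusb dmidecode lshw
```

Finding a tool doesn't mean it will show you everything, of course. Most of them still want root to read the BIOS and PCI details.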

Cheers,

This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!/bin/bash

#
# server_info.sh - display server hardware info
#
# 2008 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
# 

# Most sections use hwinfo's one-line "--short" summaries.
# $hwinfo is left unquoted further down on purpose, so it splits into command + option.
hwinfo="/usr/sbin/hwinfo --short"
hostname=$(hostname)
separator="----------------------------------------"
echo "$separator"
echo "System Information For $hostname"
echo "$separator"
echo "$separator"
echo "SERVER - MEMORY"
echo "$separator"
# Long-form BIOS output, trimmed down to the manufacturer, model, serial number, CPU and memory lines
/usr/sbin/hwinfo --bios|grep -E 'OEM id:|Product id:|CPUs|Product:|Serial:|Physical Memory Array:|Max. Size:|Memory Device:|Location:|Size:|Speed:'|sed -e 's/"//g' -e '/^ *Speed: */s/Memory Device:/\n  Memory Device:/' -e 's/\(Max. Speed:\)/CPU \1 MHz/' -e 's/\(Current Speed\)/CPU \1 MHz/'
echo "$separator"
echo "SMP"
echo "$separator"
$hwinfo --smp
echo "$separator"
echo "CPU"
echo "$separator"
$hwinfo --cpu
echo "$separator"
echo "CD_ROM"
echo "$separator"
# Long-form CD-ROM output, reduced to the device line plus its device file and driver
/usr/sbin/hwinfo --cdrom|grep -E '24:|Device File:|Driver:'|awk -F":" '{ if ( $1 ~ /[0-9][0-9]*/ ) print $0; else print "  " $2}'|sed -e 's/^.*[0-9] //' -e 's/ //' -e 's/"//g'
echo "$separator"
echo "DISK"
echo "$separator"
$hwinfo --disk
echo "$separator"
echo "PARTITION"
echo "$separator"
$hwinfo --partition
echo "$separator"
echo "NETWORK"
echo "$separator"
$hwinfo --network
echo "$separator"
echo "NETCARD"
echo "$separator"
$hwinfo --netcard
echo "$separator"
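If you end up adding more sections, the repeated separator/title/command pattern could be wrapped in a small function. This is only a sketch of one way to do it, not how the script above is written; print_section is a hypothetical name:

```shell
#!/bin/bash
separator="----------------------------------------"

# Print a titled section, then run the given command to fill it in.
# Arguments: $1 = section title, remaining arguments = command to run.
print_section() {
    title="$1"
    shift
    echo "$separator"
    echo "$title"
    echo "$separator"
    "$@"
}

# Demo with a command that exists everywhere. On SUSE, a real section might be:
#   print_section DISK /usr/sbin/hwinfo --short --disk
print_section HOSTNAME uname -n
echo "$separator"
```

Passing the command as separate arguments also avoids relying on word splitting of a single variable.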

, Mike

linux unix internet technology
