Friday, August 15, 2008

How To Avoid Solaris Panics When Using Savecore With Veritas Volume Manager

Hey There,

Ever since Solaris 7, the savecore command has been able to act somewhat like "gcore," but better. The introduction of the "-L" flag allowed users to take a crash dump of the entire running Solaris system (much in the same way gcore does for individual processes). When invoked with this flag, savecore will take a snapshot of the live system and write it to the system dump device. This dump device has to be "dedicated," and defined in /etc/dumpadm.conf. Out of the box, this dedicated dump device is set to your system swap partition, so you may have it set up correctly even if you've never given it a second thought (or a first ;). This, in and of itself, isn't anything spectacular, since you can use "fssnap" to do pretty much the same thing and have finer control over what gets "snapped" and where it gets put. The beauty of "savecore -L /some/other/output/directory" is that, after writing the snapshot to your dedicated dump device, it will suck up that data and create crash dump files wherever you've instructed it to. It's kind of like a combination of "fssnap" and "gcore," if you could manage to get those two programs to cooperate with each other and create a crash dump of your live system.

Of course, the point of today's post isn't to debate the merits of using this feature or even (as it may seem) to glorify it in any way. The preceding paragraph was simply meant as an introduction to a point-of-disaster that is still just waiting to happen on Solaris boxes all over the world. If you prefer to keep your machines at the "ok>" prompt, this post may not be for you.

The disaster, itself, is not brought on solely by savecore, and the fault doesn't really lie with the program itself. However, if you opt to manage your filesystems with Veritas Volume Manager, depending on how you go about things, the combination of the two can be "interesting" at best ;) Of course, in order to cause this panic/crash/freak-out to happen, things in your Solaris dumpadm configuration and Veritas Volume Manager setup have to be constructed in a very specific way. Unfortunately, that very specific way is the default for a lot of shops that use Volume Manager to encapsulate the root disks on their systems. That, and having a default dumpadm.conf (or even a modified one that doesn't stray too far from convention), is all that's required. If this situation could only occur on the third Tuesday of every month, on alternate leap years during a "classic syzygy" (when the sun, moon and earth lie in a straight line with one another), I probably wouldn't be writing this post right now. So, in a way, it's a good thing, because I was initially going to make today's bullet-list about 50 ways you can save money, and avoid responsibility, by catching up on sleep ;)

So, after that lengthy preamble, here's a breakdown of one way to "break down" your Solaris system running VVM, step by step:

1. Ensure that your dumpadm.conf is "normal": This is a default /etc/dumpadm.conf (put together for us by Solaris during the installation process):

host # cat /etc/dumpadm.conf
# dumpadm.conf
# Configuration parameters for system crash dump.
# Do NOT edit this file by hand -- use dumpadm(1m) instead.

It looks good. According to the way things have always been done, it's just about perfect. Solaris, as far back as I can remember, has always recommended using your swap partition as your dump device if you enabled savecore (which you had to manually set up before Solaris 7), and the save directory is the default that it's always been (/var/crash/yourhostname).

NOTE: If you stray from this setup by changing your DUMPADM_DEVICE to another unused partition (preferably on a separate disk), you'll never experience this disaster (at least not for the same reasons...)

2. Encapsulate your root disk using Veritas Volume Manager: You may recall, from our past post on how to mirror your root disk using Veritas, that our setup went to some pains to make sure that the end product was "not" encapsulated. The reasons for this were different, but, if you followed those (or other similar) instructions, you won't be able to recreate this exact "magnificent failure" either.

3. Run "savecore -L /whatever/directory/you/want" on a machine in which the above two conditions exist.

And, that's it! Your results may vary, but general system panic is the most accurate way to describe the plethora of confusing hex error messages and resulting system crash that will most likely occur.
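If you'd rather find out whether a box meets the first condition without actually pulling the trigger, a read-only peek at the configuration is enough. The following is just a minimal sketch: the function name is made up for this post, and it assumes the file stores its settings as plain KEY=value lines with a DUMPADM_DEVICE entry. It never calls dumpadm or savecore.

```shell
# Hypothetical helper: print the DUMPADM_DEVICE value(s) found in a
# dumpadm.conf-style file (normally there is exactly one).
# Read-only: nothing here modifies the dump configuration.
dump_device_from_conf() {
    # Default to the system file when no path is given.
    conf_file=${1:-/etc/dumpadm.conf}
    # Comment lines start with '#', so they never match the pattern.
    sed -n 's/^DUMPADM_DEVICE=//p' "$conf_file"
}
```

On a live host, running "dump_device_from_conf" with no arguments should print whatever device the system currently has configured. If that turns out to be the swap slice on your root disk, condition one is met.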

Now, let's take a look at the reasons "why" this can happen:

1. If you're using Veritas Volume Manager and your root disk (assuming this is where your DUMPADM_DEVICE, or swap partition, resides) is "not" encapsulated, Solaris recognizes that the device it has listed as a dump location is already in use and can't serve as a dedicated dump device. The result is a refusal to run "savecore -L." Once the root disk "is" encapsulated, however, Solaris has no idea that the DUMPADM_DEVICE is being accessed by Veritas through a different logical device. Basically, Solaris checks the swap list and determines that the DUMPADM_DEVICE is not listed as a swap device (it "is" a swap device, but after encapsulation only the Volume is considered to be a swap device, and not the partition associated with it). Since the partition underlying the Veritas Volume is not listed as a swap device, it shows up as "dedicated." Therefore, your "savecore -L" command is processed as though all were well.

2. Actually, this is just an extension of point 1, but it's where the disaster happens. Since the DUMPADM_DEVICE points to the same region of disk where the Veritas swap Volume is located, running "savecore -L" can easily corrupt information stored in the swap partition and /tmp filesystem. It's virtually guaranteed to, since Solaris and Veritas are both acting independently of each other and reading/writing from the exact same place (both assuming they have exclusive access).
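The "looks dedicated, but isn't" trap described above can be roughed out as a pre-flight check. This is only a sketch under some loud assumptions: the function name and the three verdict strings are invented for this post, it expects the output of "swap -l" on standard input, and it treats any swap entry whose path begins with /dev/vx/dsk/ as a Veritas Volume. It can't prove that the Volume and the dump partition overlap. It only flags the case where you shouldn't trust the word "dedicated."

```shell
# Hypothetical pre-flight check. Usage: swap -l | classify_dump_device <dump-device>
# Prints one of: shared-with-swap, vxvm-swap-suspect, dedicated
# The old Solaris /usr/bin/awk lacks -v, so set AWK=nawk there.
classify_dump_device() {
    dump_dev=$1
    ${AWK:-awk} -v dev="$dump_dev" '
        NR == 1 && $1 == "swapfile"    { next }       # skip the header line
        $1 == dev                      { shared = 1 } # dump device is itself in the swap list
        index($1, "/dev/vx/dsk/") == 1 { vx = 1 }     # swap sits on a Veritas Volume (assumed prefix)
        END {
            if (shared)  print "shared-with-swap"     # the refusal case from point 1
            else if (vx) print "vxvm-swap-suspect"    # looks dedicated, may not be
            else         print "dedicated"
        }'
}
```

A verdict of "vxvm-swap-suspect" is the one to worry about: the raw partition isn't in the swap list, yet swap is living on a Volume that may well sit on top of it.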

This problem is highly prevalent in shops that use "defaults" and recommended "best practices" (e.g. the default /etc/dumpadm.conf on a Solaris box directly contradicts Veritas Volume Manager's preference for encapsulating the root disk). Fortunately, it is very easy to fix in one of a few ways:

1. Change your dump configuration so that the DUMPADM_DEVICE is on a disk other than the root disk (or its mirror, if you have that set up). As the header of /etc/dumpadm.conf itself asks, make the change with dumpadm rather than by editing the file by hand. /usr/sbin/dumpadm only runs when it's called, so that's all you have to do. Veritas (I mean Symantec) themselves recommend that you "never" run "savecore -L" on an encapsulated root disk.

2. Follow the instructions in our old post on mirroring your root disk with Volume Manager and dumpadm will be able to determine that its default DUMPADM_DEVICE can't be used, which will nudge you toward implementing the first solution if you really want to make use of this aspect of savecore's functionality.

...and then any of various combinations of the two (given the variety of site installations of Solaris and different ways folks like to set their disks up with VVM). All you really have to do is avoid just one of these specific conditions that will almost always cause the problem.
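For the first fix, here's a small dry-run sketch. It only prints the "dumpadm -d" command it would use to move the dump device, and it refuses any slice that sits on a disk you've named as the root disk or its mirror. The function name and the disk names in the example are placeholders, and matching on the "cXtYdZ" part of the device path is an assumption about how your disks are named.

```shell
# Hypothetical dry-run helper: print (never run) the dumpadm command that
# would move the dump device, refusing slices on the listed root/mirror disks.
# Usage: plan_dump_device_change <new-dump-device> <root-disk> [mirror-disk ...]
plan_dump_device_change() {
    new_dev=$1
    shift
    for disk in "$@"; do
        case $new_dev in
            */"$disk"s[0-9]*)
                # The candidate slice lives on a disk we were told to avoid.
                echo "refusing: $new_dev is on root disk $disk" >&2
                return 1
                ;;
        esac
    done
    # Safe by this (simple) test, so show the command for a human to run.
    echo "dumpadm -d $new_dev"
}
```

For example, "plan_dump_device_change /dev/dsk/c2t0d0s1 c0t0d0 c0t1d0" prints "dumpadm -d /dev/dsk/c2t0d0s1", while pointing it at a slice on c0t0d0 or c0t1d0 gets you a refusal instead.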

Here's to not spending our nights and weekends at work :)

, Mike
