Tuesday, February 3, 2009

JumpStart Symlinks And Solaris 10 Unix

Hey there,

Today's topic covers a little issue I ran into at work (which I actually do from time to time ;) that had me puzzled for a bit. If you're a grizzled Solaris/Slackware veteran like me, some of the newer features of Solaris 10 are pretty much lost on you until you absolutely "need" to understand them ;) Case in point, NFS sharing and JumpStart on Solaris 10 and/or ZFS (a point we seem to have overlooked in the onslaught of information in our post featuring Solaris' 5/08 Release Notes. Do a find for "NFS" - it's a realllllly long release note :)

The issue we had going on was that we'd gotten in the habit of running a JumpStart, and then sometime later (maybe days, weeks, who knows?) running another, etc; money's tight, not too many new machines to build and everything worked fine. Solaris 10 was doing great (That isn't to say that it didn't do great the entire time we had this problem ;) The issue was more of a fundamental misunderstanding and lack of knowledge on my part). Then, a few days ago, we had the time and the means, and I began a two-fisted JumpStart (Nothing violent, just concurrent installations ;)

Then, after I'd run through this a few more times than I'm proud to admit, I finally threw up my hands after yet another attempt at double-JumpStarting failed. The failure was very specific so I was fairly certain I was either doing something wrong or I was doing something wrong ;) Basically, both machines would boot up to the net and begin their JumpStart installations. All would go swimmingly; finding the JumpStart host, grabbing the correct profile, finish scripts and sysidcfg files. The bummer was that, after completing the full install of Solaris 10 (yes, it made us wait until after "all" of the software had been installed) it would get this funky error (hopefully, you've seen it before, and this post might just help you out :) :

Completed software installation

Solaris 10 software installation succeeded

Customizing system files
- Mount points table (/etc/vfstab)
- Unselected disk mount points (/var/sadm/system/data/vfstab.unselected)
- Network host addresses (/etc/hosts)

ERROR: Could not open file (/etc/hosts)

ERROR: Could not set up the remote host file (/etc/hosts)

ERROR: System installation failed
Solaris installation program exited.


????????????

Okay, so we were kind of stumped (well, totally stumped until we figured out the solution - by definition, I think ;) It turns out that one of our procedures before initiating the "boot net - install" portion of single-server JumpStarts was actually contributing to our confusion about what the problem really was. That's because we would always sync our local JumpStart server with the master first. Good practice, but (in this case) a bit of a diversion. Anyway, we'll get to why that mattered in a bit ;)

Investigation into the matter (which, after several failed installs, consisted of crashing the install in the mini-root to check the state of the JumpStart temporary mount configuration in real-time) revealed something interesting. If you look up the page a little (or just remember ;) the killer error we got was:

ERROR: Could not open file (/etc/hosts)

ERROR: Could not set up the remote host file (/etc/hosts)

ERROR: System installation failed


Looking at the state of the filesystem, after crashing at the point of error, revealed this directory structure in the temporarily mounted /etc filesystem (stripped down a bit for brevity's sake):

...
-r--r--r-- 1 root sys 99 Feb 2 16:48 hosts
...
-r--r--r-- 1 root sys 91 Feb 2 16:48 ipnodes
...
-r--r--r-- 1 root sys 384 Feb 2 16:48 netmasks
...


The reason that's interesting is that those files should have looked like this:

...
lrwxrwxrwx 1 root other 29 Feb 2 16:56 hosts -> ../../tmp/root/etc/inet/hosts
...
lrwxrwxrwx 1 root other 31 Feb 2 16:56 ipnodes -> ../../tmp/root/etc/inet/ipnodes
...
lrwxrwxrwx 1 root other 32 Feb 2 16:56 netmasks -> ../../tmp/root/etc/inet/netmasks
...


Essentially, Solaris 10 was converting special symlinked files into straight-up flat-files during the JumpStart process. Once that was complete (and the files were corrupted) it really "couldn't" open them up, because the real /etc/hosts file was supposed to live under /tmp/root/etc/inet (where the symlink points) and not in the local directory (which sits on JumpStart's supposedly read-only mini-root filesystem)!
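If you want to spot this kind of damage without crashing an install, something like the following might do. It's only a sketch: the function name check_miniroot_links is made up, the directory you pass in is wherever your mini-root's etc directory happens to live, and the three file names are simply the ones from the listings above.

```shell
# Hypothetical helper: report files in a mini-root etc directory that
# should be symlinks but have been replaced by regular files.
# Usage: check_miniroot_links /path/to/miniroot/etc
check_miniroot_links() {
    etc_dir=$1
    bad=0
    for name in hosts ipnodes netmasks; do
        if [ -h "$etc_dir/$name" ]; then
            # -h is true even when the link target does not exist on the
            # server, which is expected for links into /tmp/root.
            echo "ok: $name is a symlink"
        elif [ -f "$etc_dir/$name" ]; then
            echo "CORRUPT: $name is a regular file"
            bad=1
        else
            echo "missing: $name"
            bad=1
        fi
    done
    # Non-zero exit status means at least one file needs attention.
    return $bad
}
```

Run it against the mini-root on the JumpStart server after an install finishes; any "CORRUPT" line means that install was able to write back into the share.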

It turns out that this problem (as far as we were interested in figuring out) seems to manifest itself in Solaris 10 for the most part. It may happen in Solaris 9, but we can't go back now!!! :)

And the root cause was... drum roll, please, as I build up to feeling really stupid ;)

The JumpStart mini-root directory was being served up via NFS "read/write"! Doh! And (I'm bringing this back from up top, just as I promised) our procedure for doing JumpStarts had actually made this harder to see. Since we only did one JumpStart at a time, the initial JumpStart would work (even though it left behind a corrupted filesystem). And, per our procedure, right before we kicked off the next one, we'd sync that filesystem up with the known-good master JumpStart server. At no point was this filesystem corruption ever noticed due to the fact that we never had to jump more than one box at a time. Crazy ;)

Anyway, long story short, the quick and simple fix was to unshare the mini-root (/JumpStart/Sol10 for instance, or whatever yours might be) and then reshare it as "read only." If you've been doing this all along (which you should be ;), you'll never have our problem ...probably ;)

Depending upon how your system is set up, you can share NFS a number of ways in Solaris 10. After I ran "unshareall," I was a bit puzzled as to why /etc/dfs/dfstab was empty (See what I meant before? ;). Since I'm such a dinosaur, I wrote it off, figuring somebody had run the share command at the command line and forgot to put that command in an init script or the dfstab file. On many occasions, I would have been correct (and Solaris 10 does still support this type of NFS sharing). The cool thing here is that our JumpStart mini-root was living on a Solaris 10 ZFS dataset. So, all that had to be done to correct the issue (after I uncorrected my incorrect correction ;) was to adjust a property of the dataset in Solaris 10 (actually a very cool feature, I think :) Since the ZFS datasets have a "sharenfs" property built-in (set to "off" by default), all we had to do was to change that. A very simple command line to type and a very simple solution to a seemingly complicated issue:

host # zfs set sharenfs=ro,anon=0 maindg/jumpstart/Sol10
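To double-check the result, it can help to read the property back and make sure no read/write option slipped in. The helper below is just a sketch, and sharenfs_is_read_only is a made-up name. It only inspects the comma-separated option string, treating "on" (which shares read/write by default) and any rw option as not read-only. It doesn't ask NFS anything.

```shell
# Hypothetical helper: succeed only if a sharenfs value is read-only.
# It checks the option string; it does not query the NFS server.
sharenfs_is_read_only() {
    opts=$1
    has_ro=1
    old_ifs=$IFS
    IFS=,
    for opt in $opts; do
        case $opt in
            # Any read/write grant (or plain "on") fails the check.
            rw|rw=*|on) IFS=$old_ifs; return 1 ;;
            ro|ro=*) has_ro=0 ;;
        esac
    done
    IFS=$old_ifs
    # 0 if a read-only option was seen, 1 otherwise (e.g. "off").
    return $has_ro
}
```

On the JumpStart server, the value to check could come from "zfs get -H -o value sharenfs maindg/jumpstart/Sol10", which prints just the property value for the dataset.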

Problem solved. And only 3 or 4 hours wasted (I mean, well spent ;) Hopefully this pitiful little tale of woe will get you out of a similar jam sometime :)

Cheers,

Mike




