Showing posts with label prom. Show all posts

Thursday, February 14, 2008

Extended Options For Solaris' Jumpstart Boot

Hey There,

Most Solaris Admins are familiar with the basic procedure for kicking off a JumpStart installation. Bare bones, it usually comes down to getting to the PROM and executing one of the following:

ok> boot cdrom - install

or

ok> boot net - install

There are, however, quite a few more options you can throw at even the older Solaris boot PROMs to kick off your JumpStart the way you like. Note that in the below examples, the "|" character is used to indicate an either-or option.

For instance, you can pull your JumpStart configuration directly off of the local hard disk with this option:

file://jumpstartDirectory/compressedConfigurationFile

or from an NFS server, like so:

nfs://serverName|IP/jumpstartDirectory/compressedConfigurationFile

Even an http server:

http://serverName|IP/jumpstartDirectory/compressedConfigurationFile

To flesh out the above examples to a certain degree, these would be the type of commands you would type at the PROM ok> prompt:

ok> boot cdrom - install file://jumpstartDirectory/compressedConfigurationFile
ok> boot net - install nfs://serverName/jumpstartDirectory/compressedConfigurationFile
etc...


In certain modes, you will boot from a tar file on your JumpStart server. Just make sure you've placed your sysidcfg inside your tar ball and try this:

ok> boot net - install http://192.168.0.1/jumpstart/config.tar

And, if you use an http server, you don't even need to go through all the hassle we explored in a past post about using JumpStart across multiple subnets, because you can add the "proxy" option to that argument. Much like our workaround for JumpStart across subnets, you need to specify the IP address of the proxy, and not the hostname, like so:

ok> boot net - install http://xyz.com/jumpstart/config.tar&proxy=192.168.0.1
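
If you find yourself typing these lines often, a small shell function can put them together for you before you head to the console. This is only a sketch: the function name jumpstart_boot_line is made up for illustration, and the checks simply mirror the three location types and the proxy option described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Solaris): compose the line you would
# type at the ok> prompt for a JumpStart install.
# usage: jumpstart_boot_line DEVICE LOCATION [PROXY_IP]
jumpstart_boot_line() {
    device=${1:-}     # cdrom or net
    location=${2:-}   # file://, nfs:// or http:// location of the config file
    proxy=${3:-}      # optional proxy IP address (http locations only)

    case $device in
        cdrom|net) ;;
        *) echo "device must be cdrom or net" >&2; return 1 ;;
    esac

    case $location in
        file://*|nfs://*|http://*) ;;
        *) echo "unsupported location: $location" >&2; return 1 ;;
    esac

    if [ -n "$proxy" ]; then
        case $location in
            http://*) location="$location&proxy=$proxy" ;;
            *) echo "the proxy option only applies to http locations" >&2; return 1 ;;
        esac
    fi

    echo "boot $device - install $location"
}

# Print the two http examples from this post.
jumpstart_boot_line net http://192.168.0.1/jumpstart/config.tar
jumpstart_boot_line net http://xyz.com/jumpstart/config.tar 192.168.0.1
```

The function only prints text. You still have to type (or paste) the result at the ok> prompt yourself.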

Hopefully, some of this trivia will prove useful to you at some point. Now there should be almost no way you can get out of having to JumpStart that new server ;)

Cheers,

, Mike




Friday, December 14, 2007

Why Horrible Sun Boot Problems Aren't Always All That Bad

I had an experience at work recently that had me shaking my head (and wishing I'd left for home a few minutes earlier ;) One of our v490 servers, which was already racked, cabled up and ready to have the OS built and put on the network the following day, decided it just wasn't going to boot up; not even to an ok> prompt!

Without the keyswitch set to run extended diagnostics, the situation looked pretty severe. This is about all I saw before it would power back down to nothing:

1:0>Waiting for master in slave_spin() CPU=0:0, timeout in 29 seconds...
2:0>Waiting for master in slave_spin() CPU=0:0, timeout in 29 seconds...
3:0>Waiting for master in slave_spin() CPU=0:0, timeout in 29 seconds...
1:0>
1:0>ERROR: TEST = Slave Spin
1:0>H/W under test = CPU, Motherboard/Centerplane, I/O board, (system init)
1:0>Repair Instructions: Replace items in order listed by 'H/W under test' above.
1:0>MSG = ERROR :Timeout waiting for master, doing re-config reset.
1:0>END_ERROR


And "nothing!" Anyway, as is our company's policy, I placed a call to Sun Support and their suggestion, as is suggested plainly by the error above, was to have a Field Engineer come out and replace the CPU boards (including the CPUs and memory - which is actually faster), and if that didn't work, replace the motherboard, the centerplane and the I/O board, progressively, until the error went away. You can see why I wasn't too happy, right? We're talking about a potential 10 extra hours of work doing parts replacement, followed by diagnostics, followed by possible extra parts orders, replacements, diagnostics, ad infinitum (if not ad nauseam ;)

Here's the kicker. After hooking up a laptop to the ALOM port, we started the system up with extended diagnostics. It wasn't looking much better. In fact, it gave a lot of confusing errors, like (and I'm paraphrasing here, because I stopped logging my diag output after a while):

FATAL ERRORS:
This version of v490/890 servers only support Ultra IV Processors
CPU's Online:
cpu #0 - Ultra IV 1500
cpu #2 - Ultra IV 1500


What?? That seemed contradictory to me. So we did what isn't generally a good idea (unless your machine appears to be in a state of complete ruination anyway) and pulled the plug, let it idle and powered it back on with the diagnostic keyswitch set. This time it gave us a little more information, and - lo and behold - in between the thousands of diagnostic messages (in between the FATAL ERRORS and the "slave_spin" errors) this line popped up:

OBP/Flash version 4.16.4 does not support part number ##### (Which happened to be the part number of both of our CPU boards).

This was great news! But how to fix it? Of course, replacing the centerplane (which, if you've ever done it - or even watched it being done - you'll know can be a painstaking and extended process) would fix the problem. On the v490 server, the OBP resides on the centerplane, so that was one option (If we'd followed Sun's advice, of course, we would have already gone through replacing both CPU boards and, possibly, the motherboard before getting to that point!)

Our system OBP/Flash version was 4.16.4, and for the 1500 CPU - Ultra IV CPU boards, we needed to be up to OBP/Flash version 4.18.1. Clearly the CPU boards had been put in the v490 without regard to whether or not they were actually compatible ;)

Our next step was to take an old CPU board and replace the two new ones with it (just to test) and, magically, the machine booted perfectly. None of the system components listed were in a state of failure, or on their way to failing. The 1350 CPU board we put in only required OBP/Flash version 4.15.6 to be supported, and our centerplane OBP exceeded that level.

Our options boiled down to, as we saw it then, installing the OS on disk while we had the one 1350 CPU board installed, downloading the latest OBP/Flash and installing it, and then shutting down and booting up with the two new 1500 CPU Ultra IV boards (While this was a perfectly workable solution, it seemed like there must be a faster way to do it). Net booting was also an option, but that would require modifying our net boot server and might also cause other unforeseen complications. We also didn't want to have to have Sun replace the centerplane, as this wasn't any more guaranteed to work than our system-install method.

We eventually ended up bringing a Sun FE on site and got the surprise of our lives (or at least our present days ;) Luckily, Sun FEs have a CD/DVD (So far as I know, it's been around for about a year and is only available to Sun personnel) called SUE (which stands for Sun Utility Environment - or something like that - I was sneaking peeks). This is a tool whose time came a long while ago. With it, the FE was able to boot us to the ok> prompt (using the 1350 CPU board) and run the OBP/Flash upgrade directly from CD!

That's stretching the truth somewhat - SUE actually creates a mini-boot environment in on-board memory and sets up a temporary alias so that you can reboot and upgrade the OBP/Flash. So, instead of having to install the OS, boot the machine into network mode, download the latest OBP/Flash and then reboot with the new flash file, like so (somewhat abbreviated):

init 0
ok> boot disk /flash-update-v490 <--- or whatever the OBP/Flash upgrade file was called.

We were able to update the OBP/Flash by just booting off of the SUE CD, picking the OBP/Flash upgrade from the list available on the CD and letting it do a:

reboot -- cdrom /flash-update-v490

That was a "huge" time savings! Hopefully, Sun will make this CD, or a CD utility like it, available to users (or, at least, contract holders) in the near future.

So, as it turned out, that absolutely horrible boot problem wasn't really all that bad. Rather than replacing every single piece of hardware on the system until we found the one that was bad, all we had to do was upgrade the OBP/Flash on the system!

Sometimes the most complicated problems have the simplest solutions :)

Best wishes,

, Mike





Tuesday, December 4, 2007

Solving Network Issues at the PROM Level

As you may recall, in a previous post we looked at how to disable Sun network devices at the PROM level. In this post we're going to look at the flipside, somewhat, of that coin.

This situation can occur quite frequently if you're building boxes on a network where the folks who run the switches and routers insist on "pinning" their ports to "100 full" or "1000 full" (speed - duplex) rather than allowing for auto-negotiation. While this is a simple problem to fix at the OS level (See previous post referred to above), it can become an issue when booting from the PROM.

And, yes, generally, even this issue is easily solved on a machine that's already set up. You can boot up in "100 half," bleeding packets by the truckload, and then set the correct parameters for the network device after logging on. However, if you're trying to install a machine over the net (using JumpStart, for instance) the out-of-the-gate setting of "100 half" on your network device most likely won't sustain an initial connection to your JumpStart server, much less an entire install.

Luckily, it is possible to set parameters for your network device at the PROM level. This has saved me more than a few times. There's nothing less enjoyable (unless you're just looking for some "quiet alone time" ;) than having to switch CD's at the console of machine after machine after machine, during an initial server build-out.

Setting up your network interfaces, to comply with the pinned network ports, is also very easy to do. It's so simple, I was surprised the first time I learned how to do it. Like most things with the PROM, exact spelling (characters alpha and numeric) is the difference between success and failure, so it's good to keep this info in your head (or in a crib-sheet). In another previous post we discussed the "sifting" command, which, unfortunately, won't help you much here since the main command we're looking for is "boot." Everything that makes this work is an option to that command.

You can set two properties for the "boot" PROM command that will make all the difference in the world. You can set the "speed" and you can set the "duplex." Oddly enough, those are almost always the exact two things that are giving you a headache ;) So, if you need to connect to your network at "100 full," instead of "boot net - install," you could type in the following:

boot net:speed=100,duplex=full - install <--- Substitute speed and duplex values to your liking.

It's as simple as that. Now you can boot from the PROM and know that your connection is going to be forced to the same speed and duplex that the network admins have pinned the ports to. In a future post, we'll go into how to determine if the "net" devalias isn't pointing to the proper network device in PROM, which devices are active and how to get around that issue.
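
Since exact spelling is everything at the PROM, here is a minimal sketch of a shell function that builds the line for you on another machine, so you can copy it down before walking to the console. The name net_boot_line is hypothetical, and the accepted values (10, 100 or 1000 for speed, half or full for duplex) are my assumption of the common settings; check what your network device actually supports.

```shell
#!/bin/sh
# Hypothetical helper: build a "boot net" line with forced speed and duplex.
# usage: net_boot_line SPEED DUPLEX
net_boot_line() {
    speed=${1:-}
    duplex=${2:-}

    # Assumed set of valid speeds; adjust for your hardware.
    case $speed in
        10|100|1000) ;;
        *) echo "speed must be 10, 100 or 1000" >&2; return 1 ;;
    esac

    case $duplex in
        half|full) ;;
        *) echo "duplex must be half or full" >&2; return 1 ;;
    esac

    echo "boot net:speed=$speed,duplex=$duplex - install"
}

# The "100 full" case from this post.
net_boot_line 100 full
```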

Barring any other problems, your network install (or just network boot) should now work perfectly :)

, Mike





Sunday, December 2, 2007

Sifting Through the PROM

Every once in a while, like most admins (I hope;), I'll find myself in a situation where I'm working on a downed Solaris box and stuck at the PROM level. For the purposes of this post, we'll assume that diagnostics over and above the PROM level are impossible (Which is sometimes true).

The issue here isn't that I'm stuck at the PROM. That's no big deal and happens often. The problem is, every so often, I'll find myself stuck at the PROM on a Sun system type I've never worked on before. Depending upon how varied the PROM interface and commands are for that system, digging into my old bag of tricks may have me drawing blanks left and right. The PROM "is" always very courteous about letting you know you got the command you typed in wrong, but it doesn't offer much in the way of help. For instance:

ok> show-me-the-money
show-me-the-money ?
ok>


The output from the PROM, letting you know it doesn't know what you're talking about, can vary also.

One of the most useful commands I've ever run across, at the PROM level, is called "sifting." Since you really have no help immediately available, and searching for particular commands in manuals or on the net can suck up huge amounts of time, this command is a life saver. Some folks refer to it as a "sifting dump" (the technical name), but the end result and usage are the same.

Best of all, the sifting command is incredibly easy to use (and the "one" command you should remember if you forget everything else ;). The command itself is incredibly simple. All it does is find all the OBP/PROM commands that contain the string you've specified as the only argument, and print them out to the screen. That's it!

So, even if you've got your PROM chops down, when life throws you that curveball, you can still find the solution easily. Consider that you're working on a 4500 series server that hasn't been rebooted in 3 years. On boot up, everything seems to be going fine and, suddenly, you're presented with a poorly worded error about clocks being out of sync. What??? Assuming, still, that you have no access to any reference material of any kind on any medium, you can still figure this one out using sifting.

To continue, once that error pops up, you get dropped back to the PROM. Subsequent attempts to just boot again and hope the problem goes away have equal results. At this point, you know two things from reading the error itself: One clock on the system is out of sync with another clock, and/or the opposite. But how to synchronize them, since that would seem to be all the boot process wants you to do? This is where sifting becomes invaluable. Just like grep, the smaller the search string you provide, the more results you'll get. In this instance, we could do:

sifting clock

and we'd get back, among a few other things that can easily be dismissed:

copy-clock-tod-to-io-boards
copy-io-board-tod-to-clock-tod


I usually opt to copy the system clock to the I/O boards (copy-clock-tod-to-io-boards). My idea of best practices dictates that I'd go about this like so:

setenv auto-boot? false
reset-all
copy-clock-tod-to-io-boards
setenv auto-boot? true
boot


And, magically, the "clocks out of sync" error is gone!

The little example I showed you above is just the beginning. You can use sifting to find any command you need to know the exact name of (the PROM is so unforgiving ;). Try "sifting probe" if you can't remember what that probe command was that you needed to run, or "sifting net" if you've forgotten the exact name of "show-nets" or "test-net" (notice the seemingly arbitrary difference; one is plural and the other is singular). You can run sifting against a single character if you want to. Just get ready to have plenty of options ;)
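
If you want a feel for how sifting behaves before you're stuck at a real ok prompt, the grep comparison can be sketched in a few lines of shell. This is only an imitation, not how the PROM implements it: the word list is just the handful of commands mentioned in these posts, and the function name sift is made up.

```shell
#!/bin/sh
# A tiny stand-in for the PROM's word list, using only commands
# named in these posts. A real PROM has many more.
prom_words='copy-clock-tod-to-io-boards
copy-io-board-tod-to-clock-tod
show-nets
test-net
probe-all
reset-all
banner'

# Hypothetical imitation of "sifting": print every word that contains
# the given string anywhere in its name (fixed string, not a regex).
sift() {
    printf '%s\n' "$prom_words" | grep -F -- "$1"
}

sift clock   # the two copy-*-tod commands
sift net     # show-nets and test-net
```

As with the real thing, the shorter the string, the longer the list you get back.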

Enjoy your exploration of the PROM commands available on your system of choice, and remember: sifting through the PROM will eventually lead you to the answer :)

, Mike





Wednesday, November 28, 2007

Disabling Network Devices in the Solaris Boot PROM

There are quite a few bugs out there on SunSolve, regarding system crashes due to this or that network device failure (or driver issues, depending on how you look at it). Most of the time, the advice is to disable the network device as non-destructively as possible. That can be difficult, given the right circumstances. Of course, I'm talking about the "non-destructive" part. Disabling an interface is easy. Lots of people who don't know the first thing about Solaris can do it, if they bang enough keys and have the proper system privilege ;)

The first thing to do is incredibly obvious. Just take out any references to the network device on the host. Say, if you are using ce0 and ce1 and no longer want to use ce1, you could just do the following and you'd be good from that point on and for future reboots:

ifconfig ce1 down
ifconfig ce1 unplumb
rm /etc/hostname.ce1
vi /etc/hosts <--- Optional, to remove ce1 host name/IP entry
vi /etc/netmasks <--- Optional, to remove ce1's netmask setting (unless it's on the same subnet as ce0!)
vi OTHER_FILES <--- Any pertinent files where you may have inserted special route commands, etc, that are no longer applicable

The second thing to look for would be OS operations you could perform (perhaps during boot up). As a "For Instance," certain Sun machines, running certain patch levels, have an issue with the hme0 network device (it's technically a device driver) if it's not connected to the network. Even if you aren't using it. This is somewhat annoying because, if you don't set up the configuration to plumb the network device and bring it up, it shouldn't give you any errors. You should only know hme0 exists by looking in a system file like /etc/path_to_inst. But the hme0 device causes the following error to post constantly during boot up and for a while after:

SUNW,hme0:Parallel detection fault

Working around this bug is fairly simple. In this case (and each case will probably be slightly different - troubleshooting can be long and hard sometimes), you could run the two ndd commands below, while logged in, to stop the activity immediately. The last line appends the equivalent setting to /etc/system so that the problem won't recur on reboot. It is strongly recommended to back up, or copy off, /etc/system before changing it (the cp line below does this), so you can use "boot -a" at the PROM level to boot using your old version if the new one causes your future boots to fail!

ndd -set /dev/hme instance 0
ndd -set /dev/hme adv_autoneg_cap 0
cp -p /etc/system /etc/system.orig <--- Back up the file first; the .orig name is just an example
echo "set hme:hme_adv_autoneg_cap=0" >>/etc/system


The third, and most drastic, way you'd go about this is to disable the problematic network device at the Solaris PROM level. Before you bring the machine down, run this (we're still using hme0 as an example) and note the output by copying and pasting it into Notepad, or even writing it down:

grep hme /etc/path_to_inst
"/pci@1f,4000/network@1,1" 0 "hme" <--- This is the device (instance 0 of hme, or hme0) that we want to disable!
"/pci@1f,4000/SUNW,hme@5,1" 1 "hme"

Assuming we've already executed something like "init 0" as the root user, we could do the following to disable hme0 from the PROM "ok" prompt. Note that if you run "show-nets" at the PROM level and see truncated information, use it to compare with the device you have listed from before and use the most similar (just slightly clipped) entry in your future arguments.
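
If you'd rather not squint at grep output while the machine is still up, the device path for a given driver and instance can be pulled out of /etc/path_to_inst with a little awk. A minimal sketch, assuming the three-column layout shown above; the function name device_path is made up, and the sample file only exists so the example runs anywhere. On Solaris itself you may need nawk rather than the old /usr/bin/awk for the -v option.

```shell
#!/bin/sh
# Hypothetical helper: print the physical device path for DRIVER INSTANCE.
# usage: device_path DRIVER INSTANCE [FILE]   (FILE defaults to /etc/path_to_inst)
device_path() {
    awk -v drv="\"$1\"" -v inst="$2" '
        # Column 3 is the quoted driver name, column 2 the instance number.
        $3 == drv && $2 == inst {
            gsub(/"/, "", $1)   # strip the surrounding quotes from the path
            print $1
        }
    ' "${3:-/etc/path_to_inst}"
}

# Demonstrate against a sample file holding the two lines from this post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"/pci@1f,4000/network@1,1" 0 "hme"
"/pci@1f,4000/SUNW,hme@5,1" 1 "hme"
EOF

device_path hme 0 "$sample"   # path for hme0
rm -f "$sample"
```

Write the result down before you run "init 0". You'll want it next to you when comparing against "show-nets".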

If none of the patterns match at all, you may have gotten the wrong info from /etc/path_to_inst or your device tree is screwed up beyond what we're specifically dealing with here today. At this point you should be absolutely certain you know which network device to disable at the PROM level. Now, run the following to make the PROM disable, and Solaris forget all about, hme0:

ok nvedit
0: probe-all install-console banner
1: " /pci@1f,4000/network@1" $delete-device drop
2:
ctrl-c <--- Typing the control key and the c key together will break you out of nvedit and put you back at the ok prompt
ok nvstore
ok setenv use-nvramrc? true
use-nvramrc? = true
ok reset-all <--- Make sure you run "setenv auto-boot? false" before this command, if you want your system to stay at the PROM after it resets.

Now, the hme0 network device should finally be completely disabled at the OS level. Solaris should not even know it exists! You may want to consider doing a reconfigure boot ("init 0" followed by "boot -r" at the PROM "ok" prompt or "reboot -- -r" from the OS -- there are a few more ways to do it, but I digress).

And, to answer the inevitable, and reasonable, question: How can I re-enable my network device on Solaris' PROM once I've disabled it?, here's how:

ok setenv use-nvramrc? false
ok nvedit
0: " /pci@1f,4000/network@1"
delete-device
ctrl-u <--- Hitting the control key and u key together will delete the current line from the nvedit buffer. You only need to do this for the device you previously wanted to ignore.
ctrl-u <--- You usually won't have to type this twice. This is just to demonstrate that you can erase as many lines in the buffer as you want before exiting your nvedit session. In this case, you must delete two lines since the device and delete-device instruction are on separate lines.
ctrl-c
ok nvstore
ok reset-all
ok boot -r


Hopefully this has saved you more headaches than it can potentially cause :)

, Mike