Friday, December 14, 2007

Why Horrible Sun Boot Problems Aren't Always All That Bad

I had an experience at work recently that had me shaking my head (and wishing I'd left for home a few minutes earlier ;) One of our v490 servers, which was already racked, cabled up and ready to have the OS built and put on the network the following day, decided it just wasn't going to boot up; not even to an ok> prompt!

Without the keyswitch set to run extended diagnostics, the situation looked pretty severe. This is about all I saw before it would power back down to nothing:

1:0>Waiting for master in slave_spin() CPU=0:0, timeout in 29 seconds...
2:0>Waiting for master in slave_spin() CPU=0:0, timeout in 29 seconds...
3:0>Waiting for master in slave_spin() CPU=0:0, timeout in 29 seconds...
1:0>
1:0>ERROR: TEST = Slave Spin
1:0>H/W under test = CPU, Motherboard/Centerplane, I/O board, (system init)
1:0>Repair Instructions: Replace items in order listed by 'H/W under test' above.
1:0>MSG = ERROR :Timeout waiting for master, doing re-config reset.
1:0>END_ERROR


And "nothing!" Anyway, as is our company's policy, I placed a call to Sun Support. Their suggestion, which the error above states plainly, was to have a Field Engineer come out and replace the CPU boards (including the CPUs and memory, which is actually faster), and if that didn't work, replace the motherboard, the centerplane and the I/O board, progressively, until the error went away. You can see why I wasn't too happy, right? We're talking about a potential 10 extra hours of work doing parts replacement, followed by diagnostics, followed by possible extra parts orders, replacements, diagnostics, ad infinitum (if not ad nauseam ;)

Here's the kicker. After hooking up a laptop to the ALOM port, we started the system up with extended diagnostics. It wasn't looking much better. In fact, it gave a lot of confusing errors, like (and I'm paraphrasing here, because I stopped logging my diag output after a while):

FATAL ERRORS:
This version of v490/890 servers only support Ultra IV Processors
CPU's Online:
cpu #0 - Ultra IV 1500
cpu #2 - Ultra IV 1500


What?? That seemed contradictory to me. So we did what isn't generally a good idea (unless your machine appears to be in a state of complete ruination anyway) and pulled the plug, let it idle and powered it back on with the diagnostic keyswitch set. This time it gave us a little more information, and - lo and behold - in between the thousands of diagnostic messages (in between the FATAL ERRORS and the "slave_spin" errors) this line popped up:

OBP/Flash version 4.16.4 does not support part number #####

(The part number it named happened to be that of both of our CPU boards.)
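If you capture the console output to a file, a few lines of script can dig that one message out of the noise for you. What follows is only a rough sketch: the pattern is modelled on my paraphrased message above (the exact wording on your system may differ), and the function name is made up for illustration.

```python
import re

# Hypothetical pattern based on the paraphrased message in this post.
# The real diagnostic wording may differ, so adjust it to your own capture.
UNSUPPORTED_PART = re.compile(
    r"OBP/Flash version (?P<obp>\d+(?:\.\d+)*) "
    r"does not support part number (?P<part>\S+)"
)


def find_unsupported_parts(log_lines):
    """Return (obp_version, part_number) pairs found in captured output."""
    hits = []
    for line in log_lines:
        match = UNSUPPORTED_PART.search(line)
        if match:
            hits.append((match.group("obp"), match.group("part")))
    return hits


if __name__ == "__main__":
    sample = [
        "1:0>Waiting for master in slave_spin() CPU=0:0, timeout in 29 seconds...",
        "OBP/Flash version 4.16.4 does not support part number #####",
        "1:0>ERROR: TEST = Slave Spin",
    ]
    for obp, part in find_unsupported_parts(sample):
        print(f"OBP {obp} rejects part {part}")
```

Pointing it at a saved log is then just a matter of passing the open file object, since iterating a file yields its lines.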

This was great news! But how to fix it? On the v490 server, the OBP resides on the centerplane, so replacing the centerplane was one option that would fix the problem (and if you've ever done it, or even watched it being done, you understand that it can be a painstaking and extended process). Of course, if we'd followed Sun's advice, we would have already gone through replacing both CPU boards and, possibly, the motherboard before getting to that point!

Our system OBP/Flash version was 4.16.4, and for the Ultra IV 1500 CPU boards, we needed to be up to OBP/Flash version 4.18.1. Clearly the CPU boards had been put in the v490 without regard to whether or not they were actually compatible ;)

Our next step was to take an old CPU board and replace the two new ones with it (just to test) and, magically, the machine booted perfectly. None of the system components listed were in a state of failure, or on their way to failing. The 1350 CPU board we put in only required OBP/Flash version 4.15.6 to be supported, and our centerplane OBP exceeded that level.
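The compatibility check itself is simple enough to sketch in a few lines. This is just an illustration using the version numbers from this post; the function names and the board labels in the dictionary are my own invention, not anything Sun ships.

```python
def parse_version(text):
    """Turn a dotted version such as '4.16.4' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def obp_supports(installed, required):
    """True if the installed OBP/Flash level meets the board's minimum."""
    return parse_version(installed) >= parse_version(required)


# Minimum OBP/Flash levels as quoted in this post.
# The keys are illustrative labels, not official part numbers.
MIN_OBP = {
    "1500 CPU board": "4.18.1",
    "1350 CPU board": "4.15.6",
}

if __name__ == "__main__":
    installed = "4.16.4"
    for board, required in MIN_OBP.items():
        verdict = "supported" if obp_supports(installed, required) else "NOT supported"
        print(f"{board}: needs {required}, have {installed} -> {verdict}")
```

The versions are compared as tuples of integers rather than as strings, because a plain string comparison would rank "4.9" above "4.16".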

As we saw it then, our options boiled down to installing the OS on disk while we had the one 1350 CPU board installed, downloading the latest OBP/Flash and installing it, and then shutting down and booting up with the two new 1500 CPU Ultra IV boards. While this was a perfectly workable solution, it seemed like there must be a faster way to do it. Net booting was also an option, but that would require modifying our net boot server and might also cause other unforeseen complications. We also didn't want to have Sun replace the centerplane, as this wasn't any more guaranteed to work than our system-install method.

We eventually ended up bringing a Sun FE on site and got the surprise of our lives (or at least our present days ;) Luckily, Sun FEs have a CD/DVD called SUE (which stands for Sun Utility Environment - or something like that - I was sneaking peeks). So far as I know, it's been around for about a year and is only available to Sun personnel. This is a tool whose time came a long while ago. With it, the FE was able to boot us to the ok> prompt (using the 1350 CPU board) and run the OBP/Flash upgrade directly from CD!

That's stretching the truth somewhat - SUE actually creates a mini-boot environment in on-board memory and sets up a temporary alias so that you can reboot and upgrade the OBP/Flash. Without it, we would have had to install the OS, boot the machine into network mode, download the latest OBP/Flash and then reboot with the new flash file, like so (somewhat abbreviated):

init 0
ok> boot disk /flash-update-v490
<--- or whatever the OBP/Flash upgrade file was called.

We were able to update the OBP/Flash by just booting off of the SUE CD, picking the OBP/Flash upgrade from the list available on the CD and letting it do a:

reboot -- cdrom /flash-update-v490

That was a "huge" time savings! Hopefully, Sun will make this CD, or a CD utility like it, available to users (or, at least, contract holders) in the near future.

So, as it turned out, that absolutely horrible boot problem wasn't really all that bad. Rather than replacing every single piece of hardware on the system until we found the one that was bad, all we had to do was upgrade the OBP/Flash on the system!

Sometimes the most complicated problems have the simplest solutions :)

Best wishes,

Mike