Saturday, December 15, 2007

Shutting Down Domains on Sun 6800/6900 Servers

Here's a little information that comes in handy every once in a while. As you get used to working with Sun's larger machines, it's easy to take some of the more advanced features for granted. It's common (at least for me) to forget, from time to time, that ServerA and ServerB are actually domains on the same physical server (Server1, for example).

That's one thing you get reminded of very quickly when, say, ServerB has a hardware-related problem and you need to fix it. When you're dealing with data-center-class machines, you generally don't want to make a mistake and accidentally pull a card that belongs to ServerA in your attempts to fix ServerB. The headache-multiplication theory is taken for granted, not to mention that companies generally throw lots of money at humongous hardware so they can house systems of "greater importance" on it. ServerA and ServerB, almost literally, translate into lost revenue when they go down. When you're stuck in this sort of situation, taking your time and doing things right (even if you need to take a gut-punch and "read the manual" ;) is always more important than trying to blast your way through it and hoping for the best.

Luckily, with the 6800/6900 server series from Sun, managing domains and working on a "single" machine in a multi-domain system is pretty simple as long as you take the necessary precautions (never be embarrassed to type "help"). Also, just so I can start typing 6800 instead of 6800/6900 from now on: the main difference between the two is the processor generation. The 6800s were built around UltraSPARC III CPUs, while the 6900s use UltraSPARC IV. You'll note that this is the difference with almost all of Sun's closely related server pairs (the v480 and v490, or the v880 and v890; all of them are just slightly different). To keep it simple ;)

As a for-instance, let's say that ServerB is suffering terrible failures (even Sun can't readily explain them). A number of HBAs on the I/O boards have failed and there's a problem with one of the System Boards. Also, your root mirror disk is giving off errors left and right. This is a potentially horrible scenario for which no resolution will be given. We're just using it as an excuse to walk through the process of bringing ServerB down completely and replacing parts.

The first thing you'll want to do is to connect to the System Controller. This can be done any number of ways. Your site should have documentation related to how they've set it up. Generally it will be an SSH or Telnet connection to the SC. You can also set up direct connects for the Domain Consoles, but, even when we have them, I prefer to connect to the SC, as you can get to all of the Domain Consoles from there, as well as the "Platform Shell"! Assuming you've connected, you'll be at a terminal screen that looks something like this.

System Controller 'Server1':

Type 0 for Platform Shell

Type 1 for domain A console
Type 2 for domain B console
Type 3 for domain C console
Type 4 for domain D console


Since this is also very specific to how your machine was set up, we'll go with the assumption that ServerA is on the "Domain A" console and ServerB is on the "Domain B" console. We want to work on ServerB and leave ServerA up and running while we do. Logging into the "Platform Shell" rather than going straight to the "Domain B" console may seem counter-intuitive at first, but it offers some enhanced control (when you get around to playing with it) and lets you connect to any domain directly from it. So we'll type in the following:

Input: 0

Platform Shell

Server1:SC>
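
If you ever keep a captured copy of that menu in your site documentation, a few lines of Python can turn it into a lookup table so nobody has to guess which number goes with which domain. This is only a sketch under the assumption that your menu lines follow the "Type N for ..." pattern shown above; MENU and menu_choices are hypothetical names of my own, and nothing here connects to a real System Controller.

```python
import re

# Menu text as shown above; your site's menu may differ.
MENU = """\
Type 0 for Platform Shell
Type 1 for domain A console
Type 2 for domain B console
Type 3 for domain C console
Type 4 for domain D console
"""


def menu_choices(menu_text):
    """Map each menu target (lowercased) to the number you would type."""
    choices = {}
    for line in menu_text.splitlines():
        match = re.match(r"\s*Type (\d+) for (.+?)\s*$", line)
        if match:
            choices[match.group(2).lower()] = int(match.group(1))
    return choices


choices = menu_choices(MENU)
print(choices["platform shell"])    # 0
print(choices["domain b console"])  # 2
```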


Now we're at the SC prompt, at the Platform Shell level -- Remember, at almost any point along the way you can type "help" to get a list of all available commands. When you get a chance, do so, and you'll see what I mean about the enhanced flexibility that starting off at the Platform Shell offers. To continue, we'll connect directly to the "Domain B" console, which is just like logging into a regular machine serial console:

Server1:SC> console b

Connected to Domain B

ServerB console login: root
Password: ******


And, just like on any other machine, we'll bring it down to an ok> prompt as if it weren't a part of a larger physical organism (Server1, the big 6800):

ServerB# init 0

You'll get the regular system messages and whatever else gets spit to the screen when you normally shut down, and you're there. Now, we'll want to switch from the "Domain Console" to the "Domain Shell." We can do that like so:

{c} ok

<---------- Here type a literal [ctl]+] (the control key and the right bracket (]) simultaneously) - this will get you to a Telnet or SSH prompt - depending on your setup. Then, you'll send a "break" signal to make the switch from Console to Shell.

telnet> send break

Domain Shell for Domain B - ServerB

ServerB:B> setkeyswitch off
<-- This command is the one that will "turn off" ServerB. Note that, if you've looked at the "help" output, you don't want to run "poweroff" here. That could seriously ruin your day ;) The "poweroff" command is used for powering off the physical grids. Generally, on a two-domain 6800, you'll only have one grid, so running "poweroff" might bring down both ServerB and ServerA. Sun only requires you to split your 6800 into 2 grids if you want to have 3 or 4 domains!

Powering boards off ...
ServerB:B>


Now your "virtual" server (ServerB) is off, and ServerA is still up and running as if nothing were going on. You're ready to begin replacing parts.
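
If you keep a runbook for this kind of work, the sequence above lends itself to a quick sanity check before anyone touches a keyboard. Here's a minimal Python sketch, purely illustrative: the step list is just the commands from this walkthrough, and the names (SHUTDOWN_PLAN, FORBIDDEN_COMMANDS, check_plan) are hypothetical helpers of my own, not anything that ships with the System Controller. It never talks to the SC; it only inspects text.

```python
# Hypothetical runbook check: nothing here talks to a real System Controller.
# Each step is (where you type it, what you type), taken from the walkthrough.
SHUTDOWN_PLAN = [
    ("Platform Shell", "console b"),
    ("ServerB root shell", "init 0"),
    ("telnet prompt", "send break"),
    ("Domain Shell B", "setkeyswitch off"),
]

# Commands that should not appear in a single-domain shutdown plan.
FORBIDDEN_COMMANDS = {"poweroff"}


def check_plan(plan):
    """Return a list of human-readable problems found in the plan."""
    problems = []
    commands = [command for _, command in plan]

    for where, command in plan:
        # Compare only the first word, so a command with arguments is caught too.
        words = command.split()
        if words and words[0] in FORBIDDEN_COMMANDS:
            problems.append(f"'{command}' at {where} can affect other domains")

    # The OS should be halted before the domain's boards are powered off.
    if "setkeyswitch off" in commands:
        if "init 0" not in commands:
            problems.append("no 'init 0' before 'setkeyswitch off'")
        elif commands.index("init 0") > commands.index("setkeyswitch off"):
            problems.append("'init 0' comes after 'setkeyswitch off'")

    return problems


print(check_plan(SHUTDOWN_PLAN))  # []
print(check_plan([("Domain Shell B", "poweroff")]))
```

An empty list means the plan matches the order used here. Anything else is a reason to stop and reread before typing.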

As a quick note: before you completely disconnect from the "Domain Shell," I always find it's good practice to run the following command:

ServerB:B> showboards

Slot     Pwr  Component Type  State     Status      Domain
----     ---  --------------  -----     ------      ------
/N0/SB1  Off  CPU Board       Assigned  Not tested  B
/N0/SB2  Off  CPU Board       Assigned  Not tested  B
/N0/SB3  Off  CPU Board       Assigned  Not tested  B
/N0/IB7  Off  PCI I/O Board   Assigned  Not tested  B
/N0/IB9  Off  PCI I/O Board   Assigned  Not tested  B


Write down the left-most column (Slot) and glance over the entries to make sure that they're all in the correct Domain (B, here) and that the "Pwr" (power) column lists them all as off. This will help make doubly sure you don't accidentally affect ServerA, as the CPU Board and I/O Board numbers are listed on the outsides of the devices and, if you've written this information down, you can refer to it and easily locate what part of the system you can safely work with.
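
If you capture the showboards output to a file (or paste it into a terminal), the same eyeball check can be sketched in a few lines of Python. Treat this as an illustration only. It assumes board lines look like the sample above, with the slot first, the power state second, and the domain last. parse_showboards and boards_to_double_check are hypothetical names of my own, and real output may include values this sketch doesn't anticipate.

```python
# Sample text copied from the showboards output above.
SHOWBOARDS_OUTPUT = """\
Slot Pwr Component Type State Status Domain
---- --- -------------- ----- ------ ------
/N0/SB1 Off CPU Board Assigned Not tested B
/N0/SB2 Off CPU Board Assigned Not tested B
/N0/SB3 Off CPU Board Assigned Not tested B
/N0/IB7 Off PCI I/O Board Assigned Not tested B
/N0/IB9 Off PCI I/O Board Assigned Not tested B
"""


def parse_showboards(text):
    """Pull slot, power state, and domain out of each board line."""
    boards = []
    for line in text.splitlines():
        fields = line.split()
        # Board lines start with a slot path such as /N0/SB1; skip headers.
        if len(fields) < 3 or not fields[0].startswith("/"):
            continue
        boards.append(
            {"slot": fields[0], "power": fields[1], "domain": fields[-1]}
        )
    return boards


def boards_to_double_check(boards, domain):
    """Return slots that are not powered off or belong to another domain."""
    return [
        board["slot"]
        for board in boards
        if board["power"].lower() != "off"
        or board["domain"].upper() != domain.upper()
    ]


boards = parse_showboards(SHOWBOARDS_OUTPUT)
print([board["slot"] for board in boards])
print(boards_to_double_check(boards, "B"))  # [] for the sample above
```

An empty result only tells you the listing is consistent. You still want the slot names written down before you open the cabinet.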

And, of course, for those of you who want to know how to get everything back up and running, here's a very quick summary of commands and output. The concepts are all the same, just done in logical reverse order, with the exception of the rarely needed "resume" command noted below:

System Controller 'Server1':

Type 0 for "Platform Shell"

Type 1 for domain A console
Type 2 for domain B console
Type 3 for domain C console
Type 4 for domain D console

Input: 0

Platform Shell

Server1:SC>

Server1:SC> console b

Connected to Domain B


<---------- Here type a literal [ctl]+]

telnet> send break

Domain Shell for Domain B - ServerB

ServerB:B> setkeyswitch on
Powering boards on ...
ServerB:B> resume
<--- Note that this command and the following are not usually necessary. Once you power on your system by doing the "setkeyswitch on," the 6800 will run through extensive system tests and boot the OS directly.

ok> boot

Hopefully the amount of time spent reading this will save you much, much more in the future :)

Cheers,

Mike