Showing posts with label showboards. Show all posts

Thursday, September 25, 2008

Replacing System Boards On Sun Mx000 Series Servers

Hey there,

Shifting gears again, today we're going to take a look at some hardware maintenance on Sun's (or, technically, Fujitsu's) new Mx000 series servers. At this point, I think there are only 4 variants available: the M4000, M5000, M8000 and M9000. The numbers roughly track how much "better" one is than the other, with the highest number being the best. (This is a subjective point, though. Depending on your needs, an M4000 may be much better for you than an M9000.)

I wanted to take a look at Dynamic Reconfiguration (DR) on the Mx000 series, and this seemed like as good an example as any. One thing to keep in mind is that you can't do this on Midrange servers since the replacement of that system board means replacing a motherboard unit (MBU), which can't be done on-the-fly. Why does this matter? I don't know; the M4000 - M9000 are all Enterprise servers that support the DR we're going to do. Just some trivia to keep it interesting ;)

The first thing you'll want to do is to log into the XSCF shell (akin to the Domain Shell or System Controller that we looked at in our old posts on working with Sun 6800 and 6900 Series Servers).

After that, you'll need to check the status of the domain with the "showdcl" command. You just need to pass it one option ( -d ) to identify the domain you want to check out. (Note the similarities to the 6800/6900 server DR operation. A lot of the commands are identical. That's the last time I'll refer back to those humongous machines. I promise :)

XSCF> showdcl -d 0
DID   LSB   XSB   Status
00                Running
      00    00-0
      01    01-0


Then, you'll need (or maybe just want) to check the status of the board that needs to be replaced. This can be done with the "showboards" command (so familiar, but I promised not to go there anymore ;).

It's important to note that, if the board itself doesn't support the DR board deletion command, you won't be able to use DR to replace it, even on an Enterprise system that supports DR. Disregarding the more eccentric problems that rarely happen (and are outside the scope of this post), the thing to look at here is the "Assignment" column. A board that shows as "Assigned", and meets some other criteria too Byzantine and awkward to expound upon here, can fit the definition of "doesn't support DR board deletion" mentioned above. You'll know for sure when the command fails (which is another good reason to take an outage no matter how "resilient" your hardware uptime solution is). This is a very easy problem to fix, however. All you usually need to do is add one step (to "unassign" the board) before the next one, and it will magically support DR board deletion :) We'll group it in with the next step, just to keep things neat and tidy :)

XSCF> showboards 01-0
XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
---- -------- ----------- ---- ---- ---- ------- --------
01-0 00(01)   Assigned    y    y    y    Passed  Normal


Now, we'll delete the system board using the following command:

XSCF> deleteboard -c disconnect 01-0


Note that if the board is Assigned and doesn't support DR, you'll need to run this variant of the "deleteboard" command before the one above (to unassign it). Note, also, that it doesn't hurt to do this even if the board "does" support DR:

XSCF> deleteboard -c unassign 01-0


No sweat :)
Now, you'll want to check the board's status with "showboards" again. (We're going to pretend that the "Assigned" status is OK, like it usually is, from now on.)

XSCF> showboards 01-0
XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
---- -------- ----------- ---- ---- ---- ------- --------
01-0 00(01)   Assigned    y    n    n    Passed  Normal


You'll notice now that the Conn (Connected) and Conf (Configured) columns are showing n (no). This is good, since it means you've (logically) deleted the board from the domain configuration.
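If you'd rather not eyeball those columns every time, here's a minimal Python sketch that parses a single "showboards" data row and reports whether the board has been logically removed. It assumes the eight whitespace-separated fields shown above; the names BoardStatus, parse_showboards_row and is_logically_removed are made up for this illustration and aren't part of XSCF:

```python
from typing import NamedTuple


class BoardStatus(NamedTuple):
    """One data row of Mx000 "showboards" output (hypothetical helper type)."""
    xsb: str
    did: str
    lsb: str
    assignment: str
    pwr: bool
    conn: bool
    conf: bool
    test: str
    fault: str


def parse_showboards_row(row: str) -> BoardStatus:
    """Parse a row such as '01-0 00(01) Assigned y n n Passed Normal'.

    Assumes exactly eight whitespace-separated fields, as in this post.
    """
    xsb, did_lsb, assignment, pwr, conn, conf, test, fault = row.split()
    did, lsb = did_lsb.rstrip(")").split("(")
    return BoardStatus(xsb, did, lsb, assignment,
                       pwr == "y", conn == "y", conf == "y", test, fault)


def is_logically_removed(board: BoardStatus) -> bool:
    """True when the board is neither connected nor configured."""
    return not board.conn and not board.conf


board = parse_showboards_row("01-0 00(01) Assigned y n n Passed Normal")
print(board.xsb, "logically removed:", is_logically_removed(board))
```

You'd have to capture the row from your own XSCF session and paste it in; the sketch doesn't connect to anything.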

Next, you'll need to get your hands dirty and physically replace the board. Actually, you probably won't if you've purchased Sun support (or wear a good sturdy pair of pleather gloves ;), since Sun won't let you touch it if you ever want them to come back out, at no additional charge, should something actually be "wrong" with the replacement board they send you. We won't go into the boring details of hot-replacing the board, since that's (again) outside the scope of this increasingly long post, and it should be performed by a Sun FE if you have no idea how to do it!

Once that's all over with, simply type

XSCF> replacefru


to complete the software part of replacing the "field replaceable unit," and check the status of the system board again. This time, also run:

XSCF> showboards -d 0


to ensure that all the system boards are still registered in the DCL (Domain Component List - basically a list of all the boards that make up the domain, which is domain 0 in our case today).
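As a rough illustration of that kind of check, the following Python sketch pulls the LSB-to-XSB pairs out of "showdcl"-style text (the layout shown near the top of this post) and reports any expected board that isn't registered. The function names and the sample text are assumptions for the example, not XSCF features:

```python
import re

# An XSB looks like "01-0": two digits, a dash, one digit (as seen in this post).
XSB_PATTERN = re.compile(r"^\d{2}-\d$")


def parse_dcl(output: str) -> dict[str, str]:
    """Return an LSB -> XSB mapping from showdcl-style text.

    Only lines made of exactly two fields, the second of which looks like
    an XSB, are treated as LSB rows; headers and the domain row are skipped.
    """
    mapping = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2 and XSB_PATTERN.match(fields[1]):
            lsb, xsb = fields
            mapping[lsb] = xsb
    return mapping


def missing_from_dcl(output: str, expected_xsbs: list[str]) -> list[str]:
    """List the expected XSBs that do not appear in the DCL text."""
    registered = set(parse_dcl(output).values())
    return [xsb for xsb in expected_xsbs if xsb not in registered]


sample = """DID LSB XSB Status
00 Running
00 00-0
01 01-0"""

print(parse_dcl(sample))
print("missing:", missing_from_dcl(sample, ["00-0", "01-0"]))
```

An empty "missing" list would mean every board you expected is still in the domain's list.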

If the system board configuration has changed (like the division type has changed from Uni to Quad for some reason... like you figured out a way to sneak in a system upgrade or something ;), you may need to run the "setupfru" command. You most likely won't, since you're replacing your board with another board that's exactly the same as the old board, except it works ;)

If the replacement system board isn't registered in the DCL, double check to make sure it hasn't assigned itself to a different domain (I've never seen this happen) using:

XSCF> showboards -v -a


In any event, since it's not in the DCL for your domain, you'll just need to add it back by running:

XSCF> setdcl -d 0 -a 1=01-0


The -d flag is for the domain, and the -a flag takes an lsb=xsb pair: the LSB number and the XSB it maps to (both listed in your "showboards" output).

Now, you should be on the road to all-the-way-good. But you should check and make sure, just in case:

XSCF> showboards 01-0
XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
---- -------- ----------- ---- ---- ---- ------- --------
01-0 00(01)   Assigned    y    n    n    Passed  Normal


Now, you'll want to check the status of the domain (basically to determine if you want to reboot it or not, which you don't, or you'll be directly contradicting everything DR stands for ;):

XSCF> showdcl -d 0
DID   LSB   XSB   Status
00                Running
      00    00-0
      01    01-0


and then, finally, you'll add the "new" board back to the domain and "configure" it, as well ("adding" will set the Conn column to y and "configuring" will set the Conf column to y):

XSCF> addboard -c configure -d 0 01-0


Then (and you're almost done - just being really cautious...) check the domain component list status again to make sure everything's cool:

XSCF> showdcl -d 0
DID   LSB   XSB   Status
00                Running
      00    00-0
      01    01-0


and run "showboards" on that new board to make sure everything is peachy. (The words Assigned, Passed and Normal, and a few letter y's, are excellent indicators that all is well :)

XSCF> showboards 01-0
XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
---- -------- ----------- ---- ---- ---- ------- --------
01-0 00(01)   Assigned    y    y    y    Passed  Normal


Congratulations! You've just completed your DR system board replacement on an M4000, 5000, 8000 or 9000. Now that you know how to do it, re-read these instructions and be amazed that it actually takes you longer to plod through this post than it does to do an actual board replacement ;)
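For reference, the whole sequence can be written down as a checklist. This Python sketch only builds the command strings used in this post, in order, for a given domain and XSB; it doesn't talk to the XSCF, it leaves out the optional unassign, setupfru and setdcl detours, and the function name replacement_checklist is hypothetical:

```python
def replacement_checklist(domain_id: int, xsb: str) -> list[tuple[str, str]]:
    """Return (command, purpose) pairs in the order used in this post."""
    return [
        (f"showdcl -d {domain_id}", "check the domain status"),
        (f"showboards {xsb}", "check the board to be replaced"),
        (f"deleteboard -c disconnect {xsb}", "logically remove the board"),
        (f"showboards {xsb}", "expect Conn and Conf to show n"),
        ("replacefru", "software side of the swap, once the board is replaced"),
        (f"showboards -d {domain_id}", "confirm the domain's boards are still there"),
        (f"showdcl -d {domain_id}", "confirm the domain is still running"),
        (f"addboard -c configure -d {domain_id} {xsb}", "connect and configure the new board"),
        (f"showdcl -d {domain_id}", "re-check the domain component list"),
        (f"showboards {xsb}", "expect Assigned, y y y, Passed, Normal"),
    ]


# Print a numbered checklist for domain 0 and board 01-0, as in the walkthrough.
for number, (command, purpose) in enumerate(replacement_checklist(0, "01-0"), start=1):
    print(f"{number:2d}. XSCF> {command:<35} # {purpose}")
```

Printing it out and ticking the steps off by hand is about as far as I'd take this; the physical swap in the middle still needs a human (preferably one wearing a Sun badge).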

For further perusal, enjoyment and possible confusion, check out The Official DR User's Guide For The Mx000 Series and The Mx000 Server Glossary. They're both fascinating reads that double as powerful sleep-aids ;)

Cheers,

, Mike





Saturday, February 9, 2008

Moving Boards Between Domains On Sun 6800/6900 Servers

Hey There,

For some light weekend fare, I thought I'd put together this quick little tutorial on the simple task of moving boards (CPU, I/O) between domains on Sun 6800/6900 servers. For our purposes, we'll assume that the boards are already installed in the physical system and that we want to move one CPU board and one I/O board from Domain C to Domain A.

The first thing you want to do is connect to the System Controller and then attach to the Platform Shell. This is basically necessary since you'll be working on two separate Domains. The Platform Shell will allow you to connect to each and go back and forth if you need to.

System Controller 'host-sc0':

Type 0 for Platform Shell

Type 1 for domain A console
Type 2 for domain B console
Type 3 for domain C console
Type 4 for domain D console

Input: 0

Platform Shell

host-sc0:SC>


Next, you'll want to show the board assignments, to make sure you're moving the right devices. You can do this simply with the showboards command below (output trimmed):

host-sc0:SC> showboards

Slot     Pwr  Component Type  State     Status      Domain
----     ---  --------------  -----     ------      ------
/N0/SB0  Off  CPU Board       Assigned  Not tested  C
/N0/SB1  Off  CPU Board       Assigned  Not tested  C
/N0/SB2  Off  CPU Board       Assigned  Not tested  A
/N0/SB3  Off  CPU Board       Assigned  Not tested  A
/N0/SB4  Off  CPU Board       Assigned  Not tested  C
/N0/SB5  Off  CPU Board       Assigned  Not tested  C
/N0/IB6  Off  PCI I/O Board   Assigned  Not tested  C
/N0/IB7  Off  PCI I/O Board   Assigned  Not tested  C
/N0/IB8  Off  PCI I/O Board   Assigned  Not tested  C
/N0/IB9  Off  PCI I/O Board   Assigned  Not tested  A


Now, assuming we want to move I/O Board 7 and CPU Board 1 from Domain C to Domain A, we can do so just as simply, using the deleteboard and addboard commands:

host-sc0:SC> deleteboard IB7
host-sc0:SC> deleteboard SB1
host-sc0:SC> addboard -d A IB7
host-sc0:SC> addboard -d A SB1


Then just double check :)

host-sc0:SC> showboards

Slot     Pwr  Component Type  State     Status      Domain
----     ---  --------------  -----     ------      ------
/N0/SB0  Off  CPU Board       Assigned  Not tested  C
/N0/SB1  Off  CPU Board       Assigned  Not tested  A
/N0/SB2  Off  CPU Board       Assigned  Not tested  A
/N0/SB3  Off  CPU Board       Assigned  Not tested  A
/N0/SB4  Off  CPU Board       Assigned  Not tested  C
/N0/SB5  Off  CPU Board       Assigned  Not tested  C
/N0/IB6  Off  PCI I/O Board   Assigned  Not tested  C
/N0/IB7  Off  PCI I/O Board   Assigned  Not tested  A
/N0/IB8  Off  PCI I/O Board   Assigned  Not tested  C
/N0/IB9  Off  PCI I/O Board   Assigned  Not tested  A


And repeat as necessary. When you boot up the separate domains (The logical machines) you'll enjoy the benefit of the new hardware. Easy peasy :)
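If you have more than a couple of boards to shuffle, the "repeat as necessary" part can be sketched in Python. This only generates the same deleteboard/addboard lines used above, for you to review and paste; the move_commands helper and its board-name check are assumptions for illustration, not something the System Controller provides:

```python
import re

# Assumed naming, based on the output in this post: SB = CPU board, IB = I/O board.
BOARD_NAME = re.compile(r"^(SB|IB)\d+$")


def move_commands(boards: list[str], target_domain: str) -> list[str]:
    """Build the deleteboard/addboard lines for moving boards to target_domain.

    All boards are deleted from their current domain first, then added to
    the target domain, mirroring the order used in the walkthrough.
    """
    for board in boards:
        if not BOARD_NAME.match(board):
            raise ValueError(f"unexpected board name: {board!r}")
    deletes = [f"deleteboard {board}" for board in boards]
    adds = [f"addboard -d {target_domain} {board}" for board in boards]
    return deletes + adds


for line in move_commands(["IB7", "SB1"], "A"):
    print(f"host-sc0:SC> {line}")
```

Rejecting anything that doesn't look like SB<n> or IB<n> is just a cheap guard against typos before the commands ever reach the Platform Shell.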

Cheers,

, Mike




Saturday, December 15, 2007

Shutting Down Domains on Sun 6800/6900 Servers

Here's a little information that comes in handy every once in a while. As you get used to working with Sun's larger machines, you start to take the convenience of some of the more advanced features for granted. It's common (at least for me) to forget, from time to time, that ServerA and ServerB actually reside on the same physical server (Server1, for example).

That's one thing you get reminded of very quickly when, say, ServerB has a hardware-related problem and you need to fix it. When you're dealing with datacenter-class machines, you generally don't want to make a mistake and accidentally pull a card that belongs to ServerA in your attempts to fix ServerB. The headache-multiplication theory is taken for granted, not to mention that companies generally throw lots of money at humongous hardware so they can house systems of "greater importance" on it. Downtime on ServerA and ServerB translates, almost literally, into lost revenue. When you're stuck in this sort of situation, taking your time and doing things right (even if you need to take a gut-punch and "read the manual" ;) is always more important than trying to blast your way through it and hoping for the best.

Luckily, when you're dealing with the 6800/6900 server series from Sun, working on a "single" machine on a multi-domain system is pretty simple as long as you take the necessary precautions (never be embarrassed to type "help"). Also, just so I can start typing 6800 instead of 6800/6900 from now on: the only real difference between the two is the internal architecture. The 6800s are SCSI-based, while the 6900s are fiber. You'll note that this is the difference with almost all of Sun's closely related server series (the v480 and v490, or the v880 and v890 - all of them are just slightly different). The 900s and 90s were released because internal fiber disk is much faster than internal SCSI-connected disk. To keep it simple ;)

As a for instance, let's say that ServerB is suffering terrible failures (even Sun can't readily explain them). A number of HBAs on the I/O boards have failed and there's a problem with one of the System Boards. Also, your root mirror disk is giving off errors left and right. This is a potentially horrible scenario for which no resolution will be given. We're just using it so we can walk through the process of bringing ServerB down completely and replacing parts.

The first thing you'll want to do is to connect to the System Controller. This can be done any number of ways. Your site should have documentation related to how they've set it up. Generally it will be an SSH or Telnet connection to the SC. You can also set up direct connects for the Domain Consoles, but, even when we have them, I prefer to connect to the SC, as you can get to all of the Domain Consoles from there, as well as the "Platform Shell"! Assuming you've connected, you'll be at a terminal screen that looks something like this.

System Controller 'Server1':

Type 0 for Platform Shell

Type 1 for domain A console
Type 2 for domain B console
Type 3 for domain C console
Type 4 for domain D console


Since this is also very specific to how your machine was set up, we'll go with the assumption that ServerA is on the "Domain A" console and ServerB is on the "Domain B" console. Since we want to work on ServerB, and leave ServerA up and running while we do, we'll type in the following. (This may seem counter-intuitive at first, but logging into the "Platform Shell" rather than the "Domain B" console offers some enhanced control, when you get around to playing with it, and allows you to connect to any domain directly from it.)

Input: 0

Platform Shell

Server1:SC>


Now we're at the SC prompt, at the Platform Shell level -- Remember, at almost any point along the way you can type "help" to get a list of all available commands. When you get a chance, do so, and you'll see what I mean about the enhanced flexibility that starting off at the Platform Shell offers. To continue, we'll connect directly to the "Domain B" console, which is just like logging into a regular machine serial console:

Server1:SC> console b

Connected to Domain B

ServerB console login: root
Password: ******


And, just like on any other machine, we'll bring it down to an ok> prompt as if it weren't a part of a larger physical organism (Server1 - the big 6800)

ServerB# init 0

You'll get the regular system messages and whatever else gets spit to the screen when you normally shut down, and you're there. Now, we'll want to switch from the "Domain Console" to the "Domain Shell." We can do that like so:

{c} ok

<---------- Here type a literal [ctl]+] (the control key and the right bracket (]) simultaneously) - this will get you to a Telnet or SSH prompt - depending on your setup. Then, you'll send a "break" signal to make the switch from Console to Shell.

telnet> send break

Domain Shell for Domain B - ServerB

ServerB:B> setkeyswitch off
<-- This command is the one that will "turn off" ServerB. Note that, if you've looked at the "help" output, you don't want to run "poweroff" - That could seriously ruin your day ;) The "poweroff" command is used for powering off the physical grids. Generally, on a two domain 6800, you'll only have one, so running "poweroff" might bring down both ServerB and ServerA. Sun only requires you to split your 6800 into 2 grids if you want to have 3 or 4 domains!

Powering boards off ...
ServerB:B>


Now your "virtual" server (ServerB) is off, and ServerA is still up and running as if nothing were going on. You're ready to begin replacing parts.

As a quick note: before you completely disconnect from the "Domain Shell," I always find it's good practice to run the following command:

ServerB:B> showboards

Slot     Pwr  Component Type  State     Status      Domain
----     ---  --------------  -----     ------      ------
/N0/SB1  Off  CPU Board       Assigned  Not tested  B
/N0/SB2  Off  CPU Board       Assigned  Not tested  B
/N0/SB3  Off  CPU Board       Assigned  Not tested  B
/N0/IB7  Off  PCI I/O Board   Assigned  Not tested  B
/N0/IB9  Off  PCI I/O Board   Assigned  Not tested  B


Write down the left-most column (Slot) and glance over the entries to make sure that they're all in the correct Domain (B, here) and that the "Pwr" (power) column lists them all as off. This will help make doubly sure you don't accidentally affect ServerA, as the CPU Board and I/O Board numbers are listed on the outsides of the devices and, if you've written this information down, you can refer to it and easily locate what part of the system you can safely work with.
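If you capture that output to a file instead of a notepad, the same sanity check can be sketched in Python. This is only an illustration under a few assumptions: the row layout matches the output above, the State column is one of Assigned, Available or Active, and the helper name slots_safe_to_pull is made up for this example:

```python
import re

# One data row: slot, power, component type, state, status, domain.
# Component type and status may contain spaces, so the state keyword
# (assumed to be one of three values) is used as the anchor between them.
ROW = re.compile(
    r"^(?P<slot>/\S+)\s+(?P<pwr>On|Off)\s+(?P<component>.+?)\s+"
    r"(?P<state>Assigned|Available|Active)\s+(?P<status>.+?)\s+(?P<domain>\S+)$"
)


def slots_safe_to_pull(output: str, domain: str) -> list[str]:
    """Return the slots to write down, or raise if anything looks unsafe.

    A row is unsafe if it belongs to another domain or is not powered off.
    Header and separator lines are ignored because they don't match ROW.
    """
    slots = []
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        if match["domain"] != domain:
            raise ValueError(f"{match['slot']} belongs to domain {match['domain']}")
        if match["pwr"] != "Off":
            raise ValueError(f"{match['slot']} is still powered on")
        slots.append(match["slot"])
    return slots


sample = """Slot Pwr Component Type State Status Domain
---- --- -------------- ----- ------ ------
/N0/SB1 Off CPU Board Assigned Not tested B
/N0/IB7 Off PCI I/O Board Assigned Not tested B"""

print(slots_safe_to_pull(sample, "B"))
```

Raising an error rather than quietly skipping a suspicious row is deliberate: if anything in the list belongs to ServerA, you want to stop and look before touching hardware.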

And, of course, for those of you who want to know how to get everything back up and running, just do the following. (This is a very quick summary of commands and output, as the concepts are all the same but done in logical reverse order, with the exception of the rarely needed "resume" command noted below.)

System Controller 'Server1':

Type 0 for "Platform Shell"

Type 1 for domain A console
Type 2 for domain B console
Type 3 for domain C console
Type 4 for domain D console

Input: 0

Platform Shell

Server1:SC>

Server1:SC> console b

Connected to Domain B


<---------- Here type a literal [ctl]+ ]

telnet> send break

Domain Shell for Domain B - ServerB

ServerB:B> setkeyswitch on
Powering boards on ...
ServerB:B> resume
<--- Note that this command and the following are not usually necessary. Once you power on your system by doing the "setkeyswitch on," the 6800 will run through extensive system tests and boot the OS directly.

ok> boot

Hopefully the amount of time spent reading this will save you much much more in the future :)

Cheers,

, Mike