Tuesday, February 24, 2009

M4/5000 XSCF Unintentional Denial Of Service

Hey There,

I'm not sure if this is "standard" yet, since I think I found a "bug/feature" in the Sun M4000 this weekend and had a pretty constant relationship with their support department during the investigation. Except for the fact that I had to work, it was a pretty decent experience ;)

We had a very strange situation that completely crippled two of our M4000 servers. The XSCF cards on them experienced an "issue," which meant that both servers were, for all intents and purposes, out of commission. And I don't know if the phrase "for all intents and purposes" really does the situation justice. Those machines were completely inoperable, at least in any useful sense. On the M4000, if the XSCF card isn't working correctly (or, although this didn't happen to us, has been removed), you have no way to boot the server properly, much less get power to any of the domains it runs. Yow! So, even though this wasn't our issue, always be extra careful not to jiggle that card loose, since it's directly accessible on the back and can be jostled around using the squeeze-grip attached to the outside.

The trouble all started when we had to re-IP the XSCF cards on two servers, since we had been issued bad IPs in our network build sheet. We'd never done this before and, although it's a practically painless experience, it does lead to one of the ways you can cause this denial of service (purely by accident, of course - please don't do this on purpose, even though it would be very easy to. In fact, don't do it "because" it's so easy - no challenge - where's the reward? ;) Part of the process involved issuing the "rebootxscf" command which, with the cabling described in example 1 below, causes everything (or, at least, all platadm and most basic commands) to begin failing strangely: errors on commands that show up as okay in the audit logs, inability to power on domains, and so on.

And this is where it got interesting. We tested various methods of power-cycling, with this plugged into that and that plugged into this, and with the keyswitch in various positions. Through all of that, we could find only two ways to reproduce the error, the one that made it seem like our M4000s were completely toast and would need their XSCFs restored to defaults before we could access our domains again (domains which, hopefully, wouldn't get destroyed in the process, and didn't).

Here are the two ways to accomplish the meltdown (followed by an explanation of the "why" - made, I might add, possible by the helpful folks at Sun tech support - I'm not selling anything, just thinking how a lot of times people who do the right thing don't get the recognition they deserve. Murderers, psychopaths and hate groups couldn't buy the free publicity they get in this country ;):

1. Set up either one, or both, of the Ethernet controllers that come standard on the single XSCF card. Then, after you've booted up and configured a domain (basically, made your M4000 usable and useful), remove the serial cable from the pseudo serial port and replace it with a straight Ethernet cable (connected to the same subnet). This will not cause you any problems. Things will continue to work as normal, except for a few seconds after you insert the Ethernet cable in the pseudo serial port, when the XSCF readjusts to having an additional "path" to/from itself and the network. Now, ssh or telnet to the XSCF card and type:

XSCF> rebootxscf

You will be prompted that this will reset your system, which is (presumably) what you want. You'll note, within seconds, that your command prompt comes right back up and you never get disconnected. Now, even if you type something like "showdate," you'll get a "permission denied" error. Not to mention the fact that you can't run any other helpful commands like "showlogs" or "rebootxscf."

2. Do the same thing you did before and power cycle your M4000. The M4000 doesn't really have an off switch, so the proper way to do this is to put the keyswitch into maintenance (or service) mode (the picture of the wrench ;) and pull the plugs to both power supplies. This will work even if you just pull the power with no regard for the state of your keyswitch. The machine will never fully come up. The amber "check" LED will go blank, but the flashing green "XSCF" LED will never stop flashing (this would usually mean that it was busy initializing). You should be able to ssh or telnet to your XSCF card anyway, but the limitations on your usage will be as bizarre as in example 1. As another for-instance, you can create any user you want and give them all the administrative privileges you need to (assuming you have them already), but you can't do nslookup, can't power on, power off or connect to any domains and, as above, can't run "rebootxscf." All the restrictions, in both instances, are the same. There are so many goofy things to list, it would take a few pages.
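If you want to check for this state without eyeballing every prompt, the symptom in both examples boils down to "commands that should work come back with permission denied." Here is a minimal Python sketch of that check. It assumes you have already captured the text each command printed (say, pasted out of your ssh session). The function name, the dictionary layout and the sample output are all made up for illustration, and the match on the words "permission denied" is an assumption about the exact error text on your firmware.

```python
# Commands this post reports as failing once the XSCF is in the bad state.
SYMPTOM_COMMANDS = ("showdate", "showlogs", "rebootxscf")


def denied_commands(outputs):
    """Return the symptom commands whose captured output was a denial.

    `outputs` maps a command name to the text the XSCF printed for it.
    Assumption: the error text contains the words 'permission denied'
    (matched case-insensitively). Commands missing from `outputs` are
    treated as not denied.
    """
    denied = []
    for command in SYMPTOM_COMMANDS:
        text = outputs.get(command, "")
        if "permission denied" in text.lower():
            denied.append(command)
    return denied


if __name__ == "__main__":
    # Hypothetical captured output, not copied from a real session.
    captured = {
        "showdate": "Permission denied",
        "showlogs": "Permission denied",
        "rebootxscf": "Permission denied",
    }
    failing = denied_commands(captured)
    if failing:
        print("Possible stuck XSCF; denied commands:", ", ".join(failing))
```

If every command in the list comes back denied while you can still log in and create users, that matches the pattern described in examples 1 and 2.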

HELPFUL HINT: If you get stuck in this situation and don't have physical access to your box, you won't be able to run "showlogs." If you have auditadm privileges, though, you can run "viewaudit." It can be helpful, but be aware that it sometimes reports your commands as completing successfully even though you got "permission denied" on the console and nothing happened ;)
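Since the audit trail and the console can disagree, it helps to line the two up side by side. The sketch below is one way to do that in Python, assuming you note down by hand, for each command you tried, what the console printed and whether the audit record called it a success. It does not parse real viewaudit output. The `Attempt` record, its field names and the sample data are hypothetical.

```python
from typing import NamedTuple


class Attempt(NamedTuple):
    command: str          # what you typed at the XSCF prompt
    console_output: str   # what the prompt printed back
    audit_success: bool   # what you read off the audit record (entered by hand)


def audit_mismatches(attempts):
    """Return commands the audit log called successful but the console denied.

    Assumption: a denial on the console contains the words
    'permission denied' (matched case-insensitively).
    """
    return [
        attempt.command
        for attempt in attempts
        if attempt.audit_success
        and "permission denied" in attempt.console_output.lower()
    ]


if __name__ == "__main__":
    # Hypothetical notes from a troubleshooting session.
    notes = [
        Attempt("showdate", "Permission denied", True),
        Attempt("rebootxscf", "Permission denied", False),
    ]
    for command in audit_mismatches(notes):
        print("Audit says OK but console denied:", command)
```

Any command this flags is one where the audit log shouldn't be trusted as proof that something actually happened.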

Now, for the why as I understand it. The M8000/9000 series servers actually use the pseudo serial ports to connect primary and secondary (MASTER/SLAVE - Active/Standby) XSCF's for a redundant setup. Apparently, this functionality is going to be introduced into the M4000's. Actually, the possibility is already there (slots available, etc). So, in effect, what we did when we created this "crazy" situation was take advantage of a feature that isn't street-ready yet. Basically, once we attached the Ethernet cable to the pseudo serial port, we made the M4000 think it was the SLAVE XSCF in a redundant setup that doesn't (or maybe, at this point, CAN'T) exist. This knowledge pretty much explains all the insane error messages ;)

I hope this rundown, although drier than usual, is helpful to someone out there. Sunsolve, Sun support and Google couldn't find an answer for 3 days (Actually, I'm not sure if Sunsolve or Google have this yet - There are a few Sun FE's that know it now, though ;)

All in all, a bad experience that resulted in something good.

ONE LAST HELPFUL HINT: If you want to prevent this from ever accidentally happening (since it won't if the pseudo serial port isn't being used) one thing you can do is to plug a serial cable or dongle in the pseudo serial port. It doesn't have to be connected to anything and won't cause you any problems, but may discourage other folks from plugging anything else in there :)


Mike

