Friday, November 30, 2007

Cheap Batteries - CPU Boards, A1000's and a Bit More

Just a few random musings. Hopefully these will save you some money and some headache, like they did me.

1. If you ever have to replace the battery on a CPU board in a Sun Fire V480, V490, V880 or V890 server (and probably others), and you have pay-per-incident support with Sun (or a VAR), take some downtime on the server that's giving you a hard time (you may "have" to) and do the job yourself. We're assuming that you've either talked with Sun's phone support or used a search engine to find out what the cryptic error message in your log file means, and all signs point to replacing the CPU board's battery.

Once you've gotten your outage period approved, power down the system and pull it out of the rack, or out on its rack slides (however you can get to the side, so you can unscrew it on top and open it up on the right side). Pull out the CPU board that is giving you the errors and check the battery. You'll notice that it looks a lot like the battery from your grandmother's hearing aid. That's because it is. Jot down the information on the battery (some will even say "Panasonic" on them) and head down to the corner drug store. For about 10 bucks you can get the replacement battery you need, and maybe a few extra. You just saved the company a few hundred dollars (don't forget to put this on your annual review, unless you're doing this without anyone's knowledge because it isn't condoned ;)

2. The A1000 and D1000 disk arrays are pretty much obsolete now. I'm positive they're past their EOL ("End Of Life") for sales and OS upgrades/patches, but lots of folks still use them because they're big and sturdy and, for the most part, reliable. The batteries on these machines "expire" every two years. I put "expire" in quotes for one simple reason: it's just a term, and one that's misused about 98% of the time when it comes to the A1000's and D1000's.

The issue here is that the batteries actually last up to 3 years before Sun considers them bad. Sun has released a patch that will make it so that your A1000 or D1000 won't complain until the battery is 3 years old. (I don't have a direct link to it here, but you can ask Sun if you have paid support, and even if you don't, you might get lucky. If they won't tell you, ask the FE they send out to do a service call on your A1000.) These complaints are also programmed, and can be modified. That is to say, they don't rely on anything other than their own internal state (assuming no "real" errors are being issued by the batteries). The error message you'll generally see is "Battery age is between 720 days and 810 days," "Battery Age has Exceeded Specified Limit" or something like that. Not a big deal. You still have another year. If you pay for parts, you just saved the company even more money.
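
If you want to sanity-check the "another year" claim, the arithmetic is easy enough to do in the shell. This is only a sketch: days_left is a made-up helper (not part of any Sun tool), and it assumes a 365-day year and the 3-year limit described above.

```shell
#!/bin/sh
# days_left AGE_DAYS LIMIT_YEARS (hypothetical helper, not part of RaidManager)
# Prints how many days remain before a battery of AGE_DAYS reaches the limit,
# assuming 365-day years.
days_left() {
    echo $(( $2 * 365 - $1 ))
}

days_left 720 3   # prints 375
days_left 810 3   # prints 285
```

So a battery that has just started complaining at 720 days still has roughly a year to go against a 3-year limit.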

3. Sticking with the A1000's and D1000's: the batteries actually last longer than 3 years. In fact, you can say (with a fair amount of confidence) that the batteries are going to be okay until they throw an actual error, not a status report on their "age." As I mentioned above, these "error" messages about the age of the battery are determined by the RaidManager utility's knowledge of the battery's state. You can actually change this from the command line, as long as you're root, and make your batteries brand new again (In theory ;) like so:

xyz.com # raidutil -c c3t0d0s2 -B
Battery age is between 720 days and 810 days.
<--- Something to that effect. Sadly we don't use these anymore.
raidutil succeeded!

xyz.com # raidutil -c c3t0d0s2 -R
raidutil succeeded!

xyz.com # raidutil -c c3t0d0s2 -B
Battery is 0 Days Old
raidutil succeeded!
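
If you'd rather not eyeball that output on every array, the check-and-reset can be wrapped in a small script. A rough sketch, with the usual caveats: the output format is assumed to match the session above (which was typed from memory), and battery_age_days and reset_if_aged are names made up for this example, not part of RaidManager. Only the -c, -B and -R options come from the session above.

```shell
#!/bin/sh
# Sketch only: assumes "raidutil -B" prints a line like the ones shown above,
# e.g. "Battery age is between 720 days and 810 days." or "Battery is 0 Days Old".

# Print the first day count found in raidutil -B style output read from stdin.
battery_age_days() {
    sed -n 's/^Battery[^0-9]*\([0-9][0-9]*\) [Dd]ays.*/\1/p' | head -n 1
}

# reset_if_aged CONTROLLER LIMIT_DAYS (hypothetical helper)
# Resets the battery age counter only if the reported age has reached LIMIT_DAYS.
reset_if_aged() {
    ctrl=$1
    limit=$2
    age=$(raidutil -c "$ctrl" -B | battery_age_days)
    if [ -z "$age" ]; then
        echo "could not read battery age for $ctrl" >&2
        return 1
    fi
    if [ "$age" -ge "$limit" ]; then
        raidutil -c "$ctrl" -R && echo "reset: was $age days"
    else
        echo "left alone: $age days"
    fi
}
```

Calling reset_if_aged c3t0d0s2 720 would then reset the counter only once the array has started complaining, and leave a younger battery alone.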


Now you can wait for something to actually "happen" before replacing the battery. Arguments can be made that it's not a good idea to wait for a problem to happen when you can fix it pre-emptively, but when the batteries die on these devices, all you've really lost is your battery-backed write cache. If no one has noticed that the lack of cache has slowed down performance in the period before you find out your battery's dead, its use as a speed-up for read/write operations was never really all that important anyway. So, for whatever amount of time you can squeeze out of that battery, you've saved your company even more money.

And one last tip, even though some folks frown on this (Sun, the company, in particular; most Sun FE's are all right with it): It's perfectly okay to replace the battery on an A1000 while it's up and running. It's not, technically, hot-swappable like the disks, but you can remove the old battery, put in the new one, and be confident things will work out okay. I never crashed one by replacing the battery in the 4 or 5 years I worked on these devices. All you need to do is this (I promise I'm not lying):

Turn off caching, if it's enabled, with: raidutil -c TheDiskName -w off TheLunName
Remove the old battery
Replace it with the new battery
Run: raidutil -c TheDiskName -R
Turn caching back on, if you had to turn it off, with: raidutil -c TheDiskName -w on TheLunName
Wait about 15 minutes for the error state to clear.
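
For the checklist-minded, the steps above can be strung together in a small wrapper. This is a sketch, not a tested procedure: battery_swap and the RUN variable are made up for this example, it assumes write caching was enabled to begin with, and it only echoes the commands unless you set RUN to an empty string. The raidutil options are the ones listed in the steps above.

```shell
#!/bin/sh
# Dry run by default: the commands are echoed, not executed.
# Set RUN="" in the environment to run them for real.
RUN=${RUN-echo}

# battery_swap CONTROLLER LUN (hypothetical helper)
battery_swap() {
    ctrl=$1   # the device name you pass to "raidutil -c"
    lun=$2    # the LUN whose write cache gets turned off and back on

    # Step 1: turn write caching off before touching the battery.
    $RUN raidutil -c "$ctrl" -w off "$lun" || return 1

    # Steps 2 and 3: the physical swap is done by hand.
    printf 'Swap the battery now, then press Enter: ' >&2
    read -r reply

    # Step 4: reset the battery age counter.
    $RUN raidutil -c "$ctrl" -R || return 1

    # Step 5: turn write caching back on.
    $RUN raidutil -c "$ctrl" -w on "$lun" || return 1

    echo "Give the array about 15 minutes to clear the error state."
}
```

Running battery_swap c3t0d0s2 0 as-is just prints the three raidutil commands in order, which is a cheap way to double-check the names before doing it for real.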

No savings there; except maybe for your time ;)

Enjoy,

, Mike