Wednesday, November 28, 2007

Disabling Network Devices in the Solaris Boot PROM

There are quite a few bugs out there on SunSolve regarding system crashes due to this or that network device failure (or driver issue, depending on how you look at it). Most of the time, the advice is to disable the network device as non-destructively as possible. The "non-destructive" part is what can be difficult, given the right circumstances. Disabling an interface is easy. Lots of people who don't know the first thing about Solaris can do it, if they bang enough keys and have the proper system privilege ;)

The first thing to do is incredibly obvious. Just take out any references to the network device on the host. Say, if you are using ce0 and ce1 and no longer want to use ce1, you could just do the following and you'd be good from that point on and for future reboots:

ifconfig ce1 down
ifconfig ce1 unplumb
rm /etc/hostname.ce1
vi /etc/hosts <--- Optional, to remove ce1's host name/IP entry
vi /etc/netmasks <--- Optional, to remove ce1's netmask setting (unless it's on the same subnet as ce0!)
vi OTHER_FILES <--- Any pertinent files where you may have inserted special route commands, etc., that are no longer applicable
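If you'd like a safety net before deleting or editing anything, a minimal sketch like the one below copies the relevant files aside first. The function name backup_if_config, its arguments, and the backup location are made up for illustration and are not part of Solaris.

```shell
#!/bin/sh
# Hypothetical helper (not a Solaris tool): copy an interface's config files
# aside before removing or editing them.
#   $1 = interface name (e.g. ce1)
#   $2 = backup directory (created if missing)
#   $3 = config directory (assumed to default to /etc)
backup_if_config() {
    ifname=$1
    backup_dir=$2
    etc_dir=${3:-/etc}

    mkdir -p "$backup_dir" || return 1

    # hostname.<if> only exists if the interface was configured at boot,
    # so skip any file that isn't there instead of failing.
    for f in "hostname.$ifname" hosts netmasks; do
        if [ -f "$etc_dir/$f" ]; then
            cp -p "$etc_dir/$f" "$backup_dir/$f" || return 1
        fi
    done
}
```

Running something like "backup_if_config ce1 /var/tmp/ce1-backup" as root before the rm and vi steps would leave you copies to restore from if you change your mind.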

The second thing to look for would be OS operations you could perform (perhaps during boot up). For instance, certain Sun machines, running certain patch levels, have an issue with the hme0 network device (technically, hme is the device driver) if it's not connected to the network, even if you aren't using it. This is somewhat annoying because, if you never set up the configuration to plumb the network device and bring it up, it shouldn't give you any errors. You should only know hme0 exists by looking in a system file like /etc/path_to_inst. But the hme0 device causes the following error to post constantly during boot up and for a while after:

SUNW,hme0:Parallel detection fault

Working around this bug is fairly simple. In this case (and each case will probably be slightly different - troubleshooting can be long and hard sometimes), you could run the first two lines below, while logged in, to stop the activity immediately. The third line adds the setting to /etc/system so that the problem doesn't recur on reboot. It is strongly recommended to back up, or copy off, /etc/system before changing it, so you can use "boot -a" at the PROM level to boot using your old version if the new one causes your future boots to fail!

ndd -set /dev/hme instance 0
ndd -set /dev/hme adv_autoneg_cap 0
echo "set hme:hme_adv_autoneg_cap=0" >> /etc/system
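Since a bad /etc/system can keep the box from booting, here is a hedged sketch that makes the backup and the append one step, and doesn't add the same line twice. The function name add_system_setting and the ".orig" backup suffix are my own inventions. It also assumes a POSIX grep that understands -F and -x, which on older Solaris may mean /usr/xpg4/bin/grep.

```shell
#!/bin/sh
# Hypothetical helper (not a Solaris tool): back up a system file once, then
# append a setting only if that exact line isn't already present.
#   $1 = file to change (e.g. /etc/system)
#   $2 = exact line to add
add_system_setting() {
    sysfile=$1
    setting=$2

    # Keep the first pristine copy to fall back on with "boot -a";
    # don't overwrite it on later runs.
    if [ ! -f "$sysfile.orig" ]; then
        cp -p "$sysfile" "$sysfile.orig" || return 1
    fi

    # -F: fixed string, -x: whole-line match (assumes a POSIX grep).
    if grep -F -x "$setting" "$sysfile" >/dev/null 2>&1; then
        return 0
    fi

    echo "$setting" >> "$sysfile"
}
```

With that defined, "add_system_setting /etc/system 'set hme:hme_adv_autoneg_cap=0'" would do the same job as the last line above, and leave /etc/system.orig behind for "boot -a".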


The third, and most drastic, way to go about this is to disable the problematic network device at the Solaris PROM level. Before you bring the machine down, run this (we're still using hme0 as an example) and note the output by copying and pasting it into Notepad, or even writing it down:

grep hme /etc/path_to_inst
"/pci@1f,4000/network@1,1" 0 "hme" <--- This is the device (instance 0 of hme, or hme0) that we want to disable!
"/pci@1f,4000/SUNW,hme@5,1" 1 "hme"

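If the grep output is long, a small sketch like this can pull out just the path for one driver instance. The function name device_path_for is hypothetical, and it assumes an awk that accepts -v (nawk or /usr/xpg4/bin/awk on Solaris, rather than the old /usr/bin/awk).

```shell
#!/bin/sh
# Hypothetical helper (not a Solaris tool): print the physical device path
# for one driver instance from a path_to_inst-style file.
#   $1 = driver name (e.g. hme)
#   $2 = instance number (e.g. 0)
#   $3 = file to read (assumed to default to /etc/path_to_inst)
device_path_for() {
    awk -v drv="$1" -v inst="$2" '
        # Strip the double quotes; changing the record re-splits the fields.
        { gsub(/"/, "") }
        # Fields are: physical path, instance number, driver name.
        $3 == drv && $2 == inst { print $1 }
    ' "${3:-/etc/path_to_inst}"
}
```

Against the two lines shown above, "device_path_for hme 0" would print /pci@1f,4000/network@1,1, which is the path to compare with what the PROM shows you.
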
Assuming we've already executed something like "init 0" as the root user, we could do the following to disable hme0 from the PROM "ok" prompt. Note that if you run "show-nets" at the PROM level and see truncated information, use it to compare with the device you have listed from before and use the most similar (just slightly clipped) entry in your future arguments.

If none of the patterns match at all, you may have gotten the wrong info from /etc/path_to_inst or your device tree is screwed up beyond what we're specifically dealing with here today. At this point you should be absolutely certain you know which network device to disable at the PROM level. Now, run the following to make the PROM disable, and Solaris forget all about, hme0:

ok nvedit
0: probe-all install-console banner
1: " /pci@1f,4000/network@1" $delete-device drop
2:
ctrl-c <--- Typing the control key and the c key together will break you out of nvedit and put you back at the ok prompt
ok nvstore
ok setenv use-nvramrc? true
use-nvramrc? = true
ok reset-all <--- Make sure you run "setenv auto-boot? false" before you run this command, if you want your system to stay at the PROM after it resets.

Now, the hme0 network device should finally be completely disabled at the OS level. Solaris should not even know it exists! You may want to consider doing a reconfigure boot ("init 0" followed by "boot -r" at the PROM "ok" prompt or "reboot -- -r" from the OS -- there are a few more ways to do it, but I digress).

And, to answer the inevitable, and reasonable, question of how to re-enable a network device in the Solaris PROM once you've disabled it, here's how:

ok setenv use-nvramrc? false
ok nvedit
0: " /pci@1f,4000/network@1"
delete-device
ctrl-u <--- Hitting the control key and the u key together will delete the current line from the nvedit buffer. You only need to do this for the device you previously wanted to ignore.
ctrl-u <--- You usually won't have to type this twice, but you can erase as many lines in the buffer as you want before exiting your nvedit session. In this case, you must delete two lines since the device and the delete-device instruction are on separate lines.
ctrl-c
ok nvstore
ok reset-all
ok boot -r


Hopefully this has saved you more headaches than it can potentially cause :)

, Mike