Thursday, February 5, 2009

A Little VCS NFS Gotcha On Solaris 10

Hey again,

We're going back to the Solaris 10 Unix well, reaching back a little (as opposed to the 14-month reach-back we did yesterday ;) and adding a little something to our posts on adding NFS management to an existing VCS cluster, as well as the follow-up on how to do the exact same thing without taking your VCS cluster offline. Today's post is another little bit of fix-it knowledge to keep in the back of your hat (if that's even an expression anyone's ever used... if not, consider it © ® ™ us ;). And, of course, this piece of knowledge came to us by accident. Actually, by virtue of an accident... the answer was found methodically... I think ;)

In any event, after extensively field-testing the methods espoused in the earlier posts referenced above, a deployment of clustered servers to an offsite location ran into an issue that we weren't able to anticipate (or cause to occur in previous cookie-cutter-similar deployments). For some reason, when we rolled this cluster out, NFS just refused to work in a failover capacity. More specifically, after a failover, the primary node could no longer mount the NFS resource from the failover node. The problem seems pedestrian (even still ;) - the only odd thing was that it had never happened before under identical circumstances.

Here's what we figured out along the way (and how to fix it, too ;) For our purposes today (and the way it was then), the NFS cluster component works fine on node-b, but node-a can't mount the NFS resource once the service group fails over to node-b.

1. The first thing most people do in any investigation is to see if the basic stuff is all up and running. We don't like to be different, so we duly checked that all of the required VCS resources were up and online. They were, which explained the puzzling ONLINE state ;)
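If you want to follow along at home, that step-1 sanity check can be sketched with the stock VCS command-line tools (hastatus and hares are the standard VCS commands, but the resource name nfs_share below is made up for illustration - substitute your own):

```shell
# A hedged sketch of the step-1 sanity check, run as root on any cluster
# node. The VCS commands are shown as comments since they only exist on
# a VCS host; "nfs_share" is a hypothetical resource name.
#
#   hastatus -sum                        # summary of all groups/resources
#   hares -state nfs_share -sys node-b   # one resource's state on node-b
#
# A tiny helper if you want to script the check: succeeds only when the
# captured state string contains the whole word ONLINE.
vcs_state_ok() {
  echo "$1" | grep -qw ONLINE
}
```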

2. We then proceeded to ensure that node-b was, in fact, sharing out the NFS resource. Commands like showmount indicated that it was. A little research into the subject showed that the issue we ended up having can indicate an RPC failure at this point as well, but it's best to try step 3, too, just to be sure the problem isn't confined to a single server (although the fix is the same no matter which way your story goes ;)
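On node-b itself, that local check looks something like the following (a sketch; run as root on whichever node currently owns the NFS service group - output will vary by setup):

```shell
# Hedged sketch of the step-2 local check on node-b. The commands are
# commented out since they only make sense on the NFS-serving host:
#
#   showmount -e     # with no host argument, lists this node's exports
#   rpcinfo -p       # mountd/nfsd should be registered with rpcbind
#
# Helper for scripting: does a captured showmount export list contain
# the path we expect to be shared?
export_listed() {
  # $1 = showmount -e output, $2 = export path to look for
  echo "$1" | grep -q "^$2"
}
```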

3. Then we finally struck gold, and got an actual error, when we tried to hit the mount from node-a:

node-a # showmount -e node-b
showmount: node-b: RPC: Rpcbind failure - RPC: Authentication error
node-a # rpcinfo -p node-b
rpcinfo: can't contact portmapper: RPC: Authentication error; why = Failed (unspecified error)

4. Unspecified errors are the best kind of errors you can get, since there's a much wider variety of possible solutions you can come up with... Or maybe I have that backwards... There's really not much more to step 4. This step is a practice in surrealism ;)

5. It turns out that the answer lay in changing rpcbind's properties away from the defaults on both servers. The answer to the problem (or the fix, if you will) actually makes more sense than the way things "usually" work. What we needed to do was set rpcbind to answer requests globally; by default, its local_only property was set to true, so it answered local requests only. Oddly enough, we double-checked and that same default is still in place on other cluster setups we have running, in which everything is hunky-dory. You need to do these steps on both nodes (or all nodes) in your cluster; here, we're only showing what we typed on the active NFS resource-sharing node:

node-b # svcprop network/rpc/bind:default | grep local_only <-- See if the local_only property is set
config/local_only boolean true <-- and there it is!

Then move on to fixing the problem (again, on both nodes) by setting the rpcbind configuration to global (which, in the case of rpcbind, actually means setting the local_only attribute to "false"):

node-b # svccfg
svc:> select network/rpc/bind
svc:/network/rpc/bind> setprop config/local_only=false
svc:/network/rpc/bind> quit

6. Then, just double check to make sure you've gotten it all set up correctly:

node-b # svcprop network/rpc/bind:default | grep local_only
config/local_only boolean true

...well, that's not right, but don't give up just yet! Keep typing. Type, Forrest, Type! ;)

node-b # svcadm refresh network/rpc/bind:default
node-b # svcprop network/rpc/bind:default | grep local_only
config/local_only boolean false

there... that's better.
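Incidentally, if you'd rather skip the interactive svccfg shell, steps 5 and 6 collapse into a couple of one-liners (svccfg -s is the standard non-interactive form of the same commands; run them as root on every node in the cluster):

```shell
# One-shot version of the fix above: the non-interactive svccfg form,
# plus the refresh that makes it stick. Commented out since these only
# exist on a Solaris host; run as root on EVERY cluster node.
#
#   svccfg -s network/rpc/bind setprop config/local_only=false
#   svcadm refresh network/rpc/bind:default
#   svcprop network/rpc/bind:default | grep local_only
#
# Helper that pulls the boolean out of that last svcprop line, so a
# script can decide whether the refresh actually took:
local_only_value() {
  # e.g. "config/local_only boolean true" -> "true"
  echo "$1" | awk '{ print $3 }'
}
```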

7. Finally, just make sure you can mount your NFS resource from whichever node isn't currently hosting it. You don't necessarily have to test from both nodes once you've applied the fix on both, but why risk the near-future embarrassment?

node-a # showmount -e node-b
export list for node-b:
/our/shared/directory (everyone)
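And if you want to go one step further than showmount, nothing settles it like a real throwaway mount from node-a (the /mnt/nfstest mount point below is just our assumption - use any empty directory you like):

```shell
# Hedged smoke test from node-a: actually mount the share, look at it,
# and clean up. Commented out since it needs the live cluster;
# /mnt/nfstest is a hypothetical scratch mount point.
#
#   mkdir -p /mnt/nfstest
#   mount -F nfs node-b:/our/shared/directory /mnt/nfstest
#   df -k /mnt/nfstest
#   umount /mnt/nfstest
#
# Helper for scripting: does a captured df -k line show the filesystem
# as coming from the server we expect?
mounted_from() {
  # $1 = a df -k output line, $2 = expected NFS server name
  case "$1" in
    "$2":*) return 0 ;;
    *)      return 1 ;;
  esac
}
```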

And that's that. You should be good to go :) Since all's well that ends well, we'll try not to leave you with any clichés in our farewell. Parting is, after all, such sweet sorrow. At least until tomorrow :)


, Mike

