Tuesday, February 26, 2008

Offlining, Failing Over And Switching in VCS

Hey there,

Today we're going to address a question that's asked commonly enough by folks who use Veritas Cluster Server: What's the difference between offlining, failing over and switching when dealing with service groups across multiple nodes? That doesn't seem like a very common question. I'm sure it's probably hardly ever phrased that way ;)

Anyway, to the point, all three options above are useful (hence your ability to avail yourself of them) and sufficiently distinct that you should be sure you're using the one you want to, depending on what ends you wish to achieve. All of this information is fairly general and should work on VCS for Linux as well as Unix.

We'll deal with them one by one, outlining what they basically do, with a short example command line to demonstrate, where applicable. None of this stuff is hard to pick up and run with, as long as all the components of your VCS cluster are set up correctly. Occasionally, you may see errors if things aren't exactly perfect, like we noted in this post on recovering indefinitely hung VCS resources.

1. Offlining. The distinction to be made here, most plainly, is that, when you offline a resource or service group, you are "only" doing that. This differs from failover and switching in that the service group you offline with this particular option is not brought online anywhere else as a result. So, when you execute, for instance:

host # hagrp -offline YOUR_SERVICE_GROUP -sys host1

that service group, and its resources, generally, are taken offline on host1, and nothing else happens. If you're operating in an environment where systems don't run service groups concurrently (that is, not active/active), you will have effectively "shut down" that service group, and any services it provides, for the entire cluster.
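If you'd like to see that "nothing else happens" part for yourself, here's a minimal sketch that offlines the group and then asks VCS for the group's state on every node. The run helper, the DRY_RUN switch and the offline_group function are made up for this example (they aren't part of VCS); with DRY_RUN left at 1, the script only prints the hagrp commands it would have run.

```shell
#!/bin/sh
# Sketch only: GROUP and NODE are placeholders for your own names.
# DRY_RUN=1 (the default) prints the commands instead of running them.
GROUP="${GROUP:-YOUR_SERVICE_GROUP}"
NODE="${NODE:-host1}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

offline_group() {
    # Take the group offline on one node only...
    run hagrp -offline "$GROUP" -sys "$NODE"
    # ...then list its state on every system in the cluster.
    run hagrp -state "$GROUP"
}

offline_group
```

On a real cluster you'd set DRY_RUN=0. The state listing should not show the group coming online on any other node, since offlining doesn't move anything.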

2. Failing Over. This is more of a concept than any particular command. When you have your cluster set up to fail over, if a resource or service group, etc., goes offline on one node (host1, for instance) and it wasn't brought offline on purpose, VCS will naturally attempt to bring it online on the next available node (listed in your main.cf configuration file). Needless to say, if only one host is listed for a particular service group, its failure on that host will mean the failure of the entire service group. It also obviates the use of VCS in the first place ;)
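A quick way to sanity-check the "only one host listed" situation is to count the systems in the group's SystemList attribute. The sketch below is just a small helper for that; count_systems is a made-up name, and it assumes the value arrives as whitespace-separated "name priority" pairs. Treat the hagrp -value output format shown in the comments as an assumption and check it against what your own cluster prints.

```shell
#!/bin/sh
# count_systems: given a SystemList value written as
# "name priority name priority ...", print how many systems it names.
count_systems() {
    # Unquoted on purpose so the shell splits the value into words.
    set -- $1
    echo $(( $# / 2 ))
}

# Assumed usage on a cluster node (the output format is an assumption):
#   systems="$(hagrp -value YOUR_SERVICE_GROUP SystemList)"
#   [ "$(count_systems "$systems")" -gt 1 ] || echo "no failover target listed"

count_systems "host1 0 host2 1"
```

If that count comes back as 1, there's nowhere for VCS to fail the group over to.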

3. Switching. This is what most folks actually want to do when they "offline" a service group, as in point 1. Since VCS automatically fails over unexpectedly offlined resources on its own (when it's set up to), it's reasonable for someone new to the product to assume that offlining a service group would kick off the same kind of move. Unfortunately, this isn't the case. If you want to switch a service group from host1 to host2, for example, these would be the command options you'd want to give to hagrp:

host # hagrp -switch YOUR_SERVICE_GROUP -to host2 <--- Assuming you're running this from host1.
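If you script your switches, a small guard against "switching" a group to the node it's already on can save some head scratching. Here's a rough sketch: switch_cmd is a hypothetical helper (not a VCS command) that only prints the hagrp line for you to review or pipe to sh, and you supply the current node yourself.

```shell
#!/bin/sh
# switch_cmd GROUP FROM TO: print the hagrp command for a manual switch,
# or fail if TO is the node the group is already running on.
switch_cmd() {
    group="$1"; from="$2"; to="$3"
    if [ "$from" = "$to" ]; then
        echo "refusing: $group is already on $to" >&2
        return 1
    fi
    echo "hagrp -switch $group -to $to"
}

switch_cmd YOUR_SERVICE_GROUP host1 host2
```

Once a real switch has been issued, hagrp -state YOUR_SERVICE_GROUP is a handy way to watch the group go offline on host1 and come online on host2.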

Hopefully this little guided mini-FAQ helped out with differentiating between the concepts. If you find the command line examples valuable, even better :)

Happy failing! ;) Over.


, Mike