Wednesday, June 3, 2009

Troubleshooting NetBackup Error 6 Failures More Easily

Hey there,

Here's another one for all the NetBackup aficionados out there (in a long list of posts, including our last one on solving NetBackup issues through sheer exhaustion.

This post probably won't be interesting unless you're having this problem or have had it in the past. Also, if you're a woman who is, or may become, pregnant (which I alwasy figured was a polite of saying "not barren"), reading this post may result in a rare but fatal side effect ;) ...Sorry, but I just love pharmaceutical disclaimers in advertising these days. Some guy who's got a severe rash, the sweats, heart palpitations, sudden uncontrollable flatulence, nausea, dizziness and vertigo is actually happy to take on all those side-effects just so he won't get the sniffles anymore. I'm not that positive. I'll deal with my allergies if it means I don't have to wear a colostomy bag ;)

Anyway, on to the meat of the post:

This has probably happened to most folks who've had to set up one of many aspects of NetBackup (new tape drives, new tape robots, new policies, new backup clients, etc). It's the dreaded "Error (6)" which Veritas/Symantec describes as:

host # /usr/openv/netbackup/bin/admincmd/bperror -S 6
the backup failed to back up the requested files
Errors caused the user backup to fail.

This is a perfectly reasonable error message and is probably true for every backup you've ever had that didn't work ;) And, for that very same reason, it can be a headache. Luckily NetBackup is good enough to provide other, more specific, error messages (in the hundreds) to make sure this error generally only applies to fairly specific situations.

You can check out this Veritas Support Page for NetBackup Error 6 that's a good starting point. It covers a bit of what we'll cover here and was actually helpful in nudging me in the right direction on this occasion.

If you ever do get stuck with that error, here's a pretty good way to resolve your issue (no matter what type of policy you're using - Standard, Oracle, etc):

NOTE: All examples are from Solaris 10. While the file locations on most Linux and Unix distro's are exactly the same, things may not be where my examples show they are (...another disclaimer ;)

1. Turn on verbose logging right off the bat. You can start out with specific logging, if you're pretty sure you know what the issue is, but, assuming you're totally lost, you may as well get all the information you can from NetBackup. You can turn off the logging as soon as you've figured out what's wrong and fixed it, by simply undoing what you do in this step:

a. Create the following log directories (some are not in the provided "mklogdir" script to create logging directories) in your NetBackup directory (/usr/openv/netbackup):

host # cd /usr/openv/netbackup
host # mkdir admin bpcompatd bpjava-msvc bpsynth symlogs bpbrm bpdbm bpjava-susvc bptm user_ops bpbrmds bpdm bpjobd vnetd bpcd bphdb bprd vopied

b. Then add the following line to your bp.conf file (no restart of NetBackup is necessary for these changes in logging to take effect):

host # echo "VERBOSE = 1" >>/usr/openv/netbackup/bp.conf

2. Now, kick off the job that's constantly failing with the 6 error (do it via the GUI or using bpadm selecting m, browsing to the policy you need to backup, and then typing i to initiate the manual backup)

3. Once everything fails, you'll have some reading to do. For our purposes we'll assume that your backup policy kicks off a script to run some RMAN backups for Oracle (why not?). The first place to look is in /usr/openv/netbackup/logs/bpcd, and read the log with today's date tagged on the end (probably the only log if you just created these dirs, but maybe not). It's also sometimes easier to just tail this file while you kick off your backup:

NOTE: This output is from one job that fails immediately with error 6. It seems like a lot of nonsense and (in most circumstances) you really don't "need" to pay attention to it all. For instance, if you're getting an error 6, you don't have to worry about the server name comparison, generally, since there are several other error codes that deal with that. For all of its bluster, the only thing we really care about, with regards to error 6 (assuming it's not a canard) is the last line (in italics)

host # cd /usr/openv/netbackup/logs/bpcd
host # tail -100 `ls -1tr|tail -1`
16:47:11.062 [3213] <2> bpcd main: offset to GMT 18000
16:47:11.062 [3213] <2> logconnections: BPCD ACCEPT FROM TO
16:47:11.062 [3213] <2> bpcd main: setup_sockopts complete
16:47:11.063 [3213] <2> bpcd peer_hostname: Connection from host vhost1 ( port 37567
16:47:11.063 [3213] <2> bpcd valid_server: comparing vhost1 and vhost1
16:47:11.064 [3213] <4> bpcd valid_server: hostname comparison succeeded
16:47:11.064 [3213] <2> bpcd main: output socket port number = 1
16:47:11.076 [3213] <2> bpcd main: Duplicated vnetd socket on stderr
16:47:11.076 [3213] <2> bpcd main: <---- NetBackup 6.5 0 ------------initiated
16:47:11.076 [3213] <2> bpcd main: VERBOSE = 0
16:47:11.076 [3213] <2> bpcd main: Not using VxSS authentication with vhost1
16:47:11.078 [3213] <2> bpcd main: BPCD_FORK_CMD_RQST
16:47:11.078 [3213] <2> bpcd main: fork cmd = /usr/openv/netbackup/bin/bptm bptm -unload -dn Drive001 -dp /dev/rmt/0cbn -dk 2000027 -m C03426 -mk 4000033 -mds 8 -alocid 34 -jobid -1243881659 -jm
16:47:11.078 [3213] <2> bpcd main: filter type = -1
16:47:11.078 [3213] <2> bpcd main: backuptime before fork = 0
16:47:11.080 [3213] <2> bpcd exit_bpcd: exit status 0 ----------->exiting
16:47:11.082 [3215] <2> bpcd main: Converting args string into an argv
16:47:11.083 [3215] <2> bpcd main: Done converting args string into an argv
16:47:11.083 [3215] <2> bpcd main: child_args[0] = /usr/openv/netbackup/bin/bptm
16:47:11.083 [3215] <2> bpcd main: child_args[1] = bptm
16:47:11.083 [3215] <2> bpcd main: child_args[2] = -unload
16:47:11.083 [3215] <2> bpcd main: child_args[3] = -dn
16:47:11.083 [3215] <2> bpcd main: child_args[4] = Drive001
16:47:11.083 [3215] <2> bpcd main: child_args[5] = -dp
16:47:11.083 [3215] <2> bpcd main: child_args[6] = /dev/rmt/0cbn
16:47:11.083 [3215] <2> bpcd main: child_args[7] = -dk
16:47:11.083 [3215] <2> bpcd main: child_args[8] = 2000027
16:47:11.083 [3215] <2> bpcd main: child_args[9] = -m
16:47:11.083 [3215] <2> bpcd main: child_args[10] = C03426
16:47:11.083 [3215] <2> bpcd main: child_args[11] = -mk
16:47:11.083 [3215] <2> bpcd main: child_args[12] = 4000033
16:47:11.083 [3215] <2> bpcd main: child_args[13] = -mds
16:47:11.083 [3215] <2> bpcd main: child_args[14] = 8
16:47:11.083 [3215] <2> bpcd main: child_args[15] = -alocid
16:47:11.083 [3215] <2> bpcd main: child_args[16] = 34
16:47:11.083 [3215] <2> bpcd main: child_args[17] = -jobid
16:47:11.083 [3215] <2> bpcd main: child_args[18] = -1243881659
16:47:11.083 [3215] <2> bpcd main: child_args[19] = -jm
16:47:11.083 [3215] <2> bpcd main: Before execvp of command

The "Before execvp of command" line that ends the output usually means that your NetBackup command either ran out of time before it started (not likely, but maybe depending upon your setup - different systems in different time zones might cause this error, or very very very slooow machines) or that it failed to execute because one of the files in the policy doesn't exist. There are plenty of other reasons this could happen, but the latter reason, above, is generally it.

4. Once you've determined that your policy is trying to back up a file that either doesn't exist (or it doesn't have permission to access), the next step is to determine what that file is. The "bphpd" directory we created before will give you very specific information with regards to this as long as you have VERBOSE logging enabled and have created the directory yourself (it doesn't get created by the "mklogdir" script). If you've just created the directory and run your job, there will be three files present (one for stdout, one for stderr and one for both. I always use the one with both - log.060209 in this instance):

host # cd /usr/openv/netbackup/logs/bphpd
host # ls
log.060209 obk_stderr.060209 obk_stdout.060209

Checking out this log can bring you to a resolution very quickly. For instance, in the example for today, our policy was attempting to run an Oracle script (that may very well have had problems itself - you'd go looking at that to make sure it worked at all if you couldn't find the problem in here) that had been entered into the policy with an incorrect name (Doh!):

host # cat log.060209
14:53:38.341 [11086] **LOCALE ERROR** locale not found in file
14:53:38.342 [11086] <2> logparams: -sb -rdbms oracle -S host1 -to 300 -c ORACLE_HOT_BACKUP -s piping_hot_backup -clnt vhost1 -FULL -kl 28 -b vhost1_1243968817 -jobid 135 -shm
14:53:38.342 [11086] <4> bphdb main: INF - setenv NB_ORA_POLICY=ORACLE_HOT_BACKUP
14:53:38.342 [11086] <4> bphdb main: INF - setenv NB_ORA_CLIENT=vhost1
14:53:38.342 [11086] <4> bphdb main: INF - setenv NB_ORA_MODE=B
14:53:38.342 [11086] <4> bphdb main: INF - NB_ORA_POLICY=ORACLE_HOT_BACKUP
14:53:38.342 [11086] <4> bphdb main: INF - NB_ORA_SCHED not defined.
14:53:38.342 [11086] <4> bphdb main: INF - NB_ORA_PC_SCHED=piping_hot_backup
14:53:38.342 [11086] <4> bphdb main: INF - NB_ORA_SERV=host1
14:53:38.342 [11086] <4> bphdb main: INF - NB_ORA_PC_BTYPE not set
14:53:38.342 [11086] <4> bphdb main: INF - setenv NB_ORA_FULL=1
14:53:38.342 [11086] <4> bphdb main: INF - setenv NB_ORA_INCR=0
14:53:38.343 [11086] <4> bphdb main: INF - setenv NB_ORA_CINC=0
14:53:38.343 [11086] <4> bphdb main: INF - setenv NB_ORA_SCHEDULED=1
14:53:38.343 [11086] <4> bphdb sync_server: INF - BACKUP START
14:53:38.343 [11086] <4> bphdb sync_server: INF - CONTINUE BACKUP message received.
14:53:38.343 [11086] <2> bphdb get_filelist: INF - Read filename:
14:53:38.343 [11086] <2> bphdb get_filelist: INF - Read filename:
14:53:38.343 [11086] <4> bphdb do_backup: INF - Processing /usr/local/bin/
14:53:38.345 [11086] <4> bphdb keepalive_timeout: INF - bphdb still working.
14:53:38.345 [11086] <2> bphdb keepalive_timeout: INF - bphdb- Sending a keepalive.
14:53:38.345 [11086] <4> bphdb do_backup: INF - Keepalives will be sent every 150 seconds.
14:53:38.345 [11086] <4> bphdb do_backup: INF - Waiting for the child status.
14:53:38.347 [11087] <4> bphdb do_backup: INF - Child executing /usr/local/bin/
14:53:38.347 [11087] <32> bphdb do_backup: FTL - Child exec of /usr/local/bin/ failed.
14:53:38.347 [11087] <32> bphdb do_backup: errno = 2: No such file or directory
14:53:38.347 [11087] <16> bphdb Exit: ERR - bphdb exit status = 29: failed trying to exec a command

and there's a lot more where that came from, but the final lines (that we're showing above) indicates our issue very clearly. The file "/usr/local/bin/" does not exist. In fact, the file name is supposed to be "/usr/local/bin/" (We all make mistakes when we're overworked, underfed and sleep deprived. Lots of people make mistakes while they're operating at 100% capacity. There's no shame in it ;)

5. Changing that attribute in our policy, and rerunning the job results in success! Now it's time to undo everything we did in step 1, assuming we don't have a lot of space for useless debug logs (useless only, of course, if everything is okay, for the most part... I smell another disclaimer coming on ;)

Here's the kicker. While this may seem like a very simplistic breakdown of what's certainly a more complex process, it actually is just that; the result of hours of kicking the tires before finding out that the answer was right under my nose! Like I said; we all make mistakes. There's nothing to be ashamed of. I'm going to go curl up in a ball and twitch for a while now ... ;)


, Mike

Discover the Free Ebook that shows you how to make 100% commissions on ClickBank!

Please note that this blog accepts comments via email only. See our Mission And Policy Statement for further details.