Sunday, November 25, 2007

TSM Backup's Most Misleading Error Message

As much as I'd love to, I won't give you my opinion of Tivoli Storage Manager (TSM) in this post. It's not really about what I think, but will hopefully help out a few folks. TSM, in my experience, is not too difficult to install and set up on a host, but troubleshooting it can be another issue. A lot of times, the error messages it gives you lead you away from the problem rather than to it.

Which brings us to today's post topic: TSM's most misleading error message. This error specifically has to do with Oracle backups.

Assuming your installation has gone without incident, you've already logged into the TSM server from your host and verified all's well, like so:

host.xyz.com# dsmc q sess

After you log in (you'll be prompted if you haven't already set up automatic login), you see some basic information: the TSM version, when the last backup ran, the backup schedule, and so on.

Now, you're ready to turn the product over to the client. It's been installed, but never run. And, surely enough, the next day (at the latest ;) you receive a call that the Oracle backups didn't work. Some minimal investigation into the log files (since logging into the TSM interface from the host using "dsmc" doesn't seem to show any issues, except that no Oracle backups have run) shows that something's very very wrong! You find yourself faced with this scary error:

ORA-19554: error allocating device, device type: SBT_TAPE, device name:
ORA-27000: skgfqsbi: failed to initialize storage subsystem (SBT) layer Linux Error: 106: Transport endpoint is already connected
Additional information: 7011
ORA-19511: Error received from media manager layer, error text:
SBT error = 7011, errno = 106, sbtopen: system error


So, your first inclination might be to go talk to the folks that manage site backups, because there's obviously a problem with the tape loader or a tape drive. But, this couldn't be farther from the truth.

Here's the kicker. The real issue has nothing at all to do with the error message! The issue is with the "dsmc" process not being able to write to its logs as the oracle user id! Yes, that's true :)

The good news is that, once you've gotten past the mystery, the solution is relatively simple. You can fix this problem very quickly by doing the following:

1. View the dsm.sys file that you've set up (it should be linked to /usr/bin/dsm.sys) and find out where the error logs and activity logs are being written. If you haven't specified this, your logs should be under the default installation directory, under logs/tivoli/tsm (the exact location may vary depending on your version).

2. Now simply fix the permissions on the logs so that the oracle user id can write to them:

chgrp dba errlog.log   # Assuming dba is oracle's primary group.
chmod g+w errlog.log

3. Verify that the oracle user id can also get into every directory above the log files. The easiest way to figure out if you've gotten it right is to use the oracle user id to do the check:

su - oracle
cd /log/location/
echo hi >>errlog.log
vi errlog.log   # To remove the line with "hi" on it, if you want.
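
If you'd rather not eyeball it, the three steps above can be roughed out as a small script. Treat this as a sketch, not gospel: the function names are my own invention, it only matches the fully spelled-out ERRORLOGNAME and SCHEDLOGNAME options in dsm.sys (abbreviated option names are not caught), and it assumes the log paths contain no spaces.

```shell
#!/bin/sh
# Sketch only: tsm_log_paths and check_logs_writable are hypothetical helper
# names, not part of TSM.

# Print the log file paths named in a dsm.sys-style options file.
# Only the fully spelled-out option names are matched (case-insensitive).
tsm_log_paths() {
    # $1: path to the dsm.sys file
    awk 'toupper($1) == "ERRORLOGNAME" || toupper($1) == "SCHEDLOGNAME" { print $2 }' "$1"
}

# Report, for each log file found, whether the *current* user can write to it.
# Returns non-zero if any log is missing or not writable.
check_logs_writable() {
    status=0
    for log in $(tsm_log_paths "$1"); do
        if [ -w "$log" ]; then
            echo "OK   $log"
        else
            echo "FAIL $log"
            status=1
        fi
    done
    return $status
}
```

Source the script and run check_logs_writable /usr/bin/dsm.sys after "su - oracle", not as root. Root passes the write test on pretty much anything, which is exactly how this problem hides from you in the first place. Any line that says FAIL is a log (or a directory above it) that still needs the chgrp/chmod treatment.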

Once that's all set, the "bad tape" error goes away and Oracle backups start working. How about that? :)

, Mike