Thursday, February 28, 2008

Finding and Reading Files In The Shell When Your System Is Hung


Today we're going to look at a situation that occurs probably more often than it should ;) If you've done your fair share of system administration on Linux or Unix, you've probably been paged or called to look at a machine that was hanging on the brink, only to find that (assuming you could log in at all) it was so completely trashed it couldn't even muster the strength to run the simple system commands you needed to diagnose the problem before giving up and rebooting.

Typical errors you'd get at the command line, in this sort of situation, would be similar to:

host # ls /tmp
Insufficient Memory!


host # cat file
Cannot fork new process!

Basically, you're stuck in a situation where you can't run anything other than the login shell you were lucky to get in the first place ;)

The good news is that you can quite possibly get all the information you need before rebooting, simply by using your shell and its built-ins. All of these examples should work in sh, ksh, bash, etc. (possibly not in csh, but hopefully that's not your default root shell). If you have a good idea of what's wrong before the machine goes down entirely, they can even help you decide what you want to do before you boot it back up (check out this post for some simple tricks to figure out what commands are available to you in Solaris' PROM).

Here are the things I try to do when I find myself in that sort of situation (in no particular order ;)

1. Move around the filesystem.

Luckily, the "cd" and "pwd" commands are built into the shell, so you can always move around your filesystem (even if you are, figuratively, in the dark) and get to the hot spots you want to check. For instance:

host # cd /var/log

will work just fine. This is the most obvious thing you can do, but cd (on its own) depends on you to know where you're going. You can't cd to a directory that doesn't exist even when things are up and running perfectly ;)

If you happen to get lost, you can figure out where you are, using the built-in "pwd," like so:

host # pwd
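To put the two together, here's a quick sketch of moving around with nothing but built-ins (the directory names are just examples):

```shell
# "cd" and "pwd" are shell built-ins, so neither one forks a new process.
cd /tmp        # works even when "Cannot fork" errors are flying
pwd            # prints the current working directory
cd -           # "cd -" (also a built-in) hops back to the previous directory
```

In bash, "cd -" also prints the directory it returns you to (it's shorthand for "cd $OLDPWD"), which is handy when you've lost track of where you came from.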

2. Take a look at the contents of your filesystem.

This is actually pretty obvious as well, once you realize how to do it. You won't be able to use "ls" anymore, since that's an external command the shell has to fork to run, but you can always use the built-in "echo" command, like so:

host # echo *
bin opt sbin usr tmp var

Note that this output won't usually be so pretty. If there are 50 files and/or directories in the directory you're in, you'll just get 50 filenames in a row on one line. But, it's better than nothing :)
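If the jumble bothers you, one small trick (still using only built-ins) is to let a "for" loop print one name per line. Keep in mind that the "*" glob skips dotfiles, so check for hidden files separately:

```shell
# Print one directory entry per line using only "for" and "echo".
for f in *; do
    echo "$f"
done

# The "*" glob skips hidden files; "echo .*" shows those (including . and ..).
echo .*
```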

3. Capture the contents of critical files that can help you troubleshoot and/or find your root cause so this won't ever happen again (hopefully).

This last one is slightly less obvious, but it can be done in a variety of ways. The first two ways are messy, but they work: simply read your file as though you were sourcing it, using either the "source" or dot (".") built-in. The reason I don't prefer these two methods is that your screen fills up with a lot of garbage, and you may hang your system by sourcing a file that contains executable statements or commands. For our example here, we'll assume a file named BOB with one line in it that says "hi":

host # source ./BOB
-bash: hi: command not found


host # . ./BOB
-bash: hi: command not found

You've gotten your output, but you can see where the potential problem lies. What if "hi" were a command and your "source" directive tried to run it on your already half-dead machine? It might take it all the way down right then and there.

In these instances, I think output redirection is your best bet. You can read the contents of a file by opening a new file descriptor on it and reading from that with "read" and "echo." You can use pretty much any descriptor number that's free (although try to stay away from 0, 1 and 2, as those are your shell's STDIN, STDOUT and STDERR), you won't have to pick your file's contents out of a clutter of error messages, and you insulate yourself from accidentally executing any commands. For instance, the following would get you much better results:

host # exec 7<BOB <--- Open file descriptor 7 for reading from your BOB file.
host # while read -u7 line; do echo "$line"; done <--- Then just read from the file descriptor.
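If you expect to do this more than once, you can wrap the same trick in a shell function (functions are handled by the shell itself, so this still doesn't fork anything). The name "shcat" below is just something made up for illustration, not a real command, and "read -u7" is bash/ksh syntax:

```shell
# A built-ins-only stand-in for "cat" (the name "shcat" is made up).
shcat() {
    exec 7<"$1"              # open file descriptor 7 for reading from the file
    while read -u7 line; do
        echo "$line"
    done
    exec 7<&-                # close descriptor 7 when we're done with it
}
```

In our example, "shcat BOB" would print "hi" without any risk of executing it. Closing the descriptor with "exec 7<&-" at the end is just good hygiene, so the next open starts clean.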

And that's about it. If you use all 3 of these methods in whatever combinations are necessary, you should be able to collect most of the information you need to assess your situation and/or provide root cause.

For one last example that uses all 3, this is how I would go about getting the contents of my /var/log/syslog files (I'll shorten the output to only include the relevant stuff). Note that I'm also doing the syslog reads in a command-line "for loop," because I want to get all the information I can with as little typing as possible:

host # pwd
host # cd /var/log
host # echo *
syslog syslog.1 syslog.2
host # for x in syslog.2 syslog.1 syslog; do exec 7<"$x"; while read -u7 line; do echo "$line"; done; done
<--- All of the syslog files' output. Notice that I read them in reverse order, so that the output runs from oldest to newest. It's also a good idea (if possible) to either log your terminal session or set your terminal client's scrollback buffer to a very large number, so that you can cut and paste this output into your desktop editor.
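One caveat: "read -u" is a bash/ksh extension. If the only shell you can get is plain POSIX sh, the same trick works by redirecting the whole loop instead; the shell opens the file for the redirection itself, with no fork. This is just a sketch, with a file-exists guard added so it's harmless to run anywhere:

```shell
# POSIX-sh variant: redirect the loop itself instead of using "read -u".
for x in syslog.2 syslog.1 syslog; do
    [ -f "$x" ] || continue      # "[" is a built-in in bash/ksh/dash, so no fork
    while IFS= read -r line; do
        echo "$line"
    done < "$x"                  # the shell opens the file; no fork needed
done
```

The "IFS= read -r" incantation keeps the shell from mangling leading whitespace and backslashes in your log lines, which matters more for syslog output than it did for BOB.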

Hope this helps you out :)

Best wishes,

Mike