
Tuesday, August 12, 2008

Recovering Deleted Files By Inode Number In Linux And Unix

Hey there,

It should be noted, at the outset, that this post is limited in its scope. We're going to be looking at one particular way in which you can recover an accidentally deleted file on Linux or Unix ( Tested on RHEL3 and Solaris 8 ). If you ever want to scour a hard drive that you need to get lots of information back from, then (assuming that you quarantined it immediately upon noticing this and haven't written to it since) you should check out The Coroner's Toolkit. Specifically, you'll want to look at the "ils" or "icat" programs, and most probably the "grave-robber" application to recover as much of everything as possible.
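
Just to give you the flavor of it, a TCT recovery session boils down to pointing those tools at the raw device plus an inode number. This is only a rough sketch (the device name and inode number below are made up, and you'd be working against the quarantined disk, or better yet a copy of it, never the live filesystem):

host # ils /dev/sdb1 <--- lists inode information for removed files
host # icat /dev/sdb1 12345 > recovered.file <--- dumps that inode's data blocks into a new file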

In our post, we're going to look at one condition under which Unix and Linux operating systems can actually hold onto files, even after they've been deleted, so that you can recover them. Of course, you have to realize you've deleted something you wanted to keep (in most cases) immediately, and a fairly specific set of circumstances has to be in play.

Note that all demonstrations have been tested on Solaris and RedHat. We're using RedHat's output, but the differences between the two were minor enough that they didn't bear repeating.

First, we'll look at the controlled scenario: In this case, we're going to create a file. Then we're going to delete it, knowing that we want to retrieve it afterward. How many times does that happen "by accident"? ;)

host # echo "hi there" >>FILE
host # ls -lai
total 8
57794 drwxr-s--- 2 user group 4096 Aug 11 12:36 .
851795 drwxr-s--- 3 user group 4096 Aug 11 12:31 ..
57795 -rw-r----- 1 user group 9 Aug 11 12:36 FILE


Now that the file's created, we can cat it very easily by referencing the inode number (obtained with "ls -lai" above):

host # find . -inum 57795 -exec cat {} \;
hi there


Then we'll set ourselves up so that we can delete the file and still be able to get it back. In order for us to be able to retrieve the deleted file later, we'll need to associate it with a filehandle. One easy way to do that is to run a tail (or similar command) on it:

host # tail -f FILE &
[1] 4741


and then we'll delete it:

host # rm FILE
host # ls -lai
total 8
57794 drwxr-s--- 2 user group 4096 Aug 11 12:41 .
851795 drwxr-s--- 3 user group 4096 Aug 11 12:31 ..


So, now it's gone. But, and this is the only thing that's saving us a whole lot of headache, the file is still open in memory since we still have a "tail -f" job running in the background. This means that the tail command still has a filehandle open for our file FILE. Of course, we can't refer to it by that name anymore. One interesting thing to note about an inode is that it contains virtually all the information you ever wanted to know about your file... except its name! :)

Therefore, the following query with lsof fails to produce results:

host # lsof|grep FILE

IMPORTANT NOTE: If you have followed this process closely and either wrote down or remembered the "inode number" of the FILE file before we deleted it, you can skip all of this lsof stuff. Jump straight to the next IMPORTANT NOTE in this post :)

A quick look at what pseudo-terminal we're using, coupled with the knowledge that we ran "tail -f" on the file, makes for a pretty tidy grep string (of course, you don't need to have this much information. You can do this without a filter and just have more lsof output to muck through):

host # tty
/dev/pts/0
host # lsof|grep pts/0|grep tail
tail 4741 user 0u CHR 136,0 2 /dev/pts/0
tail 4741 user 1u CHR 136,0 2 /dev/pts/0
tail 4741 user 2u CHR 136,0 2 /dev/pts/0


Now we know the PID (second column in from the left) of the process that still has the file open and we can use lsof to drill down even further, using the output of the pwd command to whittle down the output:

host # pwd
/home/users/user
host # lsof -p 4741|grep home/users/user
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
tail 4741 user cwd DIR 0,22 4096 57794 /home/users/user (host:/remote/users)
tail 4741 user 3r REG 0,22 9 57795 /home/users/user/FILE (host:/remote/users)


IMPORTANT NOTE: Welcome back, if you already knew the inode number, and "on we go" to everyone! One interesting thing to note about the lsof output is that there is only a "read" filehandle open for FILE. This is normal, since we're doing a "tail -f" and there doesn't need to be a write or read/write file descriptor active.

Now we can verify that our "file" still exists by accessing it via the inode number:

host # find . -inum 57795 -exec cat {} \;
hi there


And, it's still there :) In order to preserve it (since the inode-fd connection will be severed as soon as the "tail -f" quits), we'll use similar "find" syntax to copy the inode to a filename and verify it, like so:

host # find . -inum 57795 -exec cp {} FILE.recovered \;
host # ls -lai
total 8
57794 drwxr-s--- 2 user group 4096 Aug 11 12:42 .
851795 drwxr-s--- 3 user group 4096 Aug 11 12:31 ..
57796 -rw-r----- 1 user group 9 Aug 11 12:42 FILE.recovered
host # cat FILE.recovered
hi there


Now we can quit our backgrounded "tail -f" job and not worry about it.

host # kill %%
host #
[1] + Terminated tail -f FILE


Like I said, it's a neat trick, but it will only work if you get lucky. Pardon the double-cliche, but keeping calm when you realize that you may have royally screwed the pooch can go a long way toward keeping you from having to take the long way home (isolating the disk, scouring it with forensics tools, perhaps grepping the raw filesystem and manually reconstructing, etc).

The one thing to remember is that, if you delete a file on Linux or Unix, as long as one other process that was using it is still up and running after you delete it, there's a possibility that you can get your file back fairly quickly (actual results may vary; the possible scenarios are many and varied).
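
As a side note (and this is a Linux-specific convenience we didn't use above), as long as that holding process is alive you can usually skip the find-by-inode step entirely and copy straight out of its /proc file descriptor directory. Using the PID (4741) and fd (3) that lsof showed us earlier, that would look something like:

host # cp /proc/4741/fd/3 FILE.recovered

The same caveat applies, of course: the instant the "tail -f" exits, that /proc entry, and your shot at the data, go with it.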

And, to give us a fair sendoff, here's a quick look at what you can expect if you delete a file and no process is attached to it and/or has a filehandle open against it (If you don't like watching train-wrecks, then I'll bid a good evening to you right here and now - cheers :)

host # ls -lai
total 8
57794 drwxr-s--- 2 user group 4096 Aug 11 12:43 .
851795 drwxr-s--- 3 user group 4096 Aug 11 12:31 ..
57796 -rw-r----- 1 user group 9 Aug 11 12:42 FILE
host # find . -inum 57796 -exec cat {} \;
hi there
host # rm FILE
host # find . -inum 57796 -exec cat {} \;
host # tty
/dev/pts/0
host # lsof|grep pts/0
ksh 3907 user 1u CHR 136,0 2 /dev/pts/0
ksh 3907 user 2u CHR 136,0 2 /dev/pts/0
ksh 3907 user 11u CHR 136,0 2 /dev/pts/0
lsof 5062 user 0u CHR 136,0 2 /dev/pts/0
lsof 5062 user 2u CHR 136,0 2 /dev/pts/0
grep 5063 user 2u CHR 136,0 2 /dev/pts/0
grep 5064 user 1u CHR 136,0 2 /dev/pts/0
grep 5064 user 2u CHR 136,0 2 /dev/pts/0
host # lsof|grep FILE
host # lsof|grep 57796
host # ls -lai
total 8
57794 drwxr-s--- 2 user group 4096 Aug 11 12:44 .
851795 drwxr-s--- 3 user group 4096 Aug 11 12:31 ..


...bummer.

, Mike





Monday, June 9, 2008

Finding The Number Of Open File Descriptors Per Process On Linux And Unix

Hey There,

Today, we're going to take a look at a fairly simple process (no pun intended), but one that (perhaps) doesn't come up enough in our workaday environments that the answer comes to mind as obviously as it should. How does one find the number of open file descriptors being used by any given process?

The question is a bit of a trick, in and of itself, since some folks define "open file descriptors" as the number of files any given process has open at any given time. For our purposes, we'll be very strict, and make the (usually fairly large) distinction between "files open" and "open file descriptors."

Generally, the two easiest ways to find out how many "open files" a process has, at any given point in time, are to use the same utilities you'd use to find a process that's using a network port. On most Linux flavours, you can do this easily with lsof, and on most Unix flavours you can find it with a proc command, such as pfiles for Solaris.

This is where the difference in definitions makes a huge difference in outcome. Both pfiles and lsof report information on "open files," rather than on "open file descriptors" exclusively. So, if, for instance, we were running lsof on Linux against a simple shell process, we might see output like this (all output dummied-up to a certain degree, to protect the innocent ;)

host # lsof -p 2034
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
process 2034 user1 cwd DIR 3,5 4096 49430 /tmp/r (deleted)
process 2034 user1 rtd DIR 3,7 1024 2 /
process 2034 user1 txt REG 3,5 201840 49439 /tmp/r/process (deleted)
process 2034 user1 mem REG 3,7 340771 40255 /lib/ld-2.1.3.so
process 2034 user1 mem REG 3,7 4101836 40258 /lib/libc-2.1.3.so
process 2034 user1 0u CHR 136,9 29484 /dev/pts/9
process 2034 user1 1u CHR 136,9 29484 /dev/pts/9
process 2034 user1 2u CHR 136,9 29484 /dev/pts/9
process 2034 user1 4r CHR 5,0 29477 /dev/tty


However, if we check this same output by interrogating the /proc filesystem, we get much different results:

host # ls -l /proc/2034/fd/
total 0
lrwx------ 1 user1 user1 64 Jul 30 15:16 0 -> /dev/pts/9
lrwx------ 1 user1 user1 64 Jul 30 15:16 1 -> /dev/pts/9
lrwx------ 1 user1 user1 64 Jul 30 15:16 2 -> /dev/pts/9
lrwx------ 1 user1 user1 64 Jul 30 15:16 4 -> /dev/tty


So, we see that, although this one particular process has more than 4 "open files," it actually only has 4 "open file descriptors."

An easy way to iterate through each process's open file descriptors is to just run a simple shell loop, substituting your particular version of ps's arguments, like:

host # for x in `ps -ef| awk '{ print $2 }'`;do ls /proc/$x/fd;done

If you're only interested in the number of open file descriptors per process, you can shorten that output up even more:

host # for x in `ps -ef| awk '{ print $2 }'`;do ls /proc/$x/fd|wc -l;done
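
And, if you'd like each count labeled with its PID (with the inevitable errors from processes that vanish mid-loop quietly thrown away), a slightly tidier variation, just a sketch you should adjust to your own ps flags, might look like:

host # for x in `ps -e -o pid=`;do echo "$x: `ls /proc/$x/fd 2>/dev/null|wc -l`";done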

Here's to being able to over-answer that seemingly simple question in the future ;)

, Mike

Thursday, May 15, 2008

Finding An "Invisible" Proc's Working Directory Without lsof On Linux Or Unix

Ahoy there,

Today, we're going to take a look at something that gets taken for granted a lot these days. lsof (a fine program, to be sure. No debate here) has become a very common staple for finding out information about processes, and where they're hanging out, on most Linux and Unix systems today. Much like the command "top," it provides a simple and robust frontend to what would otherwise be a lot of grunt-work to achieve the same results.

I find that, for the most part, lsof is used to find out where a process is, or what filesystems, etc, it's using, in order to troubleshoot issues. One of the most common is the "mysteriously full, yet empty, disk" phenomenon. Every once in a while that will turn out to be an issue where all of the inodes in a partition have been used before all of the blocks have, which produces confusing output in df, leading to the mistaken assumption that there is plenty of space left on a device even when there isn't.

However, many times, that empty-yet-full disk is the victim of a process that met an untimely demise and never cleaned up a lot of temporary space in memory (or virtual disk, to split hairs). Another issue that lsof is used for is to find out which dag-nabbed process is holding onto a mount-point that claims it's in use when no one is logged on and no user processes are running that would access it (for instance, a really specific, user-defined, mountpoint like /whereILikeToPutMyStuff - Hopefully the OS isn't depending on this to be around ;) Both problems are, essentially, the same.

However, should you find yourself in a situation where lsof either doesn't come with your Operating System, and/or hasn't been installed, you can still break down these two (and I'm just limiting the post to these two particulars so I don't end up writing an embellished manpage ;) separate issues into one, and find the solution to your problems using the commonly available "pwdx" utility.

All pwdx does, at its best, is print out the working directory of any given process (using the process ID as input). But this is enough to get you to the answer you need.
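
If you've never run into it before, the invocation really is as bare-bones as it gets; process ID in, working directory out (PID 1, init, is only used here for illustration, and on most systems its working directory is the root):

host # pwdx 1
1: /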

For instance, we'll take this common scenario: /tmp is reporting 100% full, but df -k shows that /tmp is only at 1% capacity (99% of it is unused). My thinking here almost immediately gravitates toward vi, or some other program that opened up a buffer in memory (using /tmp or /var/tmp), got clipped unexpectedly and never let the system know that it was done with the space it allocated for itself. This would normally not be an issue but, since your Linux or Unix machine "thinks" /tmp is full, whether or not it actually is makes no difference. It won't let you use the free space :(

This command line could be used to figure out what process was using that space in /tmp or /var/tmp:

host # ps -ef|awk '{print $2}'|xargs pwdx 2>&1|grep -iv cannot|grep /tmp
2969: /tmp


Taking it a step further (assuming we trust our own output), we could just skip right to the process in question by adding a bit more to the pipe-chain:

host # ps -ef|awk '{print $2}'|xargs pwdx 2>&1|grep -iv cannot|grep /tmp|sed 's/^\([^:]*\).*$/\1/'|xargs -n1 ps -fp
UID PID PPID C STIME TTY TIME CMD
root 2969 2966 0 Mar 21 ? 0:05 /bin/vi /home/george/myHumungousFile


Since it's May already, we can fairly assume that this PID is pointing to a dead process (especially since it has no TTY associated with it), and (double-checking, just to be sure) we can probably solve our problem by killing that PID. See our previous post on killing zombie processes if it won't seem to go away and "ps -el" shows it in a Z state.

Yes, that example was pretty simplistic, but the same methodology can be used to find other programs using up other filesystems. Much like lsof's directory options ("+d" and "+D"), this approach lets you find out what processes are using what filesystems and narrow down your list of suspects, if you don't nail the correct one right away. And, since pwdx ships with your Linux or Unix OS, it's a lot more likely than lsof to actually be on the box when you need it :)
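
And, if you end up doing this dance often, the whole pipe-chain wraps up nicely into a little throwaway shell function. This is only a sketch (the function name is invented, and you may need to tweak the ps and grep portions for your particular OS):

whatsin() {
        # Usage: whatsin /some/mountpoint
        # Print full ps info for every process whose working directory matches the argument
        ps -ef | awk 'NR > 1 {print $2}' | xargs pwdx 2>&1 | \
                grep -iv cannot | grep "$1" | \
                sed 's/^\([^:]*\).*$/\1/' | xargs -n1 ps -fp
}

host # whatsin /tmp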

Cheers,

, Mike

Monday, December 31, 2007

Network Port Querying Script

Hey there,

The script I've put together here was originally written to meet a certain demand. That demand was actually my own, but that's beside the point ;)

This script should come in useful for you if you ever need to query a port and find out what's going on with it (like who's using it and/or what process id is associated with it). It's simple to invoke (taking only the port number as its argument) and produces information that can be a great aid in troubleshooting network connection issues.

If you refer back to this previous post you can check out a small walkthrough regarding how to query a port using lsof and/or the proc commands. This script uses lsof also, but combines it with netstat to produce output in an easy to read format, while grabbing a little more information in the process. Assuming we call it portquery, it can be invoked like this:

host # ./portquery 22 <--- Let's just see what's going on with SSH

and it will produce output for you like the following. Note that it produces a formatted output block for every single process connected to a port. On a high-traffic machine, checking SSH might produce a few pages of output. This is what it looks like when it's run:

Port 22 Information :
Service = sshd
PID = 469
User = root
Protocol = TCP
Status = LISTEN
Port 22 Information :
Service = sshd
PID = 469
User = jimmy88
Protocol = TCP
Status = LISTEN


...and the list goes on to print out information blocks for every PID attached to that port. This script has been a great help for me not only in that it makes a manual process automatic, but also in that it's easy for other non-admins to read.

Here's hoping you have some use for it :)

Best Wishes,


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

#!/bin/ksh

#
# 2007 - Mike Golvach - eggi@comcast.net
#
# Usage: portquery [port number]
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

trap 'exit' 1 2 3 15
if [ $# -ne 1 ]
then
echo "Usage: $0 portNumber"
exit 1
fi

portnumber=$1

# -n keeps netstat from translating port numbers into service names, so the grep will match
/bin/netstat -an |grep -w "$portnumber" >/dev/null 2>&1

if [ $? -ne 0 ]
then
echo "Nothing's listening on - or using - port $portnumber"
exit 1
fi

# -P keeps lsof from translating port numbers into service names, so grep -w sees the number
/usr/local/bin/lsof -P 2>&1|grep -v "^lsof:"|grep -w "$portnumber" 2>&1|while read x
do
portinfo=`echo $x|awk '{print $1 " " $2 " " $3 " " $4 " " $5 " " $6 " " $7 " " $8 " " $9 " " $10}'`
echo "Port $portnumber Information :"
echo " Service = `echo $portinfo|awk '{print $1}'`"
echo " PID = `echo $portinfo|awk '{print $2}'`"
echo " User = `echo $portinfo|awk '{print $3}'`"
echo " Protocol = `echo $portinfo|awk '{print $8}'`"
echo " Status = `echo $portinfo|awk '{print $10}'|sed 's/(//'|sed 's/)//'`"
done



, Mike




Saturday, November 17, 2007

How to Find a Rogue Process That's Hogging a Port

Hey there,

Today's little tip can actually come in useful even if the information you're seeking isn't "mission critical" (which, by the way, ranks among one of my least favorite terms. If there's one thing positive I can say about where I work now, it's that they don't describe every problem, resolution or project as if we were engaged in war -- but that could be an entirely separate post ;).

I've actually been asked to figure out what process was running on what port more often for information's sake than to try and figure out why something was "wrong," but the same principles apply. The scenario is generally something like the following:

Internal customer Bob needs to start (or restart) an application, but it keeps crashing and getting errors about how it can't bind to a port. This port is necessarily vague, since, in my experience, it's very common to be asked to figure something out with little or no information. I consider myself lucky if I have a somewhat-specific description of the problem at the onset. As we all know, folks will sometimes just complain that "the server is broken." What does that mean? ;)

The troubleshooting process here is pretty simple and linear (perhaps more detail and information in a future post regarding similar issues, as any problem or situation can be fluid and not always follow the rules). In order to try and fix Bob's problem, we'll do the following:

1. Double check that the port (We'll use 1647 as a random example) is actually in use by running netstat.

netstat -an|grep 1647|grep LIST

you can leave out the final "grep LIST" if you just want to know if anything is going on on port 1647 at all. Generally the output to look for is in the local address column (Format is generally IP_ADDRESS:PORT - like 192.168.1.45:1647 or *:1647 - depending on your OS the colon may be a dot). Whether or not you're checking for a LISTENing process, information about a connection from your machine on any port to foreign port 1647 shouldn't concern you.
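
Just so you know what you're looking for, the sort of line you'd hope to see on a Linux box (dummied-up here, in keeping with tradition) looks something like this; the :1647 at the tail end of the local address column and the LISTEN state are the giveaways:

host # netstat -an|grep 1647|grep LIST
tcp        0      0 0.0.0.0:1647            0.0.0.0:*               LISTEN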

2. We're going to assume that you actually found that the port is either LISTENing, or actively connected to, on your local machine (if it isn't, your troubleshooting would likely take a much different turn at this point). Now we'll try to figure out what process is using that port.

If you have lsof installed on your machine, figuring this out is fairly simple. Just type:

lsof -i :1647

and you should get a dump of the list of processes (or single process) listening on port 1647 (Easily found under the PID column). They're probably all going to be the same, but, if not, take note of all of them.

3. Run something along the lines of:

ps -ef|grep PID

and, problem solved! You now know what process is listening on port 1647, and you'll probably end up having to hard kill it if Bob doesn't have any idea why it won't let go of the port using the standard methods associated with whatever program is using it.

But, sometimes, the last part isn't that simple, so:

4. What's that? lsof isn't installed on your machine? My first inclination is to recommend that you download it ;) Seriously, it's a valuable tool that you'll find a million uses for. But you can find out the process ID another way, just in case you can't get your hands on it and/or time is of the essence, etc.

In this instance, and we'll just assume the worst, you can use two commands called "ptree" and "pfiles" (these are standard on Solaris in /usr/proc/bin - may be located elsewhere on your OS of choice and/or named somewhat differently). Use the following command to just grab all the information possible and weed it down to the process using port 1647:

for x in `ptree -a | grep -v ptree | awk '{print $1}'`
do
pfiles $x 2>/dev/null|grep 1647
done


and you'll get the line of output that maps your PID to your port. The above is, admittedly, somewhat messy (not really messy, but you'll end up printing a lot of blank lines ;) Feel free to tailor it to your needs and make it more general (I explicitly used port 1647, but that should also be a variable if you want to create a little script to keep in your war chest).
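
By way of example, here's one way that little war-chest script might shape up. It's only a sketch (the script name is invented, and pfiles may live in /usr/proc/bin on older Solaris, so adjust your PATH as needed), but it takes the port as an argument and only prints output for PIDs that actually match:

#!/bin/ksh
#
# portpid - list processes whose open files mention a given port
#
# Usage: portpid portNumber
#

if [ $# -ne 1 ]
then
        echo "Usage: $0 portNumber"
        exit 1
fi

port=$1

for x in `ptree -a | grep -v ptree | awk '{print $1}'`
do
        match=`pfiles $x 2>/dev/null | grep -w "$port"`
        if [ -n "$match" ]
        then
                echo "PID $x:"
                echo "$match"
        fi
done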

Run your ps, as above, and now you should know what process is hogging that port and, in the process, making Bob's life miserable. If you cleanly kill that process, Bob should have one less thing to worry about and his program should be able to bind to the now-free port :)

, Mike