Tuesday, May 13, 2008

Killing Zombie Processes In Linux And Unix

Greetings,

Today's post is going to deal with "zombie" processes. These are processes to which the definition of a process only loosely applies.

A zombie process is a child process that has already exited, but whose parent has not yet collected its exit status. Normally the parent runs some sort of "wait()" call to receive notification that the child process has exited. If the parent never makes that call (because it lost track of the child, is hung, or was just written carelessly), the system has to hang on to the child's process table entry so the exit status isn't lost. The child process exits normally, but its entry lingers in the process table, and thus is a zombie process born :)
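If you'd like to see one being born on a test box, here's a minimal sketch. The function name "make_zombie", the ZOMBIE_PARENT variable and the PID-file argument are all made up for this illustration, and it assumes a POSIX-style shell:

```shell
# make_zombie PIDFILE (hypothetical helper): leave one zombie behind for
# roughly 30 seconds. The zombie's PID is written to PIDFILE and the PID of
# its neglectful parent is stored in ZOMBIE_PARENT.
make_zombie() {
    (
        sleep 1 &           # the child: exits after one second
        echo $! > "$1"      # record the child's PID
        exec sleep 30       # the parent becomes "sleep", which never calls wait()
    ) >/dev/null 2>&1 &
    ZOMBIE_PARENT=$!
}
```

Run "make_zombie /tmp/zpid", give it a couple of seconds, and the PID in /tmp/zpid should show up in "ps -el" in state Z, with ZOMBIE_PARENT in the PPID column. Killing ZOMBIE_PARENT (or just waiting out the 30 seconds) cleans it up.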

There are a number of steps to take, from simplest to most obscure, to get rid of zombie processes. And then there's what to do if none of that seems to work. Here we go :)

1. First, identify the fact that you have zombie processes running on your system (you may not notice, and there's a good reason why, which we'll address near the end of this post). You can do this on most major brands of Unix and Linux by running:

host # ps -el|grep Z <--- The -l flag to ps will include the "state" column (S). The zombie state is represented by a capital Z. The header line shows up in the output, too, because "SZ" contains a Z.

On Solaris 9:

host # ps -el|grep Z

F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
0 Z 0 3038 1 0 0 - - 0 - ? 0:00
0 Z 0 19769 2966 0 0 - - 0 - ? 0:00


On SUSE Linux 9:

/home/ymdg001# ps -el|grep Z
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
0 Z 0 476 9874 0 0 - - 0 - ? 0:00
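Since "grep Z" matches any line with a capital Z anywhere in it (the header, or a command with a Z in its name), a slightly stricter filter can help. This is only a sketch: "list_zombies" is a made-up name, and it assumes the column layout shown in the examples above (state in column 2, PID in column 4, PPID in column 5):

```shell
# list_zombies (hypothetical helper): read "ps -el" style output on stdin and
# print "PID PPID" for every row whose state column (field 2) is exactly Z.
# Assumes the F S UID PID PPID ... layout shown above.
list_zombies() {
    awk '$2 == "Z" { print $4, $5 }'
}

# Typical use:
#   ps -el | list_zombies
```

Having the PPID right next to the PID is handy, because the parent is what you'll be going after in the next steps.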


2. Now that you know you have zombie processes running, the first, and easiest, thing you can do to kill them is to try and "assassinate" the actual defunct (zombie) process. For the process in the SUSE example above, you can always try:

host # kill -9 476

It probably won't work, since the zombie has already exited and there's nothing left to act on the signal, but it's worth a shot. Sometimes it does and your troubles are over just like that :)

3. Next, you should try to kill the parent process. This may or may not be possible for you. For instance, the parent process may be a process that you "need" to have running. It may also be a process that the Operating System "needs" to have running (like "init" - process 1, shown in the Solaris example above, under the PPID column).

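Killing the parent process (if possible) will almost always work to get rid of a zombie process. Once the parent is gone, the zombie is handed over to init, which collects its exit status and clears the entry.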

Please "never" try to kill init (process 1). If you're successful, your machine will go down hard and fast!
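Here's a rough sketch of step 3 with a safety catch built in. The function names "parent_of" and "term_parent_of" are made up for this post; the only real commands involved are "ps -o ppid= -p" and "kill":

```shell
# parent_of PID (hypothetical helper): print the parent PID of PID.
# Prints nothing if PID does not exist.
parent_of() {
    ps -o ppid= -p "$1" | tr -d ' '
}

# term_parent_of PID (hypothetical helper): send SIGTERM to the parent of
# PID, but refuse if that parent is init (PID 1) or the kernel (PID 0).
term_parent_of() {
    ppid=$(parent_of "$1")
    if [ -z "$ppid" ]; then
        echo "no such process: $1" >&2
        return 1
    fi
    if [ "$ppid" -le 1 ]; then
        echo "parent of $1 is PID $ppid; refusing to signal it" >&2
        return 2
    fi
    kill -TERM "$ppid"
}
```

For the SUSE example above, "term_parent_of 476" would send a SIGTERM to process 9874. For the Solaris zombie whose PPID is 1, it would refuse and leave init alone.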

4. Assuming none of the above worked, some common wisdom says you should just give up (and for good reason, which we'll get to very soon ;). However, you can try killing the zombie process and/or the parent process using signals other than SIGKILL (or -9). I've seen it work more than a few times. Different programs trap, and/or handle, different signals in different ways. If your zombie doesn't go away when you execute a "kill -9" against it, try a simple "kill" (which is, technically, "kill -15" or SIGTERM).

You can try to kill the process with any signal you want. I generally try signals 1 - 15 and then SIGUSR1 and SIGUSR2, just in case they're handled differently by that particular program on that particular system. You'd be surprised how many zombies you can whack with a SIGHUP or SIGINT. Sending a SIGCHLD (or SIGCLD, the System V name for the same signal) to the parent is a good one to try, as well, since that's the signal that tells a parent one of its children has exited. Sometimes your chosen method won't make "textbook sense" but it will work from time to time :)
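If you get tired of typing signals one at a time, something along these lines can walk through a few of them for you. It's only a sketch: "try_signals" is a made-up name, the list of signals is my own pick rather than a complete one, and it's best aimed at the parent, since that's the process whose signal handling matters:

```shell
# try_signals PID (hypothetical helper): send a series of signals to PID,
# mildest first, and stop as soon as PID is gone from the process table.
try_signals() {
    for sig in HUP INT TERM USR1 USR2 KILL; do
        # If the signal can't be sent, the process is already gone.
        kill -s "$sig" "$1" 2>/dev/null || return 0
        sleep 1    # give the process a moment to react
        if ! kill -0 "$1" 2>/dev/null; then
            echo "PID $1 went away after SIG$sig"
            return 0
        fi
    done
    echo "PID $1 is still listed" >&2
    return 1
}
```

Keep in mind that "kill -0" only checks whether the PID is still in the process table, so a process that has died but hasn't been collected yet will still count as "listed".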

You can find a handy list of signals to try in our old post on translating signal names to numbers and vice versa.

5. And the point I've been alluding to throughout this entire post.

What do you do if your zombie process just won't die, you can't kill the parent and/or you're otherwise stuck?

The answer is: nothing.

Here's a brief explanation why: Even though zombie processes alarm most casual users of Unix and/or Linux, and they can make the process table look ugly with all those "defunct" messages scattered in between everything else, a zombie process lives up to its name in more ways than the sense defined above. It literally is like the somewhat-living dead. Although the proc table (and filesystem) still has a slot reserved to record it, the process has already exited and has given back its memory and everything else it was using. Apart from that one slot, it takes up none of your kernel or system space and is only a minor nuisance since "times" keeps track of its time (If you're a fly, you'll notice the 0:00 slow-down ;)

6. But WAIT!

There's more... (I'm starting to sound like a pitch man ;). Here's one last thing you can try if that ps entry for your zombie process is really bugging you: Once the zombie has totally disconnected from its parent process, you can use the "wait" command to try to make it go away. (Strictly speaking, the shell's "wait" only collects children of the shell you run it from, so your mileage may vary.) For example:

host # ps -el|grep Z
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
0 Z 0 3038 1 0 0 - - 0 - ? 0:00
host # id
uid=0(root) gid=0(root)
host # wait 3038


...and when that returns (I'd recommend that you run this with "&" to background it - e.g. "wait 3038 &")

host # ps -el|grep Z
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD


It's gone :)
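For the curious, here's a small sketch of what "wait" does when the process really is a child of your current shell: it collects the child's exit status, which is exactly the step a zombie's parent has skipped. The function name "reap_child_demo" and the exit status 7 are made up for the example:

```shell
# reap_child_demo (hypothetical helper): start a child that exits with a
# known status, then collect that status with the shell's built-in wait.
reap_child_demo() {
    sh -c 'exit 7' &
    wait $!
    echo "child exit status: $?"
}
```

One caveat: the shell's "wait" can only do this for children of the shell you run it from. In bash, for instance, handing it some other process's PID just gets you a complaint that the pid is not a child of this shell.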

In any event, hopefully, after reading this, you'll no longer worry about zombies :)

Cheers,

Mike