Today's post is a little bit of a walkthrough of using RedHat Linux strace to debug (and find the root cause of ) a system issue and a little bit of caution about how much information you should really share with application project managers if you don't want to be stuck supporting a hacked solution for longer than anyone should. As far as job security goes, it can't hurt, but certain situations really do require that folks upgrade their software to the vendor supported version, no matter how creative your solution ends up being.
Note: If you're looking for an in-depth examination of strace, this post isn't it. Not to turn any readers off, as we'll certainly be exploring that in a future post, but this is more of a walkthrough of a problem resolution involving strace rather than a dissertation on the command itself.
In this particular instance, we were working with a product (heretofore referred to as the Product or Product) that, to my knowledge, had just begun having "issues." This generally means to me that somebody did something they weren't supposed to or were trying to do something they weren't supposed to. I'm cynical, but I try to have the common courtesy to keep it to myself ;)
The Product was a client/server application running on the Java platform that had suddenly begun dropping connections from the application server to the backend database. I did a quick check of the /proc/****/status file (and, of course, gave a quick nod to "top" and "netstat -an"), confirming that they were dropping quite a few, like so:
host # cat /proc/14653/status | grep Threads
Threads: 311
host # top <-- Truncated to just show the top process, which was the "failing" one.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14653 user 16 0 1634m 1.1g 40m S 120 14.1 296:46.54 java
I found the "total time used" (listed under the TIME+ column) somewhat alarming since the process had only been up for about a half an hour! It couldn't possibly be correct.
The next thing I did, since I like to go to the guts when I get brought in on an emergency issue regarding a Product I've never used before, was to shut it down and restart it with strace; teeing the information off to a file in /tmp (since I like to watch the characters fly by), like so:
host # strace -vF ./product arguments 2>&1|tee /tmp/OUTPUT
Note that this could have also been accomplished by running the process, getting it's pid from the ps tables and adding the -p PID flag to the strace command in order to attach to the process after starting it. I prefer to start the process with strace, if I can, in case something critical happens during the initial load-up.
Long story short (It is Sunday - Our day of rest - after all ;), I found that the Product process was dumping tons of FUTEX_WAIT errors. At this point, I checked the version of the OS with uname for the only part I really cared about at that point:
x86-64
The machine was running RedHat Advanced Server 5 - 64 bit. The FUTEX_WAIT errors have to do with the implementation of Fast Userspace Mutex locking implemented in the stable version of the RedHat Linux kernel 2.6 and up. This raised a red flag, and caused me to ask the inevitable question: When did this start happening?
As it turns out, the Product had previously been hosted on another machine running RedHat Advanced Server 4, 32 bit. This made perfect sense, since the Mutex locking mechanism at the kernel level they "used" to run on was completely different, and not compatible, since the Product was built to use the old style Mutex locking mechanisms.
Futile attempts were made to add lame fixes, like:
LD_ASSUME_KERNEL=2.4.whatever (Apologies; I don't remember what it was off-hand since it didn't end up helping at all ;)
to the Product's startup scripts, but this didn't resolve the problem. The "actual" (and correct) resolution would have been to upgrade to the vendor-supported version of the Product. The manufacturer of the Product actually had a more recent version available that was built specifically to address this issue and run on RedHat AS 5 - 64 bit.
And, here's the kicker (as per the cautionary message at the beginning of this post): As some of you may know, a bug in strace (It's actually more of a "feature" since it almost "has" to happen) makes it slow down a process slightly due to all the extra work it has to do to parse all the system calls, opens, reads, writes, errors, library loads, etc.
strace slowed the process down enough that the FUTEX_WAIT errors stopped occurring and the Product began functioning normally again. We ended up adding it to the init scripts after much debate over spending a few bucks to avoid possible total confusion in the future.
In retrospect, I'm glad that I was able to help identify the issue and find the root cause. On the other hand, I wish there was some way I could have avoided identifying the problem without providing a cheap and quick "out" for the end user. I'm sure, in the future, someone else will have to work on this Product (maybe a problem with the version of Java or something more insidious) and they'll be given about as much information as I was. Documentation was laid down, but I don't expect anyone will read it. Sometimes, the solution is as bad as the problem, in a way or two ;)
Enjoy your Sunday :)
, Mike
linux unix internet technology