Thursday, November 6, 2008

More Quick Ways To Find CPU Bottlenecks On Linux

Hey there,

Yesterday, we took a look at some useful commands to help identify memory bottlenecks in Linux. More specifically, we were looking at SUSE 9.x. We're going to use the same Linux version today (for our examples), although, again, much of this stuff translates fairly simply to other distros. This post differs from yesterday's in that we'll be focusing on specific CPU-related commands you can use, in a time of crisis (or, perhaps, just a long, drawn-out eternity of soul-crushing boredom ;), to determine if the CPU(s) on your machine are at the fore of whatever problems your system is having.

On a completely unrelated note, last night's election was, indeed, historic and satisfying. Of course, I didn't get to sleep until 3am because my wife was waiting for Indiana to finish counting its votes and hoping for an Obama landslide. When I snapped out of the temporary stupor that substituted for sleep this morning, I noticed that Indiana was still a "partially yellow" state. Hopefully, they'll get the votes counted before I publish this post. If not... I'll just thank God that my wife doesn't read this blog. She's a wonderful woman, but (like most people who've known me for a long time) probably more than willing to snatch the life right out from under me ;)

And here we go. Today's hit list for CPU testing on SUSE:

1. top. This command comes in first again. Explaining it again isn't necessary. When you looked it over to find your memory bottleneck you, no doubt, noticed the %CPU column and the CPU summary at the top. About the only thing special, with regard to CPU reporting in top, is how it deals with a multiple-CPU system. You can generally flip the summary between the regular output (all CPUs' statistics combined) and per-CPU stats by pressing "1". The capital "I" toggles "Irix mode", which changes how the per-process %CPU figure is scaled on a multiple-CPU box: with Irix mode on, a busy multi-threaded process can show more than 100%; with it off, the figure is divided by the number of CPUs. (You may see a message indicating that "Irix mode" is either off or on. On some builds, I've seen this work but give no verbose indication of the change.)

2. host # more /proc/cpuinfo

General output will look something like the listing below. On most newer (and just slightly older) machines, /proc/cpuinfo lists more CPUs than you "physically" have. That is to say, hyperthreaded/multiple-core CPUs will not appear in this file as the single physical entity that they are; each logical CPU gets its own "processor" entry. Of course, your situation may vary, but this file should (at the very least) give you a feel for whether you have a bad CPU. In a situation where you have 4 physical CPUs (hyperthreading to simulate 8 CPUs), you can get a good indication of whether the problem you're facing (we'll just assume you're facing a problem ;) is of a physical nature. If 2 virtual CPUs are down (in proper sequence), you probably need some new parts :) Comparing the "physical id" value with the "processor" value (and checking whether "siblings" is greater than 1) is usually a good indication of whether or not your system is using hyperthreading or any other virtual enhancements. Odds are, you'll probably know this information before you ever have to look at this file.

processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) MP CPU 2.5GHz
stepping : 5
cpu MHz : 2495.259
cache size : 512 KB
physical id : 0
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 4980.73
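
If eyeballing that listing on a box with a lot of entries gets old, the comparison can be scripted. What follows is only a minimal sketch: cpu_summary is a made-up helper name (not a standard command), and it assumes your /proc/cpuinfo carries "physical id" lines like the sample above. Some architectures and virtual machines leave them out, in which case the physical count simply comes back as 0.

```shell
#!/bin/sh
# cpu_summary is a hypothetical helper, not a standard command.
# It counts "processor" entries (logical CPUs) and distinct "physical id"
# values (physical packages) in a cpuinfo-style file.
cpu_summary() {
    # $1: path to a cpuinfo-style file (defaults to /proc/cpuinfo)
    awk -F':' '
        {
            # Strip whitespace around the key and the value.
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        key == "processor"   { logical++ }
        key == "physical id" { if (!(val in seen)) { seen[val] = 1; physical++ } }
        END { printf "logical=%d physical=%d\n", logical, physical }
    ' "${1:-/proc/cpuinfo}"
}
```

Run it as cpu_summary with no arguments to read /proc/cpuinfo. A logical count higher than the physical count suggests hyperthreading or multiple cores; a logical count lower than what the machine should have suggests a CPU has gone missing.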

3. vmstat 2 - another return visitor from yesterday's post on memory mangle-ment ;)

This will run vmstat every 2 seconds, ad infinitum. You should check the "id" (idle percentage), "us" (percentage of CPU resources dedicated to user processes) and "sy" (percentage of CPU resources dedicated to kernel processes) columns first. If the idle percentage is low, knowing whether user processes (like a program running on your system) or the kernel (basically, the operating system and all its built-in facilities) are taking up all the CPU resources can get you pointed in the right direction early on.

You'll also want to consider the "wa" (I/O wait, although a high value here does not necessarily mean that the CPU itself is the problem), "in" (interrupts per second) and "cs" (context switches per second) columns. High activity in any of these columns could indicate overuse of the CPU. (Note that vmstat, although it can tell you a lot about what's going on with your system, cannot pinpoint the particular application or system setting that may be causing the events it reports!)
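
If you'd rather not stare at a scrolling vmstat, you can filter it down to the samples worth worrying about. This is a rough sketch, not a polished tool: flag_busy is a hypothetical helper name, and the default threshold of 10% idle is just an assumption to tune for your own box. It finds the columns by their header names instead of by position, since the layout varies a bit between vmstat versions.

```shell
#!/bin/sh
# flag_busy is a hypothetical helper: it reads vmstat output on stdin and
# prints only the samples whose idle percentage is below a threshold.
flag_busy() {
    # $1: idle threshold in percent (assumed default: 10)
    awk -v limit="${1:-10}" '
        # The column-name row starts with "r" and "b"; record where each column sits.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Ignore the banner row and anything that is not a numeric sample.
        !("id" in col) || $1 !~ /^[0-9]+$/ { next }
        $(col["id"]) + 0 < limit + 0 {
            wa = ("wa" in col) ? $(col["wa"]) : "n/a"
            printf "busy: us=%s sy=%s id=%s wa=%s\n", $(col["us"]), $(col["sy"]), $(col["id"]), wa
        }
    '
}
```

Something like "vmstat 2 30 | flag_busy 10" would then watch the system for a minute and print only the samples with less than 10% idle. Keep in mind that the first sample vmstat prints is an average since boot, so don't read too much into that one.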

4. If you do notice a high number of interrupts in your vmstat output, be sure to check out the contents of /proc/interrupts. Check it, for instance, every 10 seconds for a few minutes. Within that amount of time, the counters that are climbing fastest in /proc/interrupts may point you directly to the culprit. Even if they don't hand you the answer outright, they should narrow things down while you hunt for the real problem, and give you a second source to verify it against :)

Note that, as a rule, lots of interrupts and context switches (especially in the thousands per second, and well above what the machine shows when it's healthy) are a fairly good indicator of CPU load reaching maximum capacity.
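
Comparing /proc/interrupts by eye every 10 seconds gets tedious fast, so here is one way to let the machine do the subtraction. Treat it as a sketch under a few assumptions: irq_delta is a hypothetical helper name, it expects two saved copies of /proc/interrupts (oldest first), and it assumes the usual layout of a CPU0/CPU1/... header line followed by one row per interrupt source.

```shell
#!/bin/sh
# irq_delta is a hypothetical helper: given two snapshots of /proc/interrupts,
# it prints how much each interrupt counter grew, busiest source first.
irq_delta() {
    # $1: older snapshot, $2: newer snapshot
    awk '
        # The first line of each file is the CPUn header; its width is the CPU count.
        FNR == 1 { file++; ncpu = NF; next }
        {
            label = $1; sum = 0; desc = ""
            # Add up the per-CPU counters; whatever follows is the description.
            for (i = 2; i <= NF; i++) {
                if (i <= ncpu + 1 && $i ~ /^[0-9]+$/) sum += $i
                else desc = desc (desc == "" ? "" : " ") $i
            }
            if (file == 1) { before[label] = sum; next }
            delta = sum - before[label]
            if (delta > 0) printf "%d %s%s\n", delta, label, (desc == "" ? "" : " " desc)
        }
    ' "$1" "$2" | sort -rn
}
```

To use it, save one copy with "cat /proc/interrupts > /tmp/irq.1", wait 10 seconds, save a second copy to /tmp/irq.2 and run "irq_delta /tmp/irq.1 /tmp/irq.2" (the file names are just examples). The device at the top of the list is the one generating the most interrupts during that window.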

5. Check your standard log files in /var/log. If you find a ton of messages there (or even just a few), they can provide invaluable clues. Combining this additional information with the output of vmstat, top and (possibly) the contents of /proc/cpuinfo and /proc/interrupts, should paint a fairly vivid picture and allow you to assess, quickly, whether or not you need to focus more effort on reducing CPU load or, possibly, replacing a bad CPU or two.
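
For a quick first pass over the logs, counting matching lines per file can tell you where to start reading. The helper below is a minimal sketch: count_log_hits is a hypothetical name, and both the search pattern and the list of files are yours to choose. Nothing here knows which messages your particular hardware or kernel will actually write.

```shell
#!/bin/sh
# count_log_hits is a hypothetical helper: it prints "<matching lines> <file>"
# for every readable file it is given.
count_log_hits() {
    # $1: extended regular expression; remaining arguments: log files to search
    pattern=$1
    shift
    for f in "$@"; do
        # Skip files that are missing or unreadable (you may need root for some logs).
        [ -r "$f" ] || continue
        # grep -c counts matching lines, -i ignores case, -E allows alternation with |.
        printf '%s %s\n' "$(grep -icE -- "$pattern" "$f")" "$f"
    done
}
```

An invocation might look like "count_log_hits 'machine check|cpu' /var/log/messages /var/log/warn". That pattern is only an illustrative guess at what CPU trouble might look like in your logs, so adjust it once you've seen what your system actually reports.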

Once again, I wish you a good night and hope this little introduction to CPU bottleneck troubleshooting has been "accessible" or, at least, somewhat helpful to you :)


, Mike
