Sunday, April 6, 2008

Troubleshooting To Find The Bottleneck On Unix and Linux

Hey there,

Today, we're going to follow up on yesterday's post regarding the definitions of swapping and paging on Linux and Unix (and our humble follow-up post clarifying those definitions) with a tutorial on basic troubleshooting. We'll take our primary examples from Solaris and point out the differences, where they exist, in extracting the same information from Linux. Our only really unique bent is that we'll be coming at the issue from the angle of paging and swapping activity and going from there, assuming no knowledge of what could possibly be causing the problem. All we know is that we have a server that's "really slow," which is "not good" ;)

The first thing we'll do is hop on the machine and take a look at vmstat. You'll note that, in each example, I'm zeroing out the values we don't need to look at to make the individual examples easier to read.

This is what we see:

host # vmstat 1 5 <--- We're running vmstat to give us output every 1 second, 5 times. We've removed the first line from the output below, because it is always a "summary line" (averaging all recorded activity since the last reboot) and can sometimes be misleading.

kthr     memory    page                 disk        faults   cpu
 r  b  w swap free re mf pi po fr de sr s6 sd sd -- in sy cs us sy id
 - - - - s u m m a r y  l i n e  r e m o v e d  o n  p u r p o s e - - - -
67  0  0    0    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  3
56  0  0    0    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  9
57  0  0    0    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 12
64  0  0    0    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  6


This condition would indicate that we've probably got a problem with our CPU (remember, all the zeroed-out values are assumed to be "normal" for this server). The "run queue" (the first column, "r") is very high, which indicates that an average of approximately 61 processes are waiting for CPU execution time at any given second. Couple this with the fact that the CPU "idle time" (the last column, "id") is very low, along with the fact that there's no indication of any paging or swapping activity at all (which will almost never happen, really), and it becomes fairly obvious that the bottleneck lies with the CPU.
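If you want to chase down exactly who's hogging the CPU at that point, a quick follow-up might look something like the below (prstat is Solaris-only; the ps line should work on most Linux flavors as well, though the exact output format can vary):

host # prstat -s cpu -n 5 <--- Solaris: show the top 5 processes, sorted by CPU usage
host # ps -e -o pcpu,pid,user,args | sort -rn | head <--- crude, but portable: biggest CPU consumers at the top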

No special options to vmstat are required to see this information on Linux, but the "id" column generally sits second in from the right (just before "wa"; newer versions add an "st" column after that, pushing "id" to third from the right).
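For reference, the Linux vmstat header looks something like this (just an approximation from memory; the exact layout shifts a bit between procps versions):

host # vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa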

Now, if we change our original Solaris output slightly, the bottleneck most probably changes to our system's memory:

kthr     memory    page                 disk        faults   cpu
 r  b  w swap free re mf pi po fr de sr s6 sd sd -- in sy cs us sy id
 - - - - s u m m a r y  l i n e  r e m o v e d  o n  p u r p o s e - - - -
67  0  0    0    0 12  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0 77
56  0  0    0    0 15  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0 67
57  0  0    0    0 10  0  0  0  0  0 14  0  0  0  0  0  0  0  0  0 78
64  0  0    0    0  8  0  0  0  0  0 11  0  0  0  0  0  0  0  0  0 85


Notice that, in this example, pretty much everything is the same, but the "scan rate" (the "sr" column) and "page reclaim" (the "re" column) values have increased dramatically. Generally, numbers like the ones I'm posting here wouldn't make my pulse change, but, for the sake of argument, we'll assume that the "scan rate" and "page reclaim" rate have been flatlining at 0 ever since this server launched. An increase in the "scan rate" indicates that the system is paging more heavily; the page scanner is burning CPU cycles hunting for pages it can free, and the system is spending a lot of time shuttling pages between memory and disk. One might assume that this situation indicates a problem with the CPU, but the "idle time" doesn't agree with that assumption. It also helps to keep in mind that paging picks up when the system runs out of real physical memory to read from, and write to, and has to fall back on "disk based" virtual memory, which is much slower to access. The additional "page reclaim" activity going on bolsters that reading. Adding physical memory to this server will probably fix the bottleneck.
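On Solaris, if you want a little more evidence before ordering RAM, a few quick follow-ups might look something like this (all standard Solaris tools, though which flags you prefer is a matter of taste):

host # swap -s <--- quick summary of swap space allocated, reserved and available
host # vmstat -p 1 5 <--- paging activity broken out by executable, anonymous and filesystem pages
host # prstat -s rss -n 5 <--- the top 5 processes, sorted by resident memory size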

On Linux, in order to grab comparable "scan rate" and "page reclaim" values, you'll need to take a look at /proc/vmstat. The pgscan entries cover the scanning side (in the real output, each of these names is followed by a running count, which has been trimmed here):

host # cat /proc/vmstat|grep pgscan
pgscan_kswapd_high
<--- These "kswapd" statistics count pages scanned by the kswapd daemon doing background reclaim - the closest equivalent to a generic "scan rate"
pgscan_kswapd_normal
pgscan_kswapd_dma32
pgscan_kswapd_dma
pgscan_direct_high
<--- This is where the "direct reclaim" statistics begin - pages scanned by processes that had to hunt for free memory themselves, which is an even stronger sign of memory pressure
pgscan_direct_normal
pgscan_direct_dma32
pgscan_direct_dma


On the reclaim side, you can also check, specifically, for pages reclaimed by "inode stealing":

host # cat /proc/vmstat|grep pginodesteal

There may also be other variations, depending on the flavor of Linux you're running. Catting /proc/vmstat and doing a:

host # cat /proc/vmstat|egrep 'scan|steal'

should get you most, if not all, of them.
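One wrinkle worth remembering: the numbers in /proc/vmstat are running totals since boot, not per-second rates like the Solaris vmstat columns. To see whether scanning and reclaiming are happening right now, you can sample the file twice and compare; a rough sketch (the /tmp file names are just examples):

host # cat /proc/vmstat|egrep 'scan|steal' > /tmp/vm.1 <--- first sample
host # sleep 10 <--- wait a bit...
host # cat /proc/vmstat|egrep 'scan|steal' > /tmp/vm.2 <--- second sample
host # diff /tmp/vm.1 /tmp/vm.2 <--- any counter that shows up here is actively climbing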

And, in our final permutation, the bottleneck becomes the disk:

kthr     memory    page                 disk        faults   cpu
 r  b  w swap free re mf pi po fr de sr s6 sd sd -- in sy cs us sy id
 - - - - s u m m a r y  l i n e  r e m o v e d  o n  p u r p o s e - - - -
 0 67  0    0    0  0  0  0  0  0  0  0 33  0  0  0  0  0  0  0  0 77
 0 56  0    0    0  0  0  0  0  0  0  0 98  0  0  0  0  0  0  0  0 67
 0 57  0    0    0  0  0  0  0  0  0  0 89  0  0  0  0  0  0  0  0 78
 0 64  0    0    0  0  0  0  0  0  0  0 94  0  0  0  0  0  0  0  0 85


Again, these numbers aren't crazy, but now we've got an entirely different situation on our hands. We can see, from this example, that our only disk (the "s6" column) is now doing some heavy read/write activity (yes, I know the numbers aren't really that big; this is all relative ;) This might not be an issue on its own, but the server's "slow." Couple that with a drastic increase in processes blocked waiting for I/O (the "b" column), along with a lack of significant swapping and/or paging activity, and the disk begins to look like the favorite. The "b" column can generally only be considered a very vague indicator of the actual issue, since it reports on processes blocked waiting for resources no matter what those resources are (the CPU could be overloaded, the memory could be exhausted; even the network could be down). However, when we combine it with the fact that disk read/write activity has increased, the disk becomes our most probable bottleneck on the system.
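On Solaris, iostat is usually the next stop once the disk looks suspicious; something along these lines (the exact flags are a matter of taste):

host # iostat -xn 1 5 <--- extended per-device statistics; keep an eye on the %b (percent busy) and asvc_t (average service time) columns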

On Linux, using either "vmstat -d" or "vmstat -p ARGUMENT" (with ARGUMENT being a specific partition) will get you the disk statistics.
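For instance (hda1 below is just a stand-in for whatever partition you care about, and iostat generally requires the sysstat package):

host # vmstat -d <--- per-disk totals for reads, writes and I/O time
host # vmstat -p hda1 <--- reads/writes for one specific partition
host # iostat -x 1 5 <--- extended per-device statistics, if sysstat is installed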

These examples have been fairly stark. Always keep in mind that, outside of this vacuum, it's good practice to keep regular tabs on your servers. For instance, you might run vmstat a few times a day (or a few times an hour) and record the results. When you get into habits like that, you're much better prepared when a problem does arise, because you'll have a good "baseline" to refer to when dealing with abnormal behaviour on any of your systems. Everyone will notice the biggest number in the vmstat output when they're frantically trying to figure out what's wrong, but (armed with your "baseline" knowledge) you'll know whether that number has always been gigantic, in which case you can ignore it and go on to fix the actual problem :)
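If you want to start building that baseline, even something as simple as a cron job that appends timestamped vmstat samples to a log will do the trick; here's a minimal sketch (the schedule and log location are just examples, so adjust to taste):

0,15,30,45 * * * * (date; vmstat 1 5) >> /var/tmp/vmstat.baseline 2>&1 <--- crontab entry: grab 5 seconds of vmstat output every 15 minutes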

Hopefully this little exercise has been helpful to you, even if only in the tiniest way :)

Cheers,

Mike