Friday, August 8, 2008

Using Grep To Streamline Your Shell And Command Line Scripting

Good [insert part of day here] ;)

For today's post, we're going to look at using grep to reduce script-bloat. Of course, we won't be covering every possible way this can be done, because I have yet to finish perfecting the art of using editor tricks to reinforce bad habits. The vi editor's yank (yy), put (p), named double-quote (") buffers and dot (.) repeat command make it very easy for me to write long and unwieldy code when I don't feel like thinking about how I could fit 40 lines into 3 ;) The downside is that it's a royal pain to go back and modify. Not only do I have to page back and forth across 3 screens to check the code above and below my "text dump," but I also have to fix the same problem 15 times if I made a mistake in one instance of my code and then just mass-copied and slightly modified it. Of course, you can always use the substitute (:s with the g flag) command, and its friends, to make sweeping edits, but you get my drift. Unrolled, copy-and-pasted code generally isn't noticeably more or less efficient than a tight loop (the shell ends up executing the same commands, one after another, either way), but it can be a huge and confusing time-waster for all the humans who have to deal with it later, the original author included.

In any event, I have managed to digress, yet again ;)

Today, we're going to look at a few ways we can use grep to clean up our shell scripting and make it look (and, most of the time, actually "be") more efficient. The solutions presented in this post can be one-upped by any number of other standard Unix/Linux commands, but (for the sake of this post) we're going to assume that "grep" is the be-all-end-all of input and output processing. Suspend your disbelief, for as long as it takes to read this post, and assume that the world is flat, grep's it and that's that ;)

On to our examples!

1. Generating line counts: This is one I see a lot (and have done a'plenty). Most Unix/Linux admins and users know of the wc command and get used to using it to determine the number of lines in a file like this:

host # wc -l FILE
17 FILE


which is perfectly acceptable. Sometimes though, since the command comes so easily to mind, it gets used inappropriately or, at least, in places where using grep, by itself, would make more sense. For instance, consider this relatively normal looking, and perfectly serviceable, line of code:

host # cat FILE|grep done|wc -l
2


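And, oh yes, we're ignoring the leading spaces that some versions of wc put before the output for now :) This count of "how many lines in FILE contain the word 'done'" (grep counts matching lines, not every individual occurrence) can be pared down very easily. Since grep can count for itself, and read the file for itself, both the wc and the cat can go:

host # grep -c done FILE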
2


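and, even better, we can use grep to get a basic "wc -l" line count on any file by using it like this (the "." matches any single character, so completely empty lines don't get counted; if you need those too, give grep an empty pattern, as in grep -c "" FILE):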

host # grep -c . FILE
17


It's a slightly longer command than the "wc" version, but it does save you from having to strip out the padded blank space (and the trailing file name) if you want to assign your output to a variable, or use it "in context" within another command string as-is :)
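
As a rough sketch of the variable-assignment case, here's how the counts behave on a throwaway file. The file contents and the variable names are made up purely for illustration:

```shell
#!/bin/sh
# Hypothetical sample file: four lines, one of them empty, two containing "done".
sample=$(mktemp)
printf 'start\ndone\n\nall done\n' > "$sample"

# With a single file argument, grep -c prints a bare number, so there is
# nothing to trim before using it.
done_lines=$(grep -c done "$sample")

# An empty pattern matches every line, empty ones included (like wc -l).
all_lines=$(grep -c '' "$sample")

# A dot needs at least one character to match, so the empty line is skipped.
nonempty_lines=$(grep -c . "$sample")

echo "done=$done_lines all=$all_lines nonempty=$nonempty_lines"
rm -f "$sample"
```

On that sample, this should print done=2 all=4 nonempty=3, and each variable holds nothing but the number.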

2. Ensuring that your process checking is as accurate as possible: We actually looked at this back in October of 2007 in a post on how to keep grep out of your grep output, but the same basic principle applies here. For instance, if you've written a wrapper script that checks a process every 5 minutes and this process must be running at all times, you might make the core of your script (the part you're hoping won't return 0) something like this:

host # ps -ef|grep program|wc -l
2


but, you see the same problem you had in our first bullet, so you'd crop that back to:

host # ps -ef|grep -c program
2


But, in this case, we still have an issue, because only 1 instance of your "program" should be running and there appear to be two. Actually, if we peel back the veneer, there "is" only one instance running, but grep is ending up in the output (and, even worse, you can't depend on grep to "always" show up in its own output, which means our result may be 1 or 2 with everything being OK):

host # ps -ef|grep program
user1 2692 8747 0 13:33:52 pts/11 0:00 /bin/sh ./program
user1 3269 8747 0 13:37:05 pts/11 0:00 grep program


So, now we can use another trick to make this work correctly every time. A lot of times you may see code like:

host # ps -ef|grep program|grep -v grep|wc -l

but this is quite a bit longer now, and we also can't use "grep -c" on the first grep, because the following "grep -v grep" would then be filtering a bare number rather than the ps lines. And, without that extra filter, since we can't count on "grep" always showing up in its own output, we couldn't predicate success or failure on any specific number (in this case 1 or 2). Plus, if we tried to, the code would get even more lengthy and confusing.

Instead, we can use a regular expression bracket expression "[]" (quoted, so that the shell leaves it alone and grep gets to interpret it) to ensure that (if only 1 instance of "program" is supposed to be running) we'll always only get back the number "1" when things are going well:

host # ps -ef|grep -c "[p]rogram"
1


and that's it. The reason this works is that the shell, thanks to the quotes, passes the string "[p]rogram" to grep untouched, and grep reads the brackets as a bracket expression (which, in our case, contains a set of just one letter). So, grep is scanning the ps output for "program," but the brackets still appear, literally, in the ps output for the grep command itself. Therefore, it can never match its own process, because ps is showing "grep [p]rogram" (with the left and right brackets ([]) being actual left and right bracket characters), and the text "[p]rogram" does not contain the string "program".
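
To see the difference without depending on whatever happens to be running, here's a minimal sketch that feeds grep some canned lines standing in for ps -ef output. The PIDs, times and variable names are assumptions for the sake of the example:

```shell
#!/bin/sh
# Canned lines standing in for ps -ef output (PIDs and times are made up).
prog_line='user1 2692 8747 0 13:33:52 pts/11 0:00 /bin/sh ./program'
plain_grep_line='user1 3269 8747 0 13:37:05 pts/11 0:00 grep program'
bracket_grep_line='user1 3270 8747 0 13:37:05 pts/11 0:00 grep [p]rogram'

# Plain pattern: the grep process line contains "program" too, so it is counted.
plain_count=$(printf '%s\n' "$prog_line" "$plain_grep_line" | grep -c 'program')

# Bracketed pattern: the regex still means "program", but the grep process line
# now reads "[p]rogram", which does not contain the substring "program".
bracket_count=$(printf '%s\n' "$prog_line" "$bracket_grep_line" | grep -c '[p]rogram')

echo "plain=$plain_count bracket=$bracket_count"
```

This should print plain=2 bracket=1: the same two kinds of lines go in, but only the bracketed pattern leaves grep's own line out of the count.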

Here's what the command output looks like in the working version above:

host # ps -ef|grep "[p]rogram"
user1 2692 8747 0 13:33:52 pts/11 0:00 /bin/sh ./program


and modified to look for "[g]rep" so we can see the grep and note the difference:

host # ps -ef|grep "[g]rep"
user1 5864 8747 0 13:47:16 pts/11 0:00 grep [g]rep
<-- Note that the pattern is matching the command name "grep" here, and not the "[g]rep" argument ;)
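
For the wrapper script mentioned at the start of this section, the core check might be sketched along these lines. The function name check_program and the status words are hypothetical, and the listing is canned so the result is repeatable; a real wrapper would pipe ps -ef into the function instead:

```shell
#!/bin/sh
# check_program PATTERN: reads a ps-style listing on stdin and reports a status.
# The function name and the status words are hypothetical.
check_program() {
    # grep -c still prints 0 (and exits non-zero) when nothing matches;
    # only the printed count is used here.
    count=$(grep -c "$1")
    case "$count" in
        0) echo "DOWN" ;;
        1) echo "OK" ;;
        *) echo "TOO MANY ($count)" ;;
    esac
}

# Canned listings (made-up PIDs). A real wrapper would run something like:
#   ps -ef | check_program '[p]rogram'
running='user1 2692 8747 0 13:33:52 pts/11 0:00 /bin/sh ./program
user1 3269 8747 0 13:37:05 pts/11 0:00 grep [p]rogram'
stopped='user1 3269 8747 0 13:37:05 pts/11 0:00 grep [p]rogram'

status_running=$(printf '%s\n' "$running" | check_program '[p]rogram')
status_stopped=$(printf '%s\n' "$stopped" | check_program '[p]rogram')
echo "running: $status_running"
echo "stopped: $status_stopped"
```

Because the bracketed pattern never counts the grep line, the script can test for exactly 1 (or exactly 0) instead of guessing whether grep sneaked into its own output.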

And that's it for today. There are actually quite a few more things you can do with grep, that get done a lot more efficiently as a result, but we'll take a look at those another day. As usual, it seems, in attempting to explain as clearly as possible, I have managed to crank out yet another tractate. It's times like these that I start to feel a bit long in the tooth. The older you get, the more outcomes you try to compensate for. That can make for some long posts. It can even make for some long endings to long posts ;)

Hopefully, it's been a relatively bearable read :)

Cheers,

, Mike