Thursday, November 20, 2008

Plain English Explanation Of An Awk Statement For Linux Or Unix

Hey There,

Today's post is going to be a follow-up to yesterday's post on convoluted column arithmetic in awk. I have a nasty habit of listing all my stuff as "beginner" material. Mostly this is because I try to write my material for the middle level to newer user of Unix, Linux, Perl, etc. I have absolutely no interest in picking apart the finer points of quantum mathematics or advanced concepts in programming or operating system design and implementation. I leave that for the academics. Besides, if I did go that route, I'd have to narrow the scope of this blog a "whole lot" ;)

Yesterday's post offered a number of cut-and-paste awk solutions to solving some seemingly difficult data massaging problems and, in my frantic struggle to keep my post under a thousand words, I cut some corners and simply framed the problem, following it with the solution. It's usually a good formula. I'll be the first to admit, though, that the examples were somewhat of a test for me when I first slapped them together and probably deserved to be explained more than they were. To that end, we'll look at one of the examples from yesterday and pick it apart, so that the pieces all make sense to, hopefully, any and every one. I aim to please :)

The example we'll work with is from part one of problem 2, where we looked to add the even columns and, separately, the odd columns and then print those totals at the end of each line. The original awk statement and resultant solution follow (check yesterday's awk post for the original dataset. It doesn't really make a difference insofar as this explanation will be concerned):

host # awk '{sum1=$1;sum2=$2;for (i=3;i<=NF;i++) {if ( i%2 ) {sum1 += $i; printf"%.2f ", $i}else {sum2 += $i; printf"%.2f ", $i}}printf"ODDS %.2f EVENS %.2f\n", sum1, sum2}' DATASET2

51.61 48.39 41.38 58.62 32.00 68.00 56.41 43.59 57.52 42.48 ODDS 284.75 EVENS 315.25
49.06 50.94 42.86 57.14 100.00 0.00 0.00 100.00 11.49 88.51 ODDS 259.96 EVENS 340.04
52.17 47.83 26.83 73.17 61.54 38.46 57.69 42.31 33.33 66.67 ODDS 273.23 EVENS 326.77
34.38 65.62 0.00 100.00 59.26 40.74 100.00 0.00 43.43 56.57 ODDS 257.07 EVENS 342.93
25.53 74.47 35.56 64.44 84.42 15.58 47.53 52.47 0.00 100.00 ODDS 243.04 EVENS 356.96


At first glance, I suppose that awk statement could look imposing. Fortunately, once you pull it apart, it's not all that bad. First, we'll write the awk in "plain English." Here's what the linear thought process would be if we just spelled it out:

To start out, we'll set the variable sum1 to the value of the first field. Then, we'll set the variable sum2 to the value of the second field (This sets us up with sum1 holding the first "odd" field value and sum2 holding the first "even" field value). After this we'll iterate through a looping construct that begins by setting the value of the variable i to 3, keeps going for as long as the value of i is less than or equal to the number of fields on the current line (NF), and increments the value of i by 1 after every pass.
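
If you'd like to watch the loop by itself, here's a minimal sketch. The input line ("a b c d e") and the variable name loop_demo are made up for illustration and have nothing to do with DATASET2:

```shell
# Made-up sample line with five fields; the loop starts at field 3
# and keeps going for as long as i is less than or equal to NF.
loop_demo=$(echo "a b c d e" | awk '{
    for (i = 3; i <= NF; i++) {
        printf "field %d is %s\n", i, $i
    }
}')
echo "$loop_demo"
```

With five fields on the line, the loop body runs for i equal to 3, 4 and 5 and then stops.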

For each pass through the loop (which keeps running as long as the value of i is less than or equal to the number of fields on the line, or record), we check the value of i modulo 2. If you want to look up the definitions of modulo and modulus, I strongly encourage it, but only if you're really curious ;) For our purposes, it will serve to explain that i%2 (i modulo 2) gives you the remainder of a division operation. So, if the value of i is 3, i%2 is the remainder you get when you divide 3 by 2. Since 3 doesn't divide evenly by 2, it leaves a remainder of 1. That remainder is the value of the expression i%2. On the next pass (and every pass where an even number is concerned) the value of i%2 will be 0 since, for instance, 4 divides evenly by two, with a remainder of 0. Hopefully this isn't becoming more confusing than it started out being ;)
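
A quick way to convince yourself of the remainder business is to have awk print it. This is just a sketch; it needs no input file because everything happens in a BEGIN block, and the variable name mod_demo is my own invention:

```shell
# Print the remainder of i divided by 2 for a few values of i.
mod_demo=$(awk 'BEGIN {
    for (i = 3; i <= 6; i++) {
        printf "%d mod 2 leaves %d\n", i, i % 2
    }
}')
echo "$mod_demo"
```

The odd values (3 and 5) leave 1 and the even values (4 and 6) leave 0.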

So, knowing what we now know, the "if statement" that gets run in every iteration of the enclosing "for loop" does a little bit of "backward figuring" (that is to say that the logic may not seem to be correct at first glance, although it is ;) The "if conditional" tests the value of i%2. For all even numbers this is equal to 0 and for all odd numbers it is equal to 1. So the statement "if ( i%2 )" is testing that value. In awk, as in C, any non-zero number counts as true and 0 counts as false (be careful not to confuse this with shell exit codes, where it's the other way around and 0 means success). It's almost like saying: If "true" (the case for all odd numbers) do this and if "false" (the case for all even numbers), do the other thing... On to the next part... I never thought this would be so hard to explain ;)
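
Here's a small sketch of how awk treats plain numbers inside an "if conditional". The values tested (0, 1 and 2) and the variable name truth_demo are arbitrary picks for the demonstration:

```shell
# Feed a few numbers straight into an if test to see which branch runs.
truth_demo=$(awk 'BEGIN {
    for (v = 0; v <= 2; v++) {
        if (v) {
            print v, "counts as true"
        } else {
            print v, "counts as false"
        }
    }
}')
echo "$truth_demo"
```

Only 0 lands in the else branch, which is why "if ( i%2 )" sends the odd values of i one way and the even values the other.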

This next part is pretty simple. For the odd numbers, the value of the variable sum1 is made equal to its old value plus the value of the field we're reading on this pass. If the value of i is 3, then we're adding field 3 (or $3, if you prefer). For the even numbers, the value of the variable sum2 is made equal to its old value plus the value of the field we're reading; same as with the odds. If i equals 4, we're adding the value of field i (field 4, or $4) to the already existing value of sum2. In both cases, we also do a simple printf to output the value of the field itself (not the running total) to the terminal, formatted to two decimal places. Note that, since the loop starts at 3, fields 1 and 2 get counted in the totals but never get printed themselves.
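
To see the two running totals build up, here's a stripped-down sketch with the per-field printf left out. The six-field input line and the variable name sum_demo are made up for the example:

```shell
# Made-up line: the odd fields are 1, 3, 5 and the even fields are 2, 4, 6.
sum_demo=$(echo "1 2 3 4 5 6" | awk '{
    sum1 = $1; sum2 = $2              # seed the totals with fields 1 and 2
    for (i = 3; i <= NF; i++) {
        if (i % 2) {
            sum1 += $i                # odd field number
        } else {
            sum2 += $i                # even field number
        }
    }
    print sum1, sum2
}')
echo "$sum_demo"
```

This prints "9 12", since 1+3+5 is 9 and 2+4+6 is 12.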

Now, to end it all, when the loop is done (that is, once the value of the variable i has become greater than the number of fields (NF) on the line, or record), we slip in a quick printf statement to print the totals for the odd fields (sum1) and the even fields (sum2) before moving on to the next line. This can be a confusing point because, as you may know, awk (like sed, etc) runs its main block once for every line of a file when you feed it one. It's for this reason that we're so careful with the bracketing. The final printf sits outside the for loop, but inside the main block, so it runs exactly once per line. If we'd put it inside the loop, we'd get the running totals printed after every field, and if we'd put it in an END block instead, it would only run once, after all the lines in the input file were processed, which isn't what we wanted.
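
To make the point about placement concrete, here's a sketch that runs the same logic twice on two made-up lines: once with the totals printed inside the main block, and once with the print moved into an END block. The data and the variable names (sample, per_line, at_end) are my own assumptions for the demo:

```shell
# Two made-up input lines, not taken from DATASET2.
sample=$(printf '1 2 3 4\n10 20 30 40\n')

# Totals printed inside the main block: runs once for every line.
per_line=$(echo "$sample" | awk '{
    sum1 = $1; sum2 = $2
    for (i = 3; i <= NF; i++) { if (i % 2) { sum1 += $i } else { sum2 += $i } }
    print "ODDS", sum1, "EVENS", sum2
}')

# Same logic, but the print lives in an END block: runs once, after the last line.
at_end=$(echo "$sample" | awk '{
    sum1 = $1; sum2 = $2
    for (i = 3; i <= NF; i++) { if (i % 2) { sum1 += $i } else { sum2 += $i } }
}
END { print "ODDS", sum1, "EVENS", sum2 }')

echo "$per_line"
echo "---"
echo "$at_end"
```

The first version prints a totals line for each input line (4 and 6, then 40 and 60). The second prints only one line, and because sum1 and sum2 get reset at the top of every record, it only shows the totals for the last line.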

Finally, here's the exact same statement, broken up across several lines (I "chose" to do everything on one line for yesterday's examples, because that makes it easy for me to re-run statements using my line editor in bash :) Hopefully, looking at it this way (in a proper script-like format) will help shed some more light on the "structure" of the process, just like the previous paragraphs, hopefully, helped elucidate some of the finer points of its "execution":

awk '{
    sum1=$1; sum2=$2
    for (i=3; i<=NF; i++) {
        if ( i%2 ) {
            sum1 += $i; printf "%.2f ", $i
        } else {
            sum2 += $i; printf "%.2f ", $i
        }
    }
    printf "ODDS %.2f EVENS %.2f\n", sum1, sum2
}' DATASET2


If you slapped the above in a file, called it "awky.sh" and ran:

host # sh awky.sh

you'd get the exact same results as you got from the one-line version above.
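
As an aside, awk can also read the program itself from a file if you pass it the -f option, which saves you the shell wrapper and the single quotes. Here's a rough sketch of that approach. The file names (awky.awk, sample.txt), the temporary directory and the one-line data file are all made up for the example, and I've left out the per-field printf to keep it short:

```shell
# Hypothetical file names; the data line is made up, not taken from DATASET2.
tmpdir=$(mktemp -d)

# The awk program on its own, with no surrounding shell quoting.
cat > "$tmpdir/awky.awk" <<'EOF'
{
    sum1 = $1; sum2 = $2
    for (i = 3; i <= NF; i++) {
        if (i % 2) { sum1 += $i } else { sum2 += $i }
    }
    printf "ODDS %.2f EVENS %.2f\n", sum1, sum2
}
EOF

echo "1.5 2.5 3.5 4.5" > "$tmpdir/sample.txt"

# -f tells awk to read its program from the named file.
from_file=$(awk -f "$tmpdir/awky.awk" "$tmpdir/sample.txt")
echo "$from_file"

rm -rf "$tmpdir"
```

For the sample line this prints "ODDS 5.00 EVENS 7.00".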

Hopefully, this explanation has been helpful and, if not, please let me know what still remains a point of confusion for you. Unless there are a lot of requests for certain specific bits of info that would warrant another post, I'll be happy to help clear up little bits and pieces either via email or on the boards :)

Cheers,

Mike




Please note that this blog accepts comments via email only. See our Mission And Policy Statement for further details.


