Wednesday, November 19, 2008

Convoluted Column Arithmetic Examples Using Awk On Linux Or Unix

Hey There,

Today's going to be a "fun with awk" day. I figure we should have one now because we never had them in school when I was a kid... The topic, as the title suggests, has to do with columnar arithmetic or, less pompously put, performing arithmetic operations on columns or, even more accessibly, adding stuff up ;) It's somewhat like our older post on doing simple cumulative math with awk, but slightly more confusing.

The basic premise here is that you, I and the next guy have all seen the solution to many a question like: "Given a file filled with rows of space delimited numbers, how do I add up all the values for field 2 ($2) in awk so that I can get the total when awk is through processing my file?" ...or something to that effect. Knowing the fundamentals of performing arithmetic in awk can be a great asset to you at work (or, at play, if you like to add, subtract, multiply and divide for kicks ;), but, for the most part, only the most banal instances of its application are either used or queried for on the web (at least as far as my search engine testing has shown so far).

So, without further fanfare, we'll take a look at two sets of data (below) and some various (hopefully, not utterly pedestrian) ways to sort and manipulate the data. These are pulled from a series of posts I answered on the Daniweb.com Shell Scripting Forum regarding odd and even column operations (I try to only plagiarize myself ;). I found the questions interesting to answer and, hopefully, you'll find both the requests and their resolutions of some use, as well :) I've modified some of the solutions that I posted to the board for this blog post, as I found that, at some points, my quick (trying to be helpful) answers could be done a bit better with a little extra attention. Also, the entire thread isn't played out here and some additional problems haven't been included. There's only so much I can type ;)

The setup involves a file that contains these numbers in the columns and rows - or fields and records, as we'll refer to them from here on out - shown below. The initial problem uses this dataset. The result of the initial solution forms the dataset for all the problems to follow (shown below, in, hopefully, sensible order):

Data Set 1:

55 65 48 45 48 68 32 68 44 34 88 65
82 63 52 54 51 68 75 0 0 20 10 77
55 77 60 55 22 60 40 25 75 55 45 90
20 80 33 63 0 64 32 22 75 0 43 56
54 54 12 35 48 87 65 12 77 85 0 15


Problem 1: Given "Data Set 1" (conveniently placed in a file called "DATASET1"), for each pair of successive fields , using awk's built in arithmetic operators, convert each field by dividing it by the sum of the pair and multiplying each by 100 to produce recognizable relative-percentage output. So, for the first column, you would have the results of equations of the following order:

$1/$1+$2, $2/$1+$2, $3/$3+$4, $4/$3+$4, etc....

Solution 1: This can be worked out by the following awk command (technically not a "one-liner" since it exceeds the 65 character limit):

host # awk 'BEGIN{lasti=1}{for (i=1;i<=NF;i++) {if ( i%2==0 ) {y=$lasti+$i;printf("%.2f %.2f ", $lasti/y*100, $i/y*100)}lasti=i }printf"\n" }' DATASET1

This solution produces the following output, which will become "Data Set 2" (which will be put in a file called.... DATASET2. I couldn't come up with anything more clever under pressure ;)

45.83 54.17 51.61 48.39 41.38 58.62 32.00 68.00 56.41 43.59 57.52 42.48
56.55 43.45 49.06 50.94 42.86 57.14 100.00 0.00 0.00 100.00 11.49 88.51
41.67 58.33 52.17 47.83 26.83 73.17 61.54 38.46 57.69 42.31 33.33 66.67
20.00 80.00 34.38 65.62 0.00 100.00 59.26 40.74 100.00 0.00 43.43 56.57
50.00 50.00 25.53 74.47 35.56 64.44 84.42 15.58 47.53 52.47 0.00 100.00


Problem 2: From "Data Set 2," take each line and add up the all the odd fields and all the even fields per record. Then, as an added bonus, add up all the odd fields and all the even fields in all records. This was solved by the following awk statement (Note that, on the board, the sum for the first part of this problem was not printed as part of the solution, since it was being manipulated via an external array elsewhere within a script. For this example, I'm putting the totals for each line at the end of each line and anchoring them with text for easy distinction):

host # awk '{sum1=$1;sum2=$2;for (i=3;i<=NF;i++) {if ( i%2 ) {sum1 += $i; printf"%.2f ", $i}else {sum2 += $i; printf"%.2f ", $i}}printf"ODDS %.2f EVENS %.2f\n", sum1, sum2}' DATASET2

51.61 48.39 41.38 58.62 32.00 68.00 56.41 43.59 57.52 42.48 ODDS 284.75 EVENS 315.25
49.06 50.94 42.86 57.14 100.00 0.00 0.00 100.00 11.49 88.51 ODDS 259.96 EVENS 340.04
52.17 47.83 26.83 73.17 61.54 38.46 57.69 42.31 33.33 66.67 ODDS 273.23 EVENS 326.77
34.38 65.62 0.00 100.00 59.26 40.74 100.00 0.00 43.43 56.57 ODDS 257.07 EVENS 342.93
25.53 74.47 35.56 64.44 84.42 15.58 47.53 52.47 0.00 100.00 ODDS 243.04 EVENS 356.96


and for part 2 of problem 2 - to get the grand total of all even fields and grand total of all odd fields:

host # awk 'BEGIN{sum1=$1;sum2=$2}{for (i=1;i<=NF;i++) {if ( i%2 ) sum1 += $i;else sum2 += $i}}END{print "ODDS " sum1 " EVENS " sum2}' DATASET2

ODDS 1318.05 EVENS 1681.95


Problem 3: Given "Data Set 2," how could you divide every odd field by the grand total of all odd fields, and divide every even field by the grand total of all even fields (with minor modification to set the decimal precision from 2 to 4, since the resulting values will be well below 1% ;) - For example, you would want output that went through the data file and did "$1/1318.05 $2/1681.95, $3/1318.05..." etc. This can be accomplished by the following:

host # awk 'BEGIN{odds=1318.05;evens=1681.95}{for (i=1;i<=NF;i++) {if ( i%2 ) {odd=$i/odds; printf"%.4f ", odd}else {even=$i/evens; printf"%.4f ", even}}printf"\n";}' DATASET2

0.0348 0.0322 0.0392 0.0288 0.0314 0.0349 0.0243 0.0404 0.0428 0.0259 0.0436 0.0253
0.0429 0.0258 0.0372 0.0303 0.0325 0.0340 0.0759 0.0000 0.0000 0.0595 0.0087 0.0526
0.0316 0.0347 0.0396 0.0284 0.0204 0.0435 0.0467 0.0229 0.0438 0.0252 0.0253 0.0396
0.0152 0.0476 0.0261 0.0390 0.0000 0.0595 0.0450 0.0242 0.0759 0.0000 0.0330 0.0336
0.0379 0.0297 0.0194 0.0443 0.0270 0.0383 0.0640 0.0093 0.0361 0.0312 0.0000 0.0595


and here's an alternate way to get the same thing (combining problems 2 and 3 with the assumption that we don't know the exact values of the "odds" and "evens" variables I populated in the BEGIN section of problem 3):

host # echo `awk 'BEGIN{sum1=$1;sum2=$2}{for (i=1;i<=NF;i++) {if ( i%2 ) sum1 += $i;else sum2 += $i}}END{printf"%.4f %.4f", sum1,sum2}' DATASET2`|while read x y;do awk -v odds=$x -v evens=$y '{for (i=1;i<=NF;i++) {if ( i%2 ) {odd=$i/odds; printf"%.4f ", odd}else {even=$i/evens; printf"%.4f ", even}}printf"\n"}' DATASET2;done

0.0348 0.0322 0.0392 0.0288 0.0314 0.0349 0.0243 0.0404 0.0428 0.0259 0.0436 0.0253
0.0429 0.0258 0.0372 0.0303 0.0325 0.0340 0.0759 0.0000 0.0000 0.0595 0.0087 0.0526
0.0316 0.0347 0.0396 0.0284 0.0204 0.0435 0.0467 0.0229 0.0438 0.0252 0.0253 0.0396
0.0152 0.0476 0.0261 0.0390 0.0000 0.0595 0.0450 0.0242 0.0759 0.0000 0.0330 0.0336
0.0379 0.0297 0.0194 0.0443 0.0270 0.0383 0.0640 0.0093 0.0361 0.0312 0.0000 0.0595


And, hopefully, that's enough awk for now. Check out the Daniweb.com Shell Scripting Forum boards to see some of the great solutions other folks come up with (for some pretty bizarre problems) and, if it's just not your day, a few of my other monstrosity's ;)

Cheers,

, Mike




Please note that this blog accepts comments via email only. See our Mission And Policy Statement for further details.