The Linux and Unix Menagerie: Webserver Access Log HTML Element Counting

Tuesday, March 11, 2008

Webserver Access Log HTML Element Counting - Improved!

Hey There,

Today's post is a follow-up to a post we did not too long ago on using bash to help report on web server usage. In it, we introduced a now somewhat-infamously top-heavy bash shell script that looks for whatever elements you want to find in your web server's access log and reports counts on them.

Thanks for today's new and improved version go to a gentleman by the name of Phil, who was kind enough to show me a thing or two by posting his own rewrite of that element count script on our forum :) Because I approached the original with a more "editor-trick", stream-of-consciousness style, my script was a bit slow. Since I was running all my greps quietly, then checking the exit status ($?) and then updating my counts for all elements individually on each line, it had the built-in potential to take some time to run.

I think you'll like what Phil has done with it and appreciate the brevity of his version of the script. It is also much cleaner and runs much faster. For instance, compared with my original script, the difference in run time was significant even against an access log of only about 500 lines.

The following are real tested numbers (I would never lie about my script being slower ;)

host ./original_htmlElementcount.sh:

real 0m3.358s
user 0m1.201s
sys 0m2.216s

host ./new_htmlElementcount.sh

real 0m0.108s
user 0m0.092s
sys 0m0.020s

And, for those of you asking the question, the difference between 3 seconds and 1/10th of 1 second can become astronomical if the time-to-run grows in proportion to the size of the log file. Most access files are much larger than 500 lines (at companies that can still afford to pay the electric ;)
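To put a rough number on that, here is a minimal sketch that projects both "real" times from the 500-line test onto a larger log. It assumes run time grows in direct proportion to the line count, which is only an approximation, and the function name project_seconds and the 1,000,000-line target are made up for illustration.

```shell
#!/bin/bash
# Linear projection of run time from the 500-line measurements above.
# Assumption: run time scales in direct proportion to the number of lines.
# project_seconds is a hypothetical helper, not part of either script.
project_seconds() {
    local measured=$1 sample_lines=$2 target_lines=$3
    awk -v t="$measured" -v n="$sample_lines" -v m="$target_lines" \
        'BEGIN { printf "%.0f\n", t / n * m }'
}

project_seconds 3.358 500 1000000   # original script: prints 6716
project_seconds 0.108 500 1000000   # new script: prints 216
```

Under that assumption, a million-line log would take the original script just under two hours and the new one under four minutes. Treat these as ballpark figures, not measurements.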

I think the biggest point to take away from this is that the shell gets things done a lot faster when it launches a handful of external commands once than when it has to launch them over and over. You'll notice that, in the new script, each command reads the entire file in a single pass and does its own counting. In the previous reporting script, we iterated through each line and ran grep repeatedly for every element. You'll also notice that egrep and mixed-case bracket expressions (like [Hh]) are used to do the matching, which, as it turns out, is quite a bit faster than using the -i option to grep.
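The contrast between the two approaches can be sketched as follows. This is not the code from either script, just a stripped-down illustration of the idea; the function names count_per_line and count_single_pass are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the two strategies; neither is taken
# verbatim from the original or the rewritten script.

# Per-line approach: the shell loops over the log and starts one grep
# process for every single line.
count_per_line() {
    local pattern=$1 file=$2 count=0 line
    while IFS= read -r line; do
        # grep -q prints nothing; its exit status says whether the line matched
        if printf '%s\n' "$line" | grep -Eq "$pattern"; then
            count=$((count + 1))
        fi
    done < "$file"
    echo "$count"
}

# Single-pass approach: one grep process reads the whole file and
# counts the matching lines itself (-c).
count_single_pass() {
    local pattern=$1 file=$2
    grep -Ec "$pattern" "$file"
}
```

Both functions print the same count for a call such as count_single_pass '[.][Hh][Tt][Mm][Ll]' access_log, but the first one pays the cost of starting a new process once per line (and the original script did that once per element per line), which is most likely where the bulk of the time went.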

And the best thing of all is that this is still a great example of porting from our Perl log element reporting script. In fact, it probably helps make that Perl script, and the porting process, more easily understandable by showing it from an alternate perspective!

Thanks, again, for your contribution, Phil :)

Cheers,

#!/bin/bash

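#
# htmlElementcount.sh
#
# Counts hits per file type in a web server access log.
#

# Print a message to stderr and exit with a failure status
function die {
   echo "$*" >&2 ; exit 1
}

# Quote "$1" and "$0" so paths containing spaces still work
[ 1 -ne $# ] && die "usage: $(basename "$0") LOG_FILE"
[ ! -e "$1" ] && die "LOG_FILE [$1] does not exist"
[ ! -s "$1" ] && die "LOG_FILE [$1] is empty"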

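# Each command below reads the whole log once and does its own counting.
# grep -E is the current spelling of egrep; the [Hh]-style bracket
# expressions match either case without needing -i.
cat <<EOF
Page Hits
$(wc -l < "$1" | tr -d ' ') pages accessed - Form Elements Processed:
$(grep -Ec '[.][Hh][Tt][Mm][Ll]' "$1" | tr -d ' ') html pages accessed
$(grep -Ec '[.][Gg][Ii][Ff]' "$1" | tr -d ' ') GIF files accessed
$(grep -Ec '[.][Jj][Pp][Gg]' "$1" | tr -d ' ') jpg files accessed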
EOF

exit 0