Thursday, August 28, 2008

Online Encyclopedia Shell Script For Linux And Unix

Hey There,

Today's Linux/Unix bash shell script is a follow-up (and the finishing touch, I hope ;) to a whole slew of scripts we've written to mine the knowledge on tap at reference.com. If you missed any of the others, you can still find them in our older bash script posts to access the online Thesaurus and, of course, the online dictionary. This time we're breaking the chains and losing the albatross by finally attacking (with some level of precision) the Online Encyclopedia and producing a shell script that does it as much justice as we could muster.

Note added after initial post: Forgot to include the Online Language Translation Script. It fits in with this bunch fairly well :)

One of the major differences you'll note between this script and the others is that we've included a "pager" variable. If you've ever checked out the reference.com Online Encyclopedia you're aware that the average return on a query is substantially larger than what you get back from a typical dictionary definition or thesaurus synonym query (not to mention the size of your terminal window). The screen shots we have for you, below, were the result of painstaking effort trying to find a word or phrase that didn't have an exceedingly long and thorough write-up. We chose "/usr/bin/more" as the pager for this script, but it can be easily changed to "/usr/bin/less," or whatever your favorite pager program happens to be.
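If you'd rather not edit the script every time you switch pagers, the variable can just as easily be made overridable from the environment. A minimal sketch (the ENCY_PAGER variable name here is our own invention, not part of the script):

```shell
# Fall back to /usr/bin/more unless the user exports ENCY_PAGER
# (ENCY_PAGER is a hypothetical name, not part of the original script)
pager="${ENCY_PAGER:-/usr/bin/more}"
echo "Using pager: $pager"
```

Run as-is it reports /usr/bin/more; run with ENCY_PAGER=/usr/bin/less exported, it reports less instead, with no edits to the script.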

After toying with this project for a while, trying to figure out why some pages worked, some didn't, and where in the good Lord's name any consistent marker could be found to create consistent output, I've actually kind of gotten into using it. Now, if I'm at my terminal "working my fingers to the bone ;) " I can just type:

host # ./ency.sh engelbert humperdinck

and try to figure out what my generation "doesn't" see in him ;) ...apologies to die-hard fans.
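Under the hood, a multi-word query like that has its spaces converted to %20 before being pasted into the URL. That substitution is just a one-line sed, shown here on its own:

```shell
# Join a multi-word query for use in a URL by swapping spaces for %20
args="engelbert humperdinck"
args=`echo $args | sed 's/ /%20/g'`
echo "$args"
# → engelbert%20humperdinck
```

The reverse substitution (s/%20/ /g) is used later to put the spaces back when printing error messages.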

Like our previous scripts, this script uses wget and sed. This script was also an attempt at "brute force scripting" (just writing it as we think it), but it ended up being so aggravating trying to find the right mix of regular expressions to get back decent results for the broadest range of queries that I think the method can now only be referred to as "blunt force scripting" (that's our new term for scripting as if you'd just been hit in the head with a brick ;). There are a few places the script could be tightened up, but, to be perfectly honest with you, we're scared to death of changing anything about it right now. Maybe later, when we've steeled our nerves ;)
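The gnarliest of those regular expressions is the tag-stripper near the end of the script. If you're curious what it's actually doing, here it is run against a made-up line of HTML: every tag gets replaced with a space, and the :a/N/ba loop pulls in the next line of input whenever a stray < indicates a tag was split across lines:

```shell
# Strip HTML tags by replacing each <...> with a space; the branch
# back to :a after an N keeps going when a tag spans multiple lines
echo '<p>Hello <b>world</b></p>' | sed -e :a -e 's/<[^>]*>/ /g;/</N;//ba'
```

What's left is just the text "Hello world" (with each tag turned into whitespace), which is exactly what gets handed to the pager.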

Below are a few screen shots of the script's output. The first is an actual "one-pager" from a query for a pretty decent novel named Glamorama, and the second is what you can expect to see when a query returns the equivalent of 15 leatherbound volumes of text, as it does for Robert De Niro. The last just shows common error messages we generated on purpose.

Possibly interesting fact: If anyone out there is having an 80's moment and vaguely recalls Toni Basil ("oh, Mickey, you're so fine, you're so fine, you blow my mind..." why do I know the lyrics? ;), you'd be surprised to know that even "her" entry spanned multiple pages. I was actually surprised to find that she's also an actress and starred with Jack Nicholson in Five Easy Pieces under the very same name.

Click on the captures below to see them all in IMAX:

Glamorama Encyclopedia Entry

Robert De Niro Encyclopedia Entry

Encyclopedia Shell Script Errors

As one last bit of explanation: there are 2 specific error conditions that we ended up throwing in here. One is pretty obvious, involving No Results being found. The other had me going for a little while, since I was positive that a query on "linux" must have some return, since a query for "unix" did.

It turns out that, when the Online Encyclopedia returns way too many results (I believe around 752 for "linux"), it will state that it found a whole ton of results and is only going to show you one. The glitch in the program here is that, at some large number, it says that but then doesn't return an actual entry. I ended up noticing this when I finally broke down and parsed the HTML return and saw nothing, followed by a visit to the website via browser, where I noticed the condition. You can see what I'm talking about right here, unless they've fixed it already (or it only happens intermittently).

In any event, we wrote a check so that, if there are a large number of results and the site does return one, you'll still get it, but you'll get a self-explanatory error message if you hit this bug. Not that it needs to be said, but every query returns more than one result (insofar as our field-testing has gone), and the default behaviour of the site is to only return the top "one" and include links to the others.
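The detection itself is nothing fancy: both checks just grep the fetched page for a marker string, throw away the output, and branch on grep's exit status later. The pattern boils down to this (the sample page text below is made up for illustration):

```shell
# grep exits 0 when the marker string is present, non-zero otherwise;
# we save that status in a variable and branch on it further down
page="Sorry - No results found for flurble"
echo "$page" | grep -i "no results found" >/dev/null 2>&1
anygood=$?
if [ $anygood -eq 0 ]
then
    echo "Bailing out: the site reported no results"
fi
```

Saving $? into a named variable right away is the important part; any command run in between would clobber it.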

Hope you enjoy this script, and can find some use for it. Any suggestions for improvement would be greatly appreciated. Be forewarned: there is the possibility that some of your queries may come back with extra parts added on that we didn't catch (usually tacked on to the end... phew), which could also make the output look horrible. This is the best we've got "for now" :)

I'm just glad somebody already figured out how to order Pizza online so I don't have that hanging over my head ;)

Cheers,


Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!/bin/bash

#
# ency.sh - More than you ever wanted to know about anything
#
# 2008 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

if [ $# -lt 1 ]
then
    echo "Usage: $0 Your Encyclopedia Terms"
    echo "May be two or more words separated by spaces"
    echo "but only one definition per execution."
    echo "Ex: $0 money"
    echo "Ex: $0 the beatles"
    exit 1
fi

args="$@"
wget=/usr/bin/wget
pager=/usr/bin/more

# URL-encode the spaces in a multi-word query
if [ $# -gt 1 ]
then
    args=`echo $args|sed 's/ /%20/g'`
fi

echo

# First probe: did the site report no results at all?
$wget -nv -O - "http://www.reference.com/search?q=$args" 2>&1|grep -i "No results found" >/dev/null 2>&1

anygood=$?

# Second probe: a <cite> tag marks an actual entry in the return;
# its absence means we hit the "too many results" glitch
$wget -nv -O - "http://www.reference.com/search?q=$args" 2>&1|grep -i "<cite>" >/dev/null 2>&1

toomuch=$?

if [ $anygood -eq 0 ]
then
    args=`echo $args|sed 's/%20/ /g'`
    echo "No results found for $args"
    exit 2
fi

if [ $toomuch -ne 0 ]
then
    args=`echo $args|sed 's/%20/ /g'`
    echo "Too many results returned for $args"
    echo "Try doing a more specific query - For Ex:"
    echo "$0 linux = Too many results"
    echo "$0 linux os = Information!"
    echo "Double check at www.reference.com"
    echo "to see the difference"
    exit 3
fi

# Grab the entry, chop off everything before "Cite This Source" and
# everything from the trailing link sections on, strip the HTML tags
# and hand the result to the pager
$wget -nv -O - "http://www.reference.com/search?q=$args" 2>&1|sed '1,/Cite This Source/d'|sed '/External links\|See also\|More from Wikipedia/,$d'|sed -e :a -e 's/<[^>]*>/ /g;/</N;//ba' -e 's/$/\n/'|$pager

exit 0


Mike




Douglas Taylor submitted these enhancements to make the output look much nicer. Thanks, Douglas!


1) Run the output through fmt before outputting. It will wrap the paragraph text fairly nicely, although it kind of destroys tabular data, and the indenting can get weird, or

2) Use lynx with the -dump option instead of wget. I replaced the final wget command with

lynx -dump "http://www.reference.com/search?q=$args" |sed '1,/Cite This Source/d'|sed '/External links\|See also\|More from Wikipedia/,$d'|sed 's/\[[0-9]*\]//g'|$pager

and the result looks fairly nice.
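If you go with suggestion 1 instead, fmt re-flows words so that no output line exceeds a target width. Here it is on a short sample line, forced down to 20 columns so the wrapping is visible:

```shell
# fmt re-wraps the words so no output line exceeds the given width
printf 'one two three four five six seven\n' | fmt -w 20
```

The -w 20 is just for demonstration; left at its default width, fmt produces lines sized sensibly for a terminal, which is what you'd want when piping the script's output through it.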


Please note that this blog accepts comments via email only. See our Mission And Policy Statement for further details.