Friday, January 2, 2009

Yahoo Search Script Fixed - Looking Backward At My Linux And Unix Mistakes

Good morning,

It's actually almost 9pm on New Year's Day and I'm just over my initial waking stupor. Surprisingly, I had a clear thought while I was cursing myself for not pre-writing the one post I "knew" I'd be too zonked to write coherently ;)

In any event, I fixed the original Yahoo CLI search index rank script that we put out last week and added a few extra touches, like some spacing. The spacing isn't shown in today's example pictures because I needed to squeeze the results onto my screen to satisfy my screen capture software. That's ScreenPrint32, btw, if you're looking for something that has full functionality and an unlimited use policy. I registered mine so my conscience wouldn't nag me about using it for this blog all the time, but it's not necessary.

IMPORTANT NOTE: Although this warning is on the original Google search rank index page, it bears repeating here and now. If you use wget (as we are in this script), or any CLI web-browsing/webpage-grabbing software, and want to fake the User-Agent, please be careful. Please check this online article regarding the likelihood that you may be sued if you masquerade as Mozilla.

It turns out, the main SNAFU (situation normal, all f***ed up) was that I didn't include a --user-agent option in my wget command... oops.
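To make the difference concrete, here's a minimal sketch that only prints the two wget command lines rather than running them. The helper name fetch_cmd is made up for illustration and the URL is a trimmed-down stand-in for the one in the script. Without the option, wget announces itself as "Wget/<version>".

```shell
# Hypothetical helper (not part of the original script): print the wget
# command line for one results page, with or without a User-Agent string.
fetch_cmd() {
    url=$1
    agent=$2
    if [ -n "$agent" ]; then
        printf "wget -O - --user-agent=%s '%s'\n" "$agent" "$url"
    else
        # Without the option, wget identifies itself as "Wget/<version>"
        printf "wget -O - '%s'\n" "$url"
    fi
}

# The old (broken) call, then the fixed one
fetch_cmd "http://search.yahoo.com/search?p=unix+linux" ""
fetch_cmd "http://search.yahoo.com/search?p=unix+linux" "Firefox"
```

Keep the warning above in mind before you put anything fancier than "Firefox" in that second argument.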

Check out the test run pictures below, to see what a difference it makes, and feel free to laugh "with" or "at" me for the bone-headed mistake. I'm immune to feeling any worse right now ;)

For the newer pictures, I added a line to the "searchterms" file to make sure I could get a result from the last search after the few preceding it failed. Also, for this test, like the last, I modified the script slightly to only go after the first 200 results, rather than the default maximum of 1000 that Yahoo will return per search. As the newer pictures demonstrate, you don't need to go after the full 1000 like we did last time in order to get bounced.
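As a rough sketch of what that limit means in practice: the script asks for 100 results per page and moves its start offset along by 100 each time. The helper below (page_offsets is a hypothetical name, not something in the script) just lists the offsets that would be requested for a given limit, assuming the limit is enforced the same way as the script's loop test.

```shell
# Hypothetical helper: list the start offsets the script would request
# for a given result limit, at 100 results per page.
page_offsets() {
    limit=$1
    start=1
    while [ "$start" -le "$limit" ]; do
        echo "$start"
        start=$((start + 100))
    done
}

page_offsets 200    # the 200-result test run: two pages
```

With a limit of 1000 the same loop produces ten offsets, which is ten chances to get bounced instead of two.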

host # cat searchterms
linuxshellaccount.blogspot.com unix linux
linuxshellaccount.blogspot.com linux unix
linuxshellaccount.blogspot.com unix and linux
linuxshellaccount.blogspot.com linux and unix
linuxshellaccount.blogspot.com unix
linuxshellaccount.blogspot.com linux
linuxshellaccount.blogspot.com perl script
linuxshellaccount.blogspot.com shell script
linuxshellaccount.blogspot.com killing zombie processes
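Each line of that file is a URL followed by the search terms. Here's a small sketch of how a line like that gets split and turned into a query string, assuming the same "first word is the URL, the rest are terms" convention the script uses. The function name parse_terms is made up for this example.

```shell
# Sketch: split each "URL term term ..." line into the URL and the rest,
# then join the terms with "+" the way the script does for the query.
parse_terms() {
    while read -r url terms; do
        query=$(printf '%s\n' "$terms" | sed 's/ /+/g')
        echo "$url -> $query"
    done
}

printf '%s\n' "linuxshellaccount.blogspot.com killing zombie processes" | parse_terms
```

The first field lands in url and everything after it, spaces included, lands in terms, which is why multi-word searches don't need quoting inside the file.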


Here's the output from the last version in action with a short "searchterms" file:

Click the picture below to Biggie-Size it ;)

The Old Yahoo Results

And here's output from the new version with a slightly longer "searchterms" file:

Click the picture below and prepare to be impressed only slightly more than you may already be ;)

Much Better Yahoo Results

And, of course, here's the control-test using the older version with the newer "searchterms" file:

Click the picture below if you have yet to be stricken by awe today ;)

The Bad Yahoo Results Tested Again

And here's the latest revision of the script. Enjoy not being booted by Yahoo for a bit longer than usual ;)

Cheers,


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!/bin/bash

#
# Yahoo CLI search v2 -- fixed some really stupid mistakes ;)
#
# 2009 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

# Exactly one argument is an error: we need a URL plus search terms, or nothing at all
if [ $# -lt 2 -a $# -ne 0 ]
then
    echo "Usage: $0 URL Search_Term(s)"
    echo "URL with or without http(s)://, ftp://, etc"
    echo "Double Quote Search If More Than 1 Term"
    exit 1
fi

# No arguments: read "URL search terms" lines from stdin and run this script once per line
if [ $# -eq 0 ]
then
    while read x y
    do
        url=$x
        search=$y
        $0 $x "$y"
    done
    exit 0
else
    url=$1
    shift
    search="$*"
fi

search_terms=`echo $search|sed 's/ /+/g'`
start=1
count=1

echo
echo "Searching YAHOO for URL $url with search terms: $search"
echo

# Pull the approximate total number of results from the first page
results=`wget -O - --user-agent=Firefox "http://search.yahoo.com/search?p=${search_terms}&ei=UTF-8&fr=yfp-t-501&pstart=1&b=$start" 2>/dev/null|sed -n 2p 2>&1|sed 's/^.* of \([0-9,]*\) for .*$/\1/'`

# Walk the results 100 at a time, up to the 1000 that Yahoo will return
while [ $start -lt 1001 ]
do
    # First request: check whether Yahoo is bouncing us with its "error 999" page
    wget -O - --user-agent=Firefox "http://search.yahoo.com/search?p=${search_terms}&ei=UTF-8&fr=yfp-t-501&pstart=1&b=$start&n=100" 2>&1|grep "error 999" >/dev/null 2>&1
    screwed=$?
    if [ $screwed -eq 0 ]
    then
        echo
        echo "You have been temporarily barred due to excessive queries."
        echo "Please lower the divisor used to set the \"random\" variable in"
        echo "this script, to increase wait time between queries, or take"
        echo "5 or 10 minutes before you run this script again!"
        echo
        exit 1
    fi
    # Second request: strip the page down to one result URL per line,
    # dropping Yahoo's own links and cache links
    wget -O - --user-agent=Firefox "http://search.yahoo.com/search?p=${search_terms}&ei=UTF-8&fr=yfp-t-501&pstart=1&b=$start&n=100" 2>/dev/null|
    sed -n 2p 2>&1|
    sed 's/^.* of [0-9,]* for //'|
    sed 's/<[^>]*href="\([^"]*\)"[^>]*>/\n\1\n/g'|
    sed -e :a -e 's/<[^>]*>//g;/</N;//ba'|
    grep "^http"|
    sed '/^http[s]*:\/\/[^\.]*\.*[^\.]*\.yahoo.com/d'|
    sed '/cache?ei/d'|
    uniq|
    while read line
    do
        echo "$line"|grep "$url" >/dev/null 2>&1
        yes=$?
        # echo "DEBUG: $line"
        if [ $yes -eq 0 ]
        then
            echo "Result $count of approximately " $results " results for URL:"
            echo "$line"
            # This loop runs in a pipeline subshell, so exit status 1 is how we signal a match
            exit 1
        else
            let count=$count+1
        fi
    done
    end=$?
    if [ $end -eq 1 ]
    then
        exit 0
    else
        let start=$start+100
        # Changes to count inside the subshell are lost, so advance it by a full page here
        let count=$count+100
        let new_limit=$start-1
        # RANDOM is 0-32767, so this waits between 0 and 54 seconds
        let random=${RANDOM}/600
        echo "Not in first $new_limit results"
        echo "waiting $random seconds..."
        sleep $random
    fi
done
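One thing worth knowing about the wait between pages: bash's RANDOM ranges from 0 to 32767, so dividing by 600 gives anywhere from 0 to 54 seconds, and a wait of 0 is no wait at all. Here's a minimal sketch of the same calculation with a minimum pause added. The function name wait_seconds and the floor value are assumptions for illustration, not something in the script above.

```shell
# Hypothetical variant of the script's wait calculation: same integer
# division, plus a floor so the pause is never zero seconds.
wait_seconds() {
    raw=$1        # a value in 0..32767, such as bash's RANDOM
    divisor=$2    # the script uses 600
    floor=$3      # assumed minimum pause, not in the original
    secs=$((raw / divisor))
    if [ "$secs" -lt "$floor" ]; then
        secs=$floor
    fi
    echo "$secs"
}

wait_seconds 32767 600 5   # largest possible draw
wait_seconds 100 600 5     # small draw, raised to the floor
```

In the script itself you'd pass $RANDOM as the first argument. Lowering the divisor stretches the maximum wait, which is the knob to turn if you keep getting barred.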

, Mike



