Monday, December 29, 2008

Finding Your Yahoo Search Index Rank From The Unix Or Linux CLI

Hey There,

Today we're going to continue our ongoing quest to rank highly in search engine results while simultaneously messing with the engines a lot ;) Previously, we've put out scripts to find your MSN search index rank from the CLI and to find your Google search index rank from the CLI. This script, of course, fits in the same category as the other two, but it's distinctive in several ways:

1. This time we're scouring Yahoo's search results.

2. The search results are parsed differently, which makes it easier for the script to detect the point at which Yahoo cuts you off and won't let you run another search for a good 10 or 15 minutes.

3. The parsing code has been pared down somewhat, so that it may actually be readable by humans who aren't pipe-chain-sed-awk-regular-expression fanatics ;)

4. See the continuation below for a very interesting analysis of Yahoo's robot-tolerance.

BUT FIRST, THIS VERY IMPORTANT NOTE: If you use wget (as we do in this script), or any CLI web-browsing/webpage-grabbing software, and want to fake the User-Agent, please be careful. Check this online article, from the folks who maintain wget themselves, regarding the likelihood that you could be sued for masquerading as Mozilla.
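If you do need to set the User-Agent explicitly, wget's --user-agent option will do it for you. The example below is only an illustration (the agent string is made up for this post); identifying yourself honestly is a lot safer than pretending to be a browser:

host # wget --user-agent="yrank-script/1.0 (you@yourdomain.com)" -O - "http://search.yahoo.com/search?p=unix+linux" 2>/dev/null | head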

This Yahoo script, of course, is only slightly different from the original Google and MSN scripts, although the differences are significant enough that most of the core was rewritten completely. The script itself operates the same way our Google and MSN search index rank scripts do, insofar as executing it from the command line goes. There are at least three different ways you can call it, the most basic being:

host # ./yrank www.yourdomain.com all these key words

It doesn't matter whether or not the search terms are enclosed in double quotes. If you "really" want to get the double-quote experience, you just need to backslash your double quotes:

host # ./yrank www.yourdomain.com \"all these key words\"

Other ways include creating files with the URL and keyword information (same format as the command line) and feeding them to the script's STDIN:

host # cat FILE|./yrank
host # ./yrank <FILE


Point 4 (continued from above): Yahoo robot search tolerance as compared with Google. This is actually quite interesting since, I believe, the general assumption is that Google is far less tolerant of automated, robot-driven interaction with its search than Yahoo is. However, in this case (and we've repeated this experiment over and over again) the opposite is, in fact, true. Check it out! :)

The setup is that we've created a simple file called "searchterms" to feed to both the grank and yrank scripts. It contains the following information:

host # cat searchterms
linuxshellaccount.blogspot.com unix linux
linuxshellaccount.blogspot.com linux unix
linuxshellaccount.blogspot.com unix and linux
linuxshellaccount.blogspot.com linux and unix
linuxshellaccount.blogspot.com unix
linuxshellaccount.blogspot.com linux
linuxshellaccount.blogspot.com perl script
linuxshellaccount.blogspot.com shell script


Then we put each search engine to the test, grabbing results at 100 per page. You'll notice that the Google search engine makes it through the entire bunch without kicking us to the curb ;)

[Image: Google Robot Tolerance Test - click through to see it full size]

And here is the exact same experiment, this time run against Yahoo's search engine. It ...just ...barely ...nope. It doesn't make it, again ;)

[Image: Yahoo Robot Tolerance Test - click through to see it full size]

Our initial suggestion is to change this line in the script (decreasing the number that you divide RANDOM by will increase the maximum wait time between tries):

let random=${RANDOM}/600


Since bash's RANDOM variable produces an integer between 0 and 32767, a divisor of 600 gives wait times of roughly 0 to 55 seconds. Reducing that number to 300 gives wait times of roughly 0 to 110 seconds, and so on. Other shells' RANDOM implementations may behave differently, so adjust to taste.
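If it helps to see it in isolation, here's the wait logic on its own. The divisor variable is just for illustration; the script itself hard-codes the 600:

# bash's RANDOM produces an integer between 0 and 32767, so the longest
# possible wait is roughly 32767 divided by whatever you divide RANDOM by
divisor=600
let random=${RANDOM}/${divisor}
echo "waiting $random seconds (maximum is roughly $((32767 / divisor)))..."
sleep $random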

Another possibility, which we didn't have time to fully test (so it's not included in the script), is that Yahoo may actually object to the direct manipulation of the GET request. It would probably respond more favorably if we extracted the URL for each successive request from the "Next" link on the search page, rather than moving on to the next valid (although coldly calculated) GET string to bring up the next set of 100 results. Time will tell; experimentation is ongoing.
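For anyone who wants to experiment along those lines, here's a rough, untested sketch of what that might look like. It assumes the next-page anchor on the results page contains the literal text "Next" and a plain href attribute (the page and next_url variables are just scratch names for the sketch) - Yahoo's actual markup may not cooperate, so treat this as a starting point rather than a drop-in replacement:

# Fetch the current results page and try to pull the href out of the "Next" link,
# instead of calculating the next b= offset ourselves
page=`wget -O - "http://search.yahoo.com/search?p=${search_terms}&b=$start" 2>/dev/null`
next_url=`echo "$page"|sed 's/</\n</g'|grep -i '<a [^>]*href="[^"]*"[^>]*>Next'|sed 's/^.*href="\([^"]*\)".*$/\1/'|head -1`
if [ -n "$next_url" ]
then
echo "Following next-page link: $next_url"
else
echo "No \"Next\" link found - falling back to the calculated GET string"
fi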

Hope you enjoy it, again, and that you're still enjoying the holidays. Even if none of them apply to your religious or moral belief system, at least you get some paid time off of work :)

Cheers,


Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!/bin/bash

#
# yrank - Get yer Yahoo's Out ;)
#
# 2008 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

if [ $# -lt 2 -a $# -ne 0 ]
then
echo "Usage: $0 URL Search_Term(s)"
echo "URL with or with http(s)://, ftp://, etc"
echo "Double Quote Search If More Than 1 Term"
exit 1
fi

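# No arguments at all: read "URL search-terms" pairs from STDIN and re-invoke this script for each line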
if [ $# -eq 0 ]
then
while read x y
do
url=$x
search=$y
$0 $x "$y"
done
exit 0
else
url=$1
shift
search=$@
fi

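# Turn the spaces in the search terms into +'s for the query string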
search_terms=`echo $search|sed 's/ /+/g'`
start=1
count=1

echo "Searching for URL $url with search terms: $search"

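# Grab the first results page and pull out the approximate total number of results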
results=`wget -O - http://search.yahoo.com/search?p=${search_terms}\&ei=UTF-8\&fr=yfp-t-501\&pstart=1\&b=$start 2>/dev/null|sed -n 2p 2>&1|sed 's/^.* of \([0-9,]*\) for .*$/\1/'`

while [ $start -lt 1001 ]
do
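# Check whether Yahoo is serving us its "error 999" page, which means we've been temporarily rate-limited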
wget -O - http://search.yahoo.com/search?p=${search_terms}\&ei=UTF-8\&fr=yfp-t-501\&pstart=1\&b=$start\&n=100 2>&1|grep "error 999" >/dev/null 2>&1
screwed=$?
if [ $screwed -eq 0 ]
then
echo
echo "You have been temporarily barred due to excessive queries."
echo "Please change the \"random\" variable in this script to a"
echo "lower value, to increase wait time between queries, or take"
echo " 5 or 10 minutes before you run this script again!"
echo
exit 1
fi
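# Pull down 100 results, break the href's onto their own lines, strip the remaining markup, drop Yahoo's own links, and walk through the result URLs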
wget -O - http://search.yahoo.com/search?p=${search_terms}\&ei=UTF-8\&fr=yfp-t-501\&pstart=1\&b=$start\&n=100 2>/dev/null|sed -n 2p 2>&1|sed 's/^.* of [0-9,]* for //'|sed 's/<[^>]*href="\([^"]*\)"[^>]*>/\n\1\n/g'|sed -e :a -e 's/<[^>]*>//g;/</N;//ba'|grep "^http"|sed '/^http[s]*:\/\/[^\.]*\.*[^\.]*\.yahoo.com/d'|sed '/cache?ei/d'|uniq|while read line
do
echo "$line"|grep $url >/dev/null 2>&1
yes=$?
if [ $yes -eq 0 ]
then
echo "Result $count of approximately " $results " results for URL:"
echo "$line"
exit 1
else
let count=$count+1
fi
done
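# The piped while loop above runs in a subshell and exits 1 when the URL is found, so a status of 1 here means the rank was printed and we're done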
end=$?
if [ $end -eq 1 ]
then
exit 0
else
let start=$start+100
let count=$count+100
let new_limit=$start-1
let random=${RANDOM}/600
echo "Not in first $new_limit results"
echo "waiting $random seconds..."
sleep $random
fi
done


, Mike



