Wednesday, August 13, 2008

Finding Your Google Index Rank With The Linux Or Unix CLI

Hey There,

Today, I'm throwing out a little script that's most definitely a "work in progress." I ended up slapping this together after doing a lot of manual search engine monitoring for this site (it's hard work, doing all that clicking ;). Every time I post, I'm curious about where my site stands with regard to certain keywords or keyword phrases. Of course, it doesn't make all the difference in the world, but when you combine an obsessive-compulsive personality and a problem, no matter how goofy, there's always an answer :)

VERY IMPORTANT NOTE: This script uses wget's --user-agent option. DO NOT use "Mozilla" as your user agent! Check out this page, near the bottom, regarding being sued for masquerading as Mozilla.

The script attached below can be called (and fed arguments) in a few different ways. In essence, it searches Google (using the search keyword or words you provide) and finds the index of the first occurrence of the URL you entered in Google's search results. Of course, it's the definition of a cheap hack, since I'm not using Google's API (unless they've adopted bash ;)
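To make the "first occurrence" idea concrete, here's a minimal sketch that runs against a hard-coded list of result URLs rather than a live Google page. The function name first_rank and the sample URLs are made up for illustration; the actual script does this step with a sed/awk/grep pipeline instead.

```shell
#!/bin/bash

# first_rank TARGET: read result URLs on STDIN (one per line, in rank order)
# and print the 1-based position of the first one that contains TARGET.
# Hypothetical helper for illustration only; it is not part of grank.
first_rank() {
    local target=$1 line n=0
    while IFS= read -r line
    do
        n=$((n + 1))
        # Case-insensitive like grank's "grep -i", but matched as a fixed string
        if printf '%s\n' "$line" | grep -qiF -- "$target"
        then
            echo "$n"
            return 0
        fi
    done
    # TARGET never showed up in the list
    return 1
}

# Made-up sample results standing in for a page of search hits
printf '%s\n' \
    "http://www.example.org/accounting" \
    "http://www.example.com/shell-process-accounting" \
    "http://www.example.com/another-page" | first_rank example.com
```

With this sample list, the first line that contains example.com is the second one, so that's the position that gets printed.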

Below are a few of the ways you can call the program. Basically, you can supply the arguments to the program, echo your args through a pipe to the program, cat a file through a pipe to the program or redirect a file to the program's STDIN. Examples below:

[Image: grank usage]
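For anyone who can't see the screenshot, here are the four calling styles as text. To keep this runnable without hitting Google, the sketch uses a stand-in function (grank_demo, a made-up name) that only mimics the script's argument handling. The example.com URL and the temporary input file are placeholders, too. With the real script, you'd put ./grank wherever grank_demo appears.

```shell
#!/bin/bash

# grank_demo: stand-in that mimics grank's argument handling only.
# With arguments: the first is the URL, the rest are the search terms.
# With no arguments: read "URL term term ..." lines from STDIN.
grank_demo() {
    local url terms
    if [ $# -eq 0 ]
    then
        while read -r url terms
        do
            echo "url=$url terms=$terms"
        done
    else
        url=$1
        shift
        echo "url=$url terms=$*"
    fi
}

# Placeholder input file: one "URL search terms" pair per line
tmpfile=$(mktemp)
echo "example.com shell process accounting" > "$tmpfile"

grank_demo example.com shell process accounting            # 1. plain arguments
echo "example.com shell process accounting" | grank_demo   # 2. echo through a pipe
cat "$tmpfile" | grank_demo                                # 3. cat a file through a pipe
grank_demo < "$tmpfile"                                    # 4. redirect a file to STDIN

rm -f "$tmpfile"
```

All four calls hand the stand-in the same URL and the same search terms, which is the whole point: pick whichever style fits how you keep your keyword lists.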

And the corresponding Google web page:

[Image: google search results]

One thing that's interesting to note (among the million other things that aren't perfect about this rough draft of a script, like not identifying CAPTCHA and opting to just use random wait times between queries) is that the index numbers get more skewed the deeper into the results you go. I still have to figure out how to compensate for the "Google stagger" (when one result returns a result plus an additional indented result). Check out the following query using "grank", which puts us at positions 482 and 483:

[Image: grank index]

and you can see that the actual index in Internet Explorer is more like 539 and...

[Image: ie index]

Firefox shows the result at 530:

[Image: firefox index]

Still, you get a pretty good idea, from these results, that the search query "shell process accounting" isn't going to bring me any traffic... darn ;)
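For what it's worth, the index numbers grank reports boil down to simple arithmetic: the script asks for 100 results per page, so a hit's index is the page's starting offset plus its position on that page. Here's a minimal sketch of that calculation. The function name rank_from_page is made up for illustration, and it makes no attempt to compensate for the stagger described above.

```shell
#!/bin/bash

# rank_from_page START POSITION
# START is the value of the "start" query parameter (0, 100, 200, ...).
# POSITION is the 1-based position of the hit on that page of 100 results.
# Hypothetical helper for illustration only; it is not part of grank.
rank_from_page() {
    echo $(( $1 + $2 ))
}

rank_from_page 400 82     # the 82nd link on the fifth page: 482
rank_from_page 0 7        # the 7th link on the first page: 7
rank_from_page 900 100    # the last link on the tenth page: 1000
```

The script itself gets to the same number by gluing a page counter and a two-digit position together as text, which is why its early results print with leading zeros.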

I hope you enjoy the script and find some good use for it :)

Cheers,


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!/bin/bash

# grank - find your google rank index
#
# 2008 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

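# Exactly one argument is an error: we need a URL plus at least one search term.
# (No arguments at all means "read the URL and search terms from STDIN".)
if [ $# -eq 1 ]
then
    echo "Usage: $0 URL Search_Term(s)"
    echo "URL with or without http(s)://, ftp://, etc"
    exit 1
fi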

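if [ $# -eq 0 ]
then
    # No arguments: read "URL search terms..." lines from STDIN and
    # re-run this script once per line.
    while read -r x y
    do
        # </dev/null keeps the child from eating the rest of our input
        "$0" "$x" "$y" </dev/null
    done
    exit 0
else
    # First argument is the URL, everything after it is the search terms
    url=$1
    shift
    search_terms=$*
fi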

base=0
num=1
start=0
multiple_search=0
not_found=0

# Join the search terms with "+" to build the query string (no URL encoding is done)
for x in $search_terms
do
    if [ $multiple_search -eq 0 ]
    then
        search_string=$x
        multiple_search=1
    else
        search_string="${search_string}+$x"
    fi
done

echo "Searching For Google Index For $url With Search Terms: $search_terms..."
echo

# Pull the total out of Google's "of about <b>N</b> for" line on the first results page
num_results=$(wget -q --user-agent=Firefox -O - "http://www.google.com/search?q=$search_string&hl=en&safe=off&pwst=1&start=$start&sa=N"|awk '{ if ( $0 ~ /of about <b>.*<\/b> for/ ) print $0 }'|awk -F"of about" '{print $2}'|awk -F"<b>" '{print $2}'|awk -F"</b>" '{print $1}')

while :
do
    # Fetch one page of 100 results, put each result link on its own line,
    # number the links, and keep only the lines that mention our URL.
    wget -q --user-agent=Firefox -O - "http://www.google.com/search?q=$search_string&num=100&hl=en&safe=off&pwst=1&start=$start&sa=N"|sed 's/<a href=\"\([^\"]*\)\" class=l>/\n\1\n/g'|awk -v num=$num -v base=$base '{ if ( $1 ~ /^http/ ) print base,num++,$NF }'|awk '{ if ( $2 < 10 ) print "Google Index Number " $1 "0" $2 " For Page: " $3; else if ( $2 == 100 ) print "Google Index Number " $1+1 "00 For Page: " $3;else print "Google Index Number " $1 $2 " For Page: " $3 }'|grep -i -- "$url"
    if [ $? -eq 0 ]
    then
        # grep found (and printed) our URL, so we're done
        break
    fi

    # Not on this page: move on to the next 100 results
    start=$((start + 100))
    if [ $start -eq 1000 ]
    then
        not_found=1
        break
    fi
    base=$((base + 1))

    # Wait a random 0-54 seconds ($RANDOM is 0-32767) before the next query
    sleep_time=$((RANDOM / 600))
    echo "Not In Top $start Results: Sleeping $sleep_time seconds..."
    sleep $sleep_time
done

if [ $not_found -eq 1 ]
then
    echo "Not Found In First 1,000 Index Results - Google's Hard Limit"
    echo
fi

echo "Out Of Approximately $num_results Results"
echo
exit 0

Mike



