Thursday, January 15, 2009

Bash script to find all of your indexed web pages on Google

Hey there,

Today's Linux and/or Unix bash script is a kind-of add-on for our original bash script to find your Google search index rank from last August (2008 - for those of you who might be reading this in some barren wasteland in a post-apocalyptic future where, for some inexplicable reason, my blog posts are still legible and being transmitted over the airwaves ...or delivered in paper format by renegade-postman Kevin Costner ;)

While the original script would search Google for any link to the URL you specified, based on a search of any number of keywords or keyword phrases you supplied (up to the 1,000 results where Google cuts off), this one checks to see whether your site is indexed by Google, and how heavily. I know that sounds like almost the exact same thing, but it's slightly different, I promise :)

This version (named "gxrank" since the original script was named "grank"; I needed to differentiate between the two in my bin directory and - I'll admit - I wasted not one ounce of imagination on the title ;) will accept a URL you supply and check Google's index for your site. It will then let you know how many pages, and which ones, are in Google's index.

While this might not seem like a valuable script to have, I use it a lot to test how fast I can put sites up for people and get them indexed. For instance, I might put up a site today and do all the standard SEO falderal. Of course, I won't run this script that night, but, usually by the next day, I'll be able to run this script and at least get back 1 result for the base URL. Then, over time, I can run this script (usually once per day, so I can feel impressed ;) and see how many pages on my site have been indexed. This script was made, most specifically, to track high-activity blogs (at least a post a day), but I use it for smaller sites, as well. If I put a site up that has 10 pages, it's nice to know when (or should I say if? ;) those 10 pages get fully indexed.

The usage for the script is fairly simple. You can run it from the Linux or Unix command line like:

host # ./gxrank yourUrl.com

You don't need to include the http:// or any other stuff. It's basically a regular expression match, so you can just include enough of a semi-URL to make sure you get back relevant results. You'll also notice that I only have it printing out the first 100 results, maximum (you can modify this so it doesn't show you anything, if you want - I just like to see it). Google gives strange results when the number of returns is less than the maximum number of results on any given page. For example, it might list 38 index entries and then say that it got that many out of approximately 57 results. They're probably removing similar results but, ultimately, the exact number of index entries isn't all that important, since it fluctuates often.
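If you're curious how the script digs that approximate count out of the results page, here's the same four-stage awk pipeline the script uses, run against a mocked-up sample line (an assumption approximating Google's 2009-era results markup - the real page may differ):

```shell
#!/bin/sh
# Mocked-up sample of the "of about" results line Google returned back then.
# (This exact markup is an assumption for illustration.)
sample='Results <b>1</b> - <b>38</b> of about <b>57</b> from <b>example.com</b>'

# Same awk chain as in the script: match the line, split on "of about",
# then peel off the <b>...</b> wrapper around the count.
count=`echo "$sample" | awk '{ if ( $0 ~ /of about <b>.*<\/b> from/ ) print $0 }' | awk -F"of about" '{print $2}' | awk -F"<b>" '{print $2}' | awk -F"</b>" '{print $1}'`

echo "Out Of Approximately $count Results"
```

Running that prints "Out Of Approximately 57 Results" - the 38-out-of-57 oddity mentioned above in miniature.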

As a "for instance," below is the output of the script when run to check the indexed pages of gotmilk.com (not for any particular reason; they were just the first site I ran across that didn't return more results than would fit on my screen ;)

[Screenshot: gotmilk.com indexed pages on Google]

Hope you enjoy this script and get some good use out of it. Sometimes just seeing your numbers grow can pick you up when you're ready to throw in the towel on your website :)

Cheers,

Mike

IMPORTANT NOTE: Although this warning is on the original Google search rank index page, it bears repeating here and now. If you use wget (as we are in this script), or any CLI web-browsing/webpage-grabbing software, and want to fake the User-Agent, please be careful. Please check this online article regarding the likelihood that you may be sued if you masquerade as Mozilla.
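If you'd rather not masquerade as a browser at all, wget will take any User-Agent string you give it. Something like the following identifies your script honestly (the "gxrank-script" name and the contact address are made-up examples, not anything the script actually uses):

```shell
#!/bin/sh
# Build an honest, identifiable User-Agent string instead of faking Firefox.
# ("gxrank-script" and the mailto address are made-up examples.)
ua="gxrank-script/1.0 (+mailto:you@example.com)"

# The wget invocation would then look like this (echoed here rather than
# run, since actually running it needs network access):
echo wget -q --user-agent="$ua" -O - "http://www.google.com/search?q=site:example.com"
```

Whether a given site tolerates script-identified agents is another matter, but at least nobody can accuse you of pretending to be Mozilla.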

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

#!/bin/bash

#
# gxrank - how many pages does Google have in its index for you?
#
# 2009 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

if [ $# -ne 1 ]
then
echo "Usage: $0 URL"
echo "URL with or without http(s)://, ftp://, etc."
exit 1
fi

url=$1
shift

base=0
start=0
num=0
not_found=0
search_string="site:$url"

echo "Searching For Google Indexed Pages For $url..."
echo

num_results=`wget -q --user-agent=Firefox -O - "http://www.google.com/search?q=$search_string&hl=en&safe=off&pwst=1&start=$start&sa=N" | awk '{ if ( $0 ~ /of about <b>.*<\/b> from/ ) print $0 }' | awk -F"of about" '{print $2}' | awk -F"<b>" '{print $2}' | awk -F"</b>" '{print $1}'`

while :
do
if [ $not_found -eq 1 ]
then
break
fi
wget -q --user-agent=Firefox -O - "http://www.google.com/search?q=$search_string&num=100&hl=en&safe=off&pwst=1&start=$start&sa=N" | sed 's/<a href="\([^"]*\)" class=l>/\n\1\n/g' | awk -v num=$num -v base=$base '{ if ( $1 ~ /^http/ ) print base,num++,$NF }' | awk '{ if ( $2 < 10 ) print "Google Index Number " $1 "0" $2 " For Page: " $3; else if ( $2 == 100 ) print "Google Index Number " $1+1 "00 For Page: " $3; else print "Google Index Number " $1 $2 " For Page: " $3 }' | grep -i "$url"
if [ $? -ne 0 ]
then
not_found=1
fi
break
done

if [ $not_found -eq 1 ]
then
echo "Finished Searching Google Index"
echo
fi

echo "Out Of Approximately $num_results Results"
echo
exit 0
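Since I mentioned running this once per day, a small cron-driven wrapper is the natural fit. Here's a sketch (the script path and log location are assumptions - adjust for your own setup):

```shell
#!/bin/sh
# Sketch of a daily wrapper around gxrank. The path /home/you/bin/gxrank
# and the log location are assumptions. A crontab entry like:
#
#   0 7 * * * /home/you/bin/gxrank-daily
#
# would then give you one dated snapshot per morning.
log="./gxrank.log"
echo "=== `date` ===" >> "$log"
# Uncomment and point at your real copy of the script:
# /home/you/bin/gxrank yourUrl.com >> "$log" 2>&1
```

Over a few weeks, grepping that log for your "Out Of Approximately" lines gives you a rough growth curve of your index coverage.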








