Tuesday, January 6, 2009

Using Wget's User Agent Option Safely On Linux And Unix

Hey there,

Today's post is a follow-up to some questions we've received about our previously posted scripts designed to let you find your search index rank on Google, Yahoo, MSN/Live and Ask.com. If you read any of those posts (or even skimmed through to get to the script at the bottom ;) you've probably noticed the glaring WARNING message that we feel compelled to attach to all of them. The one exception was when we fixed the Yahoo script and forgot to re-include the warning, which we corrected immediately after a helpful reader pointed out the omission ...you understand what I'm getting at better than I can write it ;)

IMPORTANT NOTE: Although this warning is on the original Google search rank index page, it bears repeating here and now. If you use wget (as we are in this script), or any CLI web-browsing/webpage-grabbing software, and want to fake the User-Agent, please be careful. Please check this online article regarding the likelihood that you may be sued if you masquerade as Mozilla.

The above is a valid concern, which is why we include it on so many pages. We don't personally know anyone Mozilla has stomped upon and ground into dust, but they have made it very well known that they are willing to do so. This posed a bit of a conundrum, considering that browsers ranging from Firefox to Opera to every version of Internet Explorer we can remember all have the string "Mozilla" in their "user agent" identifier, or are otherwise linked to this gargantuan beast ;)

This coincidence probably has to do with standards compliance and Internet ethics (some way they can make sure that people who hit a web page with IE get served up the same content as folks who hit that same page with Firefox), which (if you recall) Microsoft has a long and storied history of ignoring, while simultaneously "convincing" the majority of Internet sites and online businesses to "do it their way" by offering simple incentives, like free development tools, as long as the end-user agreed not to support a competitor. They "almost" got in a whole bunch of trouble in the mid-to-late '90s when they started strong-arming ISPs into removing Netscape Navigator (which was, at the time, one of the most popular web browsers in use) from their installation CDs and packaging their startup kits with IE instead. Remember: at that point in time, IE was about the biggest piece of garbage you could conceive of using to trudge your way through the web. Whether it still is remains debatable :)

In any event, as we looked around for a suitable replacement for the current "--user-agent=" setting in our scripts, we started to notice that Firefox and Mozilla were joining forces for such and such, and even browsers like Opera were forming alliances with Mozilla and MS. The question of what user agent is safe to use, per Mozilla's sue-first-ask-questions-later policy, became more and more confusing. We know for a fact that if we just use the default of "Wget/Version-Number," all four search engines will drop us like a bad habit on our first query. Finally, we happened upon a user agent that seemed to have no apparent (or, at least, overt) alliances or connections with Mozilla: Konqueror.

For Lynx fans out there, you can use that, too, but we get mixed results when testing with it since (even though it's been around for as long as I can remember) lots of sites consider it a bot or spider and bump it. Plus, depending upon where you live, you might get in some trouble ;)

And, before you start thinking: "Konqueror? Isn't that the scrappy little POS that KDE uses as a filesystem-viewer/web-browser?", we urge you to consider that the user agent we pick to run wget queries with doesn't need to have a great looking interface. Wget only sends the name along as a string; as long as the browser it names can download a web page, from wget's perspective, Konqueror is exactly equal to the finest and most graphically-beautiful web-renderers out there. Basically, it can process HTML and doesn't appear to be on the sh##-list with Mozilla ;)

Below are a couple of suggestions for replacing the user-agent portion of our Google, Yahoo, MSN/Live and Ask.com search index ranking scripts so that you can stay out of trouble and keep tabs on your SEO (or whatever other things you may be doing that are none of our business whatsoever ;).

We ask that you make these changes for your own personal security and to keep history from showing again and again how nature points out the folly of men: Mozilla! ;)

1. Simply replace "--user-agent=Firefox" with "--user-agent=Konqueror"

2. Set this option up in your .wgetrc so you never have to think about it again and can, consequently, shorten the length of all four of these scripts:

user_agent = Konqueror

3. Or, alternately, set it as an alias in your environment:

host # alias wget='wget --user-agent=Konqueror'
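One caveat on the alias route: bash does not expand aliases in non-interactive scripts by default, and aliases are not passed down to scripts you launch, so option 3 mostly helps at the prompt. Inside a script, a small wrapper function is a sturdier way to get the same effect. The following is a minimal sketch for bash; the function name fetch and the variable UA are our own made-up names, not anything wget provides:

```shell
# Assumed user agent string, taken from the suggestions above.
UA="Konqueror"

# fetch is a hypothetical helper name, not part of wget.
# It prints the page to standard output; any extra wget options
# and the URL are passed through untouched via "$@".
fetch() {
    wget --user-agent="$UA" -q -O - "$@"
}

# Example call (placeholder URL):
# fetch "http://www.example.com/" > page.html
```

Changing the user agent later then means editing one line at the top of the script rather than every wget call.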

Of course, for all of these, if you still have issues with getting dumped too early by the search engines (we put in the random 0-60 second wait time between queries to try and not upset them too much), you can try using these other options with wget. None of these come with any warning attached that we're aware of:

--referer="http://www.pick-a-site.com" <-- This can come in handy if you're dealing with a site that will give you a different page if you go directly to the URL, rather than going to it through a specific referring URL.

--header="Accept-Encoding: gzip,deflate" <-- Don't use this unless you're sure that you can handle compressed data. If you start getting garbage back, take this option out first. This option should work fine on most Linux and Unix distros right out of the box.

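--header="Keep-Alive: 300" <-- Not really necessary, but might help you when a search engine (like, say, Yahoo) tends to notice multiple single-shot queries from the same IP. The idea is to establish some semblance of a session, with multiple requests going through one sustained connection. Keep in mind that wget can only reuse a connection within a single invocation (for example, when you hand it several URLs on the same host at once), so separate wget runs each open their own connection regardless. It probably won't fool them for long. Just as long as it fools them long enough ;)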
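If you do ask for compressed pages, the "garbage" mentioned above is usually just gzip data that nothing has unpacked yet. Here is a rough bash sketch of one way to cope: a filter that unpacks its input when it looks like gzip and passes it through untouched otherwise. The name maybe_gunzip is made up for this example, and it only deals with gzip, not raw deflate, so you may want to request "Accept-Encoding: gzip" alone if you use it:

```shell
# maybe_gunzip is a hypothetical helper name.
# It reads standard input, decompresses it if it is valid gzip data,
# and otherwise writes it out unchanged.
maybe_gunzip() {
    local tmp
    tmp="$(mktemp)" || return 1
    cat > "$tmp"                      # stash the input so it can be read twice
    if gzip -t < "$tmp" 2>/dev/null; then
        gzip -dc < "$tmp"             # valid gzip: unpack it
    else
        cat "$tmp"                    # anything else: pass it through as-is
    fi
    rm -f "$tmp"
}

# Example call (placeholder URL):
# wget -q -O - --header="Accept-Encoding: gzip" "http://www.example.com/" | maybe_gunzip
```

Since the filter leaves plain text alone, it should be harmless to keep in the pipeline even when a server ignores the header and sends an uncompressed page.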

Here are those options as they would appear in your .wgetrc:

referer = http://www.pick-a-site.com
header = Accept-Encoding: gzip,deflate
header = Keep-Alive: 300

and, in an alias, which is exactly like typing the command, except you shouldn't ever have to do it again for your entire login session (or for even longer, if you include it in the proper login initialization file, like .profile or .bashrc):

alias wget='wget --referer="http://www.pick-a-site.com" --header="Accept-Encoding: gzip,deflate" --header="Keep-Alive: 300"'
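For use inside a script, the same options can live in a bash array, which keeps the quoting intact, next to a random pause between queries. This is only a sketch under our own assumptions: WGET_OPTS and random_wait are made-up names, the URLs are placeholders, and the pause is one possible way to get a 0-60 second wait, not necessarily how the rank scripts do it:

```shell
# Options from this post, one array element per option.
WGET_OPTS=(
    --user-agent="Konqueror"
    --referer="http://www.pick-a-site.com"
    --header="Keep-Alive: 300"
)

# random_wait is a hypothetical helper: prints a whole number from 0 to 60.
# $RANDOM is a bash built-in that yields 0-32767 on each read.
random_wait() {
    echo $(( RANDOM % 61 ))
}

# Example loop (placeholder URLs), left commented out:
# for url in "http://www.example.com/a" "http://www.example.com/b"; do
#     wget -q -O - "${WGET_OPTS[@]}" "$url"
#     sleep "$(random_wait)"
# done
```

The Accept-Encoding header is left out here on purpose, so the pages come back uncompressed unless you add it yourself.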

We'll be sure and post updated data as we get a chance to test and see how long we can keep Yahoo going before they bounce us (if it hasn't already been reduced to 1 hit by a Yahoo employee who can't stand this blog ;)


, Mike

