Showing posts with label php.

Thursday, April 23, 2009

Beginning Modifications To Our Internet Mass Downloader For Linux And Unix

Hey there,

Today, we've got a little update (in need of some more updating) for those of you who like to scrape the web for pictures using our mass URL downloading Perl script, or your own variation thereof. It should run on virtually any version or distro of Linux or Unix with Perl installed; if not, it should hopefully only need minor modification.

NOTE: This update only addresses about 80% of the problems we've encountered. We'll post an update as soon as we figure out how to get around those ornery remaining 20% ;)

It seems that the folks at imagevenue (and other multimedia holding-tanks) have gotten around to changing the way they do their PHP redirects. Those are the annoying little scripts that open up a new window when you click on a hyperlink and then redirect you to another location, which either contains the picture (or whatever) you want to download or (in some extreme cases) contains even more redirection. Of course, we don't blame them. What we're doing here, by breaking through all that nonsense to try and automate it in a Perl script, isn't unethical, but we understand that it might be a pain in the arse ;) And I'm sure (from what I see on download.com from time to time) that we're probably in the minority of people putting the hurt on them (and God bless them for still sticking around :)

This update is being presented in the form of a patch. If you need any help applying it, check out this old post on using patch the easy way or, if you're familiar with "patch," just follow the simple prompts below to apply the attached patch (created using "diff -c"). We've also included the same "dUpeDL" script that the Perl script calls. It's based on the findDupeFiles script by Cameron Hayne (macdev@hayne.net), with full attribution and the original liner notes included in the header of that fantastic "MD5 checksum + Size" duplicate checker.

In order to update your old version of "dUrl" (Check the above link if you need to download the latest version of the source), just download the original version (also, check out this post for some ideas about how to creatively download scripts from this blog; they sometimes cut and paste out as one continuous line!) and do the following (We're assuming your original script is called "dUrl" and our patch is called "dUrl.patch"):

host # cp dUrl dUrl.bak
host # wc -l *
325 dUrl
130 dUrl.patch
325 dUrl.bak
host # patch -p0 dUrl dUrl.patch
patching file dUrl

host # wc -l *
335 dUrl
325 dUrl.bak
130 dUrl.patch


Check the above link, too, for the easy way to back out the patch if you don't care for the mods. Also, once you're done, be sure to change all the "/home/mgolvach.." or "/users/..." paths that call the dUpeDL script to wherever you have that script located on your machine :)
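And, if you'd rather not chase the link, backing the patch out is just a matter of patch's reverse flag (assuming the same filenames as above):

host # patch -R -p0 dUrl dUrl.patch
patching file dUrl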

Cheers,


Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License


Begin Patch

*** dUrl Wed Apr 22 20:12:33 2009
--- dUrl.new Wed Apr 22 20:16:17 2009
***************
*** 1,7 ****
#!/usr/local/bin/perl

#
! # 2007 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001a
#
# <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/">Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License</a>
#
--- 1,7 ----
#!/usr/local/bin/perl

#
! # 2009 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001b
#
# <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/">Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License</a>
#
***************
*** 189,194 ****
--- 189,196 ----
foreach $multifile_entry (@multi_file) {
@dl_list=();
print "-------------------- FILE $counter ------------------------\n";
+ $phpcounter=1;
+ $phpurl = $dl_req;
$url="$multifile_entry";
if ( $url !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
***************
*** 237,244 ****
}
}
if ( $dl_req =~ /php\?/ ) {
! $dl_req =~ s/\&/\\&/g;
! system("wget -q $dl_req");
} else {
system("wget -q $dl_req");
}
--- 239,255 ----
}
}
if ( $dl_req =~ /php\?/ ) {
! if ( $dl_req =~ /img.php/ ) {
! $phpender = $dl_req;
! $phpstarter = $dl_req;
! $phpstarter =~ s/^(http:\/\/[^\/]*\/).*$/$1/;
! $phpender =~ s/^.*image=(.*)$/$1/;
! $phpcontent = "${phpstarter}$phpender";
! system("wget -q $phpcontent");
! } else {
! $dl_req =~ s/\&/\\&/g;
! system("wget -q $dl_req");
! }
} else {
system("wget -q $dl_req");
}
***************
*** 251,268 ****
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
! print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
! if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
! $php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;
!
$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
--- 262,278 ----
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
! print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
! if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
! $php_tail =~ s/.*SRC=\"([^\"]*)\".*/$1/;
$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
***************
*** 276,282 ****
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
! $php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
--- 286,292 ----
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
! $php_real_url =~ s/.*url=(http.*?\.[jgp][pin][gf]).*/$1/;
}
}
}
***************
*** 309,315 ****
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
! system("/users/mgolvach/bin/dUpeDL");
chdir("$this_dir");
$counter++;
}
--- 319,325 ----
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
! system("/export/home/users/dUpeDL");
chdir("$this_dir");
$counter++;
}



End Patch

---- dUpeDL - Based almost entirely on the findDupeFiles script by Cameron Hayne (macdev@hayne.net)

#!/usr/local/bin/perl

#
# dUpeDL - Based on the following script - only slightly modified to work with dURL
# Below: The original liner notes for full attribution to the original author.
#
# findDupeFiles:
# This script attempts to identify which files might be duplicates.
# It searches specified directories for files with a given suffix
# and reports on files that have the same MD5 digest.
# The suffix or suffixes to be searched for are specified by the first
# command-line argument - each suffix separated from the next by a vertical bar.
# The subsequent command-line arguments specify the directories to be searched.
# If no directories are specified on the command-line,
# it searches the current directory.
# Files whose names start with "._" are ignored.
#
# Cameron Hayne (macdev@hayne.net) January 2006 (revised March 2006)
#
#
# Examples of use:
# ----------------
# findDupeFiles '.aif|.aiff' AAA BBB CCC
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the directories AAA, BBB, and CCC
#
# findDupeFiles '.aif|.aiff'
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the current directory
#
# findDupeFiles '' AAA BBB CCC
# would look for duplicates among all the files (no matter what suffix)
# under the directories AAA, BBB, and CCC
#
# findDupeFiles
# would look for duplicates among all the files (no matter what suffix)
# under the current directory
# -----------------------------------------------------------------------------

use strict;
use warnings;

use File::Find;
use File::stat;
use Digest::MD5;
use Fcntl;

#REMOVE WHEN WE MERGE - UNNECESSARY
my $debug=0;

my $matchSomeSuffix;
if (defined($ARGV[0])) {
my @suffixes = split(/\|/, $ARGV[0]);
if (scalar(@suffixes) > 0) {
my $matchExpr = join('||', map {"m/\$suffixes[$_]\$/io"} 0..$#suffixes);
$matchSomeSuffix = eval "sub {$matchExpr}";
}
shift @ARGV;
}

my @searchDirs = @ARGV ? @ARGV : ".";
foreach my $dir (@searchDirs) {
die "\"$dir\" is not a directory\n" unless -d "$dir";
}
my %filesByDataLength;

sub calcMd5($) {

my ($filename) = @_;
if (-d $filename) {
return "unsupported";
}
sysopen(FILE, $filename, O_RDONLY) or die "Unable to open file \"$filename\": $!\n";
binmode(FILE);
my $md5 = Digest::MD5->new->addfile(*FILE)->hexdigest;
close(FILE);
return $md5;
}

sub hashByMd5($) {

my ($fileInfoListRef) = @_;
my %filesByMd5;
foreach my $fileInfo (@{$fileInfoListRef}) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $md5 = calcMd5("$dirname/$filename");
push(@{$filesByMd5{$md5}}, $fileInfo);
}
return \%filesByMd5;
}

sub checkFile() {

return unless -f $_;
my $filename = $_;
my $dirname = $File::Find::dir;
return if $filename =~ /^\._/;
if (defined($matchSomeSuffix)) {
return unless &$matchSomeSuffix;
}
my $statInfo = stat($filename) or warn "Can't stat file \"$dirname/$filename\": $!\n" and return;
my $size = $statInfo->size;
my $fileInfo = { 'dirname' => $dirname,
'filename' => $filename,
};
push(@{$filesByDataLength{$size}}, $fileInfo);
}

MAIN: {

find(\&checkFile, @searchDirs);
my $numDupes = 0;
my $numDupeBytes = 0;
if ( $debug ) {
print "Dupe Checking\n";
} else {
print "Dupe Checking - ";
}
foreach my $size (sort {$b<=>$a} keys %filesByDataLength) {
my $numSameSize = scalar(@{$filesByDataLength{$size}});
next unless $numSameSize > 1;
if ( $debug ) {
print "size: $size numSameSize: $numSameSize\n";
}
my $filesByMd5Ref = hashByMd5($filesByDataLength{$size});
my %filesByMd5 = %{$filesByMd5Ref};
foreach my $md5 (keys %filesByMd5) {
my @sameMd5List = @{$filesByMd5{$md5}};
my $numSameMd5 = scalar(@sameMd5List);
next unless $numSameMd5 > 1;
my $rsrcMd5;
my $dupe_counter=0;
foreach my $fileInfo (@sameMd5List) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $filepath = "$dirname/$filename";
if ( $dupe_counter == 0 ) {
if ( $debug ) {
print "KEEPING $filepath - MD5 $md5\n";
}
$dupe_counter++;
} else {
if ( $debug ) {
print "DELETING $filepath - MD5 $md5\n";
} else {
print "D";
}
unlink("$filepath");
}
}
if ( $debug) {
print "----------\n";
}
$numDupes += ($numSameMd5 - 1);
$numDupeBytes += ($size * ($numSameMd5 - 1));
}
}
print "----------\n";
my $numDupeMegabytes = sprintf("%.1f", $numDupeBytes / (1024 * 1024));
print "Number of duplicate files: $numDupes\n";
print "Estimated Mb Savings: $numDupeMegabytes\n";
}


, Mike








Thursday, February 26, 2009

Simple Site Redirection On Apache For Linux Or Unix

Hey there,

Today we're going to look at simple ways you can do site redirection using a few basic methods for Apache on Linux or Unix. I'd presume that the functionality works exactly the same for Apache on Windows, but that would (to misquote a cliche) make a "pres" out of "u" and "me" - yeah, the "assume" version of that plays much nicer ;) Tomorrow, we're going to follow up with more advanced ways to do site and subdomain redirects using "mod_rewrite".

Site redirection, in and of itself, is fairly easy to do with just straight-up HTML or PHP, but it has its limitations. For instance, if you wanted everyone who visited www.myxyz.com to be redirected to www.myotherxyz.com, all you'd have to do would be to add some simple code to your index.htm(l) or index.php file. In HTML, you could just add this line in the <head> section:

<meta http-equiv="refresh" content="30;url=http://myotherxyz.com/">

This would allow thirty seconds to pass before the redirect was called. You might want to make that shorter or longer, depending upon how much interactivity you want your visitor to have before being redirected. You may want to let them use the old site if they want and redirect them only if they don't click on a link within 30 seconds, or something like that. If you really don't want to give them any choice, try setting the time (the first part of the "content" attribute) to 0. Although this technically redirects immediately, a slow page load (someone's connection, the server hosting the website, etc.) could still cause the user to see the original page before it refreshes to the new site. If you're only using a page for this purpose, you should leave the body blank or give it a background image similar to your new site's.
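So, for instance, the "no choice" version just drops the delay to zero:

<meta http-equiv="refresh" content="0;url=http://myotherxyz.com/">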

PHP is a bit faster, although still subject to the limitations of the interacting components that affect the HTML redirect (although not to the same degree). In PHP, you could do the same thing with this little chunk of code (note that the header() calls have to come before any other output is sent):

<?php
header( "HTTP/1.1 301 Moved Permanently" );
header( "Location: http://www.myotherxyz.com/" );
exit;
?>


The most efficient method (of the three we're going run through here today) is to use htaccess. This point will be expanded upon to a good degree in tomorrow's post, since the combination of htaccess and "mod rewrite" can allow you to do a lot more than just standard redirection. Using htaccess, you could create a file called, strangely enough, .htaccess in your site's main directory (or root or home directory, or base directory or whatever you prefer to call it) that contained the same directions as above, although with yet another syntax. A simple 301 site redirect in .htaccess can be written this simply (the entire file from top to bottom):

redirect 301 / http://www.myotherxyz.com/

It just doesn't get any simpler than that :) For the htaccess method, since the arguments (old-location new-URL) are separated by spaces, you need to double quote any addresses that themselves include spaces. Simple enough:

redirect 301 "/my old index.html" http://myotherxyz.com/

And that's all there is to it! For a little bit of trivia, you may have noted that we ended all of our redirect destinations with a closing / ( www.myotherxyz.com/ as opposed to www.myotherxyz.com ). Ending the address that way is actually still standard protocol, and you can run a test yourself right now to prove it out. When you send a standard web server a URI request for, say, http://www.google.com (for instance), that actually generates a 301 site redirect and plops you at http://www.google.com/, which is the properly formatted address. It happens very quickly and, of course, only counts when you're requesting a directory. If you specify a file, like http://www.google.com/index.html, you'll either get the page you were expecting or some other error (perhaps even a 301). It should be noted, though, that some web servers don't do this in all instances anymore (http://www.google.com/news, for example). Still, being a dinosaur, I prefer to stick to the way that should always work.
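For reference, that hop is nothing more than a status line and a Location header. Schematically (this isn't a capture from a real host - just the two lines that matter), the response to a slash-less directory request looks like:

HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/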

Give it a shot. It might be more fun than pick-up-sticks ;) If you find a site that does it, you can see the redirect in action using simple Telnet. Take this example using http://abc.go.com/daytime (although this page actually does a 302 redirect - for temporary redirection - right now, and takes you to a specific index page). I grew tired of trying to find sites that use 301 redirection on sub-directories, and you can't telnet to a server and not request a page unless you want that 404 ("GET /" works, but proving the redirect by doing "GET " doesn't ;)

$ telnet abc.go.com 80
Trying 198.105.194.116...
Connected to abc.go.com.
Escape character is '^]'.
GET /daytime HTTP/1.0

HTTP/1.1 302 Found
Connection: close
Date: Thu, 26 Feb 2009 00:19:27 GMT
Location: /daytime/index?pn=index
Server: Microsoft-IIS/6.0
P3P: CP="CAO DSP COR CURa ADMa DEVa TAIa PSAa PSDa IVAi IVDi CONi
OUR SAMo OTRo BUS PHY ONL UNI PUR COM NAV INT DEM CNT STA PRE"
From: abc07
Set-Cookie: SWID=54AE3DC5-FB42-4F6C-8B10-82E506FFD442; path=/;
expires=Thu, 26-Feb-2029 00:19:27 GMT; domain=.go.com;
Content-Length: 163
X-UA-Compatible: IE=EmulateIE7

<HTML><HEAD><TITLE>Moved Temporarily</TITLE></HEAD><BODY>This document
has moved to <A HREF="/daytime/index?pn=index
">/daytime/index?pn=index
</A>.<BODY></HTML>Connection closed by foreign host.


See you tomorrow for some more-convoluted stuff.

Cheers,

, Mike








Tuesday, December 30, 2008

Unix And Linux Easter Eggs For The Wrong Holiday

Hey there,

Today, since it's just past Christmas and almost New Year's, I figured this would be a great time to trot out some Linux and/or Unix Easter eggs. Actually, it doesn't make sense at all, but if you can put aside your burnt-in sense of the chronological order of the holidays, these can still be fun ;)

I found all of the Easter Eggs for today at a site with the very strange name Eeggs.com. I don't know what an eegg is, and I'm not sure that I want to know, but they have a great collection of Easter Eggs for all manner of OSes ;) I spent most of my time in their Linux section, but you could spend hours on other sections of their site and only occasionally be reminded that you're still at work. Of course, in all seriousness, if you're at work, the thought of driving home as soon as possible is keeping you aware of your location at all times ;)

The following are a few of the cooler ones I ran across (AND could personally verify). If you get a chance, drop by Eeggs.com and submit a support email asking why "eegg" isn't in the dictionary when "ain't" is ;)

1. Fun with PHP. This has worked with every site I've tested it against. The key here is just to find a php-enabled site and navigate to a php page. Then, all you need to do is append a short query string to the URL in your browser's address bar to find these four gems.

For a working example, we'll look at linuxandunixupdates.com's index.php page. Using that URL, we can add the following four strings and get the following four easter eggs. You can click on the link above and add the strings manually, or just click on any of the links below. I've also included a picture of the outcome of running those commands below each "magic string," just in case you're worried that I might be luring you into clicking on a redirected link or something else I don't have the time to invest in doing properly right now ;) You should be able to replicate this on any php page on any site anywhere. I haven't been able to fully test the veracity of that claim, but it appears to be true so far!
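For instance, pasting the first magic string below onto the end of our example page's URL gives you:

http://linuxandunixupdates.com/index.php?=PHPE9568F34-D428-11d2-A769-00AA001ACF42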

a. Add ?=PHPE9568F34-D428-11d2-A769-00AA001ACF42 to the end of your URL to see this picture:

[Image: the PHP logo]

b. Add ?=PHPE9568F35-D428-11d2-A769-00AA001ACF42 to the end of your URL to see this picture:

[Image: the Zend Engine 2 logo]

c. Add ?=PHPE9568F36-D428-11d2-A769-00AA001ACF42 to the end of your URL to see this picture:

[Image: the squiggly PHP logo]

d. Add ?=PHPB8B5F2A0-3C92-11d3-A3A9-4C7B08C10000 to the end of your URL to see the PHP Credits. This page looks exactly like the standard info.php page, but lists all the developers who worked on each component. I haven't included it here because it's incredibly long and there are more Easter Eggs to get to before we all forget why we're here :)

2. MAGIC reboot times in the Linux Kernel. This one is interesting, and a bit of a puzzle, since the original entry only gives the answer for the first time (they're all significant to Linux in some way). In any event, you can find these times by looking in /usr/include/linux/*.h and grepping for LINUX_REBOOT_MAGIC. As you can see below, in our includes they're all in reboot.h:

host # grep LINUX_REBOOT_MAGIC /usr/include/linux/*.h
/usr/include/linux/reboot.h:#define LINUX_REBOOT_MAGIC1 0xfee1dead
/usr/include/linux/reboot.h:#define LINUX_REBOOT_MAGIC2 672274793
/usr/include/linux/reboot.h:#define LINUX_REBOOT_MAGIC2A 85072278
/usr/include/linux/reboot.h:#define LINUX_REBOOT_MAGIC2B 369367448
/usr/include/linux/reboot.h:#define LINUX_REBOOT_MAGIC2C 537993216


MAGIC2 (as well as MAGIC2A, B and C) is where you'll find the Easter Egg. If you take any of those values and convert them into regular time (using Perl, for instance), each resolves to a date of some significance in Linux history.

host # perl -e 'print localtime(672274793). "\n";'
Sun Apr 21 17:59:53 1991
host # perl -e 'print localtime(85072278). "\n";'
Mon Sep 11 10:11:18 1972
host # perl -e 'print localtime(369367448). "\n";'
Mon Sep 14 21:04:08 1981
host # perl -e 'print localtime(537993216). "\n";'
Sun Jan 18 12:33:36 1987


Sun Apr 21 17:59:53 1991 is supposedly the date Linus Torvalds first began writing Linux. (I'm not using the word "supposedly" to cast any more doubt than any reasonable human being would have. I'm not sure if it's true, so I can only "suppose" that the folks who submitted these Easter Eggs aren't just prepping a new Wikipedia page. Just kidding, of course. Everything in Wikipedia is true ;) The rest is left up to us to figure out. Something tells me the answers are all somewhere in this Linux Online Timeline.

3. And lastly, so there's plenty more left for you to check out at Eeggs.com, I really enjoyed this last one (actually, there were a few others along the same lines that I'm dying to try, but I don't have the proper OSes to validate them right now), since I'm a "huge" fan of Douglas Adams, even beyond the HitchHiker's series. (Although lots and lots of people got really upset over Mostly Harmless, when he chose to wrap up the HitchHiker's Trilogy - with the 5th book in the series - in a manner that, apparently, was extremely dissatisfying to ardent fans. I don't begrudge them their opinions. I dug it.) I'm only sorry that he passed away and that we'll never know whether The Salmon of Doubt was going to be the sixth HitchHiker's book (answering the fans' complaints, at worst) or the next Dirk Gently novel.

Back to planet earth ;) If you open up vim, and type the following:

host # vim
[esc]:help 42


with [esc]: being the actual "escape" or "esc" key, followed by the colon (:).

You'll, sadly, not get an explanation of the answer to the meaning of life, the universe and everything, but the payoff's just as pleasant :)

What is the meaning of life, the universe and everything? *42*
Douglas Adams, the only person who knew what this question really was about is
now dead, unfortunately. So now you might wonder what the meaning of death
is...

==============================================================================

Next chapter: |usr_43.txt| Using filetypes
...


Hope you all enjoyed those Easter Eggs and, should you decide to look for more, happy hunting :)

Cheers,

, Mike





Thursday, August 7, 2008

Programming PHP - Implicit OO In Error Catching Structure

Hey There,

Today, I'm proud to be putting up my first "guest post" and would like to thank Herschel Cohen for his outstanding contribution!

BTW, this page looks a lot better when it's not trapped in a blog post ;)

Without further ado, but some padding to keep the title from wrapping around...

Here's to your enjoyment :)








Programming PHP: Implicit OO Code in Error Catching Structure



Learning from Errors



As I have written on other occasions, you can learn more from seeing the errors others commit, whatever their cause, than from simply being shown how to do it correctly. It might simply be the relief of not finding oneself trapped in the same predicament that makes the lesson more likely to stick. This time it was the try / catch code syntax that trapped me. It seemed too transparent and too easily understood, which resulted in my missing a critical aspect. That is today's topic of discussion.



I will make no attempt to explain my reasoning; it's just too unique to my background and to my warped mind. Nonetheless, I am sure others could find themselves in the same predicament, so I will focus on how I was fooled and my circuitous route towards enlightenment. See this, too, as my filling in the holes I left in my last article on catching the missing file error, where I used the aforementioned syntax.



Impression of try {} catch {} Code Syntax



When I was on a programming contract [1.], I took part in Java code reviews, and I had the impression I could read and understand [2.] Java code. I was attracted to the try / catch code syntax, which appeared quite straightforward. That is, you place the code you wish to run within the try section and, if it fails, execution is automatically shifted to the catch section. There, given enough specified exceptions, the possibility exists to save the program from an ignominious crash. Moreover, in Java [3.], I thought that the ending "finally" section could allow full recovery. Nonetheless, many of my perceptions were fallacious.



Mistaken Impressions - How it Started



My mistakes began early, with the fairly complete descriptions of encountered errors that popped up while I was testing methods to display missing menu names. In the attempt to redirect the error to a separate page, I mentioned seeing an ephemeral error message that was fired well within the template, at the readfile() line for the menu listing. Moreover, later, when I was explicitly testing for the missing file error, it was painted onto the page in the location where the menu listing should have appeared:



Figure 1. Automated Error Message


If you look closely at the content of the message above, it appears nearly identical to Example #3, The Built-in Exception class, in the PHP documentation. So, while I knew an instance of the class had to be present, the error messages seemed to confirm that the exception object was both present and functioning. Nonetheless, my try / catch code was not working properly. Below, I am repeating the outline of the code I used in other articles. Here is the section where the menu listing is read into the template:



   <div id="central-col-bst">
<?php
if ($if_error==false) { // adding new error catching code 5/31/2008

try {
readfile($menus . "/menu-list-".$get_it.".txt");

} catch (move_onException $e) {

$if_error = true;

// only action flips the variable
}
} else {
$gohome = $_SERVER['DOCUMENT_ROOT']."/page-content/navigation";
readfile($menus."/messages/menu-not-known.txt");
readfile($gohome."/home-button.txt");

}

if ($if_error==true) {
// This is where the email is sent when the
// text file is missing, then the error
// message is painted where the menu
// listing would have fallen.
} else {
echo "\n\n Code has run by email";
}

?>

</div> <!-- End of central Product News column -->

Listing 1 Code Structure to Catch readfile() errors


So I inserted additional code within the try section to determine why my code was failing:



   <div id="central-col-bst">
<?php
if ($if_error==false) { // adding new error catching code 5/31/2008

try {
readfile($menus . "/menu-list-".$get_it.".txt");
$level = error_reporting();
$val = error_reporting(E_WARNING);
print "Error value $level <br />";
print "Warning error value is $val <br />";


} catch (move_onException $e) {

$if_error = true;

// only action flips the variable
}
...

Listing 2 Modified Code in try section


Now look at the output when this version was run:



Figure 2. Exception Object Not Functioning


Note the last line, "Code has run by email" indicating the variable was not flipped to true. Thus, the catch part of the code was never entered. Something is badly amiss with the object that is firing.



Java Explanation



Once I began to read [4.] about the object model of Java, it became clear that the object that was live and printed the messages was an instance of the Error class, not the Exception(s). Below, I have a graphic that is a modified version of the one in the book (cited in footnote 4). The Error class object was the one posting the error messages. The code failed to fire on the catch section because no instance of the Exception object was available for use:



Figure 3. Java Object for Errors & Exceptions


Therefore, object oriented code is clearly an implicit part of the try / catch syntax - something that might not be immediately apparent if the documentation has not been studied.



Revised Error Catching Code



I am going to do little other than exhibit a simpler version of the code I showed in the last article. This time I will not use a subclass of the Exception class; hence, there is no need to store the class definition on the page. Creating an instance of the Exception class suffices, because I am not using it for any reason other than reaching the catch section of the code. Here is the skeleton of the code, which enhances Listing 1 slightly:



   <div id="central-col-bst">
<?php
if ($if_error==false) { // adding new error catching code 5/31/2008

try {
$file_exists = file($menus . "/menu-list-".$get_it.".txt");
if ($file_exists === FALSE) {
throw new Exception("Menu listing file is missing");
} else {
readfile($menus . "/menu-list-".$get_it.".txt");
}
} catch (Exception $e) {

$if_error = true;

// only action flips the variable
}
} else {
$gohome = $_SERVER['DOCUMENT_ROOT']."/page-content/navigation";
readfile($menus."/messages/menu-not-known.txt");
readfile($gohome."/home-button.txt");

}

if ($if_error==true) {
// This is where the email is sent when the
// text file is missing, then the error
// message is painted where the menu
// listing would have fallen.
} else {
echo "\n\n Code has run by email";
}

?>

</div> <!-- End of central Product News column -->

Listing 3 Working Code Structure to Catch Missing File Errors


Extending Exception Object Model



I mentioned my dissatisfaction with the increasing volume of code, where similar processes that differ slightly appear in multiple locations on the page template. I have not tested these ideas out; however, it is not a great stretch to use a subclass of the Exception class to contain new methods that email differing messages to the webmaster, based upon a variable determined by the error type. The same might work for a customized error message that is painted on the page. For those that like to compress their code, the try / catch syntax and the printing of either the menu listing or the error messages could be buried within a method in a new subclass. However, if you go that route, be prepared to document your intentions so that anyone who follows can maintain your site in your absence.
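Just to make that concrete, here is a rough, untested sketch of the sort of subclass I have in mind (the method names, error-type strings and webmaster address are all invented for illustration):

<?php
// Hypothetical sketch only - the method names, error types and the
// webmaster address below are illustrative, not from a real site.
class move_onException extends Exception {

    // Email a message to the webmaster that varies with the error type
    public function notifyWebmaster($error_type) {
        if ($error_type == "missing_file") {
            $message = "Menu listing file is missing: " . $this->getMessage();
        } else {
            $message = "Unexpected template error: " . $this->getMessage();
        }
        mail("webmaster@example.com", "Template error ($error_type)", $message);
    }

    // Paint a customized error message where the menu listing would have fallen
    public function paintError($error_type) {
        echo "<p>Sorry - the menu listing is temporarily unavailable.</p>";
    }
}
?>

If you went this route, the throw in Listing 3 would become "throw new move_onException(...)", and the catch section could simply call $e->notifyWebmaster("missing_file") and $e->paintError("missing_file") instead of carrying that logic inline.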



Summary



Once the object model operating characteristics are understood as part of the try / catch exception handling, the syntax is indeed relatively straightforward. Moreover, multiple exceptions can be tested for, with all of them residing in a common subclass that is used repeatedly throughout an application. Such use should deflate the overall size of the application's code and increase its reliability.



For corrections, suggested extensions or comments, write: H. Cohen. If the mailto does not work, use this: hcohen[-At-]bst-softwaredevs.com.



     © Herschel Cohen, All Rights Reserved



Return B/ST Home or Dynamic Menu Page


____________________________________________________________________

1. Mostly writing stored procedures for Sybase on Unix and
associated financial reporting.

2. I had not written a line of Java code.

3. Not part of the PHP syntax.

4. See the same numbered footnote in my last article.
Chapter 11 in "Core Java 2; Volume 1 - Fundamentals" has a
version of the graphic I display in the text. Error
objects have an existing instance and fire when needed to
stop the program. The Exception object has to be created.





















, Mike

Tuesday, December 25, 2007

Website and URL Downloading Using ActiveState Perl for Windows

Merry Christmas :) Again, my apologies to those of you who don't observe the holiday.

Today's post (inlined scripts aside) will be nice and short. I'll leave it to you to refer to yesterday's post if you want more information, or commentary, regarding the basics of this script and how to use it, as it's almost exactly the same.

These are special versions of the dUrl and dUpeDL scripts that I've ported to use ActiveState's Perl for Windows. This comes in handy when you want to do some massive downloading and don't have access to a Linux or Solaris box.

Note that "wget," the one external requirement of the original script, is also needed for this version of the script to work. You can download it for free at http://www.christopherlewis.com/WGet/default.htm. Also, again, note that you should modify the lines in dUrl that contain the dUpeDL script reference, as they indicate a fictitious location for it!

Enjoy the day off of work, the company of your family and/or friends and, hopefully, some much deserved peace and rest.

Cheers!


Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!c:\perl\bin\perl

#
# 2007 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001a
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

use LWP::Simple;

if ( $#ARGV < 0 ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}

$debug=0;
$multi=0;
$counter=1;
# simple for now - better save-system later, maybe...
# also, we'll make the shared download system a function

if ( $ARGV[0] eq "-f" ) {
if ( ! -f $ARGV[1] ) {
print "Can't find URL file $ARGV[1]!\n";
exit(2);
}
$multi_file=$ARGV[1];
$multi=1;
chomp($download_dir="$ARGV[1]");
$download_dir =~ s/\//_/g;
$download_dir =~ s/\\/_/g;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
} else {
chomp($download_dir="$ARGV[0]");
if ( $download_dir !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
$download_dir =~ s/.*\/\/([^\/]*).*/$1/;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
}

if ( $multi == 0 ) {
@dl_list=();
$url="@ARGV";
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "No point in continuing...\n";
print "Moving $download_dir to ${download_dir}.empty...\n";
rename("$download_dir", "${download_dir}.empty");
exit(4);
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`cd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
# Work that bastard extra hard if it's a PHP Trick-Link
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget.exe -q $dl_req");
} else {
# We need wget.exe because the Simple GET can't follow trails
system("wget.exe -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
# PHP links are a pain -
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`dir /B *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget.exe -q $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("c:\\docume~1\\user\\desktop\\dUpeDL.pl");
chdir("$this_dir");
} elsif ( $multi == 1 ) {
open(MULTIFILE, "<$multi_file");
@multi_file = <MULTIFILE>;
close(MULTIFILE);
print "------------------- MULTIFILE MODE ------------------------\n";
foreach $multifile_entry (@multi_file) {
@dl_list=();
print "-------------------- FILE $counter ------------------------\n";
$url="$multifile_entry";
if ( $url !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "Trying next FILE\n";
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`cd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget.exe -q $dl_req");
} else {
system("wget.exe -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`dir /B *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget.exe -v $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("c:\\docume~1\\user\\desktop\\dUpeDL.pl");
chdir("$this_dir");
$counter++;
}
}

$|=1;

if ( $multi == 1 ) {
chdir("$this_dir");
rename("$multi_file", "${multi_file}.done");
}
exit(0);


---- dUpeDL - Based almost entirely on the findDupeFiles script by Cameron Hayne (macdev@hayne.net) - modified for win32

#!c:\perl\bin\perl

#
# dUpeDL - Based on the following script - only slightly modified to work with
# dUrl and Windows Perl.
# Below: The original liner notes for full attribution to the original author.
# Note that the attribution was taken verbatim from the Linux/Unix script and may not
# be entirely accurate due to the fact that this script is a win32 port.
#
# findDupeFiles:
# This script attempts to identify which files might be duplicates.
# It searches specified directories for files with a given suffix
# and reports on files that have the same MD5 digest.
# The suffix or suffixes to be searched for are specified by the first
# command-line argument - each suffix separated from the next by a vertical bar.
# The subsequent command-line arguments specify the directories to be searched.
# If no directories are specified on the command-line,
# it searches the current directory.
# Files whose names start with "._" are ignored.
#
# Cameron Hayne (macdev@hayne.net) January 2006 (revised March 2006)
#
#
# Examples of use:
# ----------------
# findDupeFiles '.aif|.aiff' AAA BBB CCC
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the directories AAA, BBB, and CCC
#
# findDupeFiles '.aif|.aiff'
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the current directory
#
# findDupeFiles '' AAA BBB CCC
# would look for duplicates among all the files (no matter what suffix)
# under the directories AAA, BBB, and CCC
#
# findDupeFiles
# would look for duplicates among all the files (no matter what suffix)
# under the current directory
# -----------------------------------------------------------------------------

use strict;
use warnings;

use File::Find;
use File::stat;
use Digest::MD5;
use Fcntl;

#REMOVE WHEN WE MERGE - UNNECESSARY
my $debug=0;

my $matchSomeSuffix;
if (defined($ARGV[0])) {
my @suffixes = split(/\|/, $ARGV[0]);
if (scalar(@suffixes) > 0) {
my $matchExpr = join('||', map {"m/\$suffixes[$_]\$/io"} 0..$#suffixes);
$matchSomeSuffix = eval "sub {$matchExpr}";
}
shift @ARGV;
}

my @searchDirs = @ARGV ? @ARGV : ".";
foreach my $dir (@searchDirs) {
die "\"$dir\" is not a directory\n" unless -d "$dir";
}
my %filesByDataLength;

sub calcMd5($) {

my ($filename) = @_;
if (-d $filename) {
return "unsupported";
}
sysopen(FILE, $filename, O_RDONLY) or die "Unable to open file \"$filename\": $!\n";
binmode(FILE);
my $md5 = Digest::MD5->new->addfile(*FILE)->hexdigest;
close(FILE);
return $md5;
}

sub hashByMd5($) {

my ($fileInfoListRef) = @_;
my %filesByMd5;
foreach my $fileInfo (@{$fileInfoListRef}) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $md5 = calcMd5("$dirname/$filename");
push(@{$filesByMd5{$md5}}, $fileInfo);
}
return \%filesByMd5;
}

sub checkFile() {

return unless -f $_;
my $filename = $_;
my $dirname = $File::Find::dir;
return if $filename =~ /^\._/;
if (defined($matchSomeSuffix)) {
return unless &$matchSomeSuffix;
}
my $statInfo = stat($filename) or warn "Can't stat file \"$dirname/$filename\": $!\n" and return;
my $size = $statInfo->size;
my $fileInfo = { 'dirname' => $dirname,
'filename' => $filename,
};
push(@{$filesByDataLength{$size}}, $fileInfo);
}

MAIN: {

find(\&checkFile, @searchDirs);
my $numDupes = 0;
my $numDupeBytes = 0;
if ( $debug ) {
print "Dupe Checking\n";
} else {
print "Dupe Checking - ";
}
foreach my $size (sort {$b<=>$a} keys %filesByDataLength) {
my $numSameSize = scalar(@{$filesByDataLength{$size}});
next unless $numSameSize > 1;
if ( $debug ) {
print "size: $size numSameSize: $numSameSize\n";
}
my $filesByMd5Ref = hashByMd5($filesByDataLength{$size});
my %filesByMd5 = %{$filesByMd5Ref};
foreach my $md5 (keys %filesByMd5) {
my @sameMd5List = @{$filesByMd5{$md5}};
my $numSameMd5 = scalar(@sameMd5List);
next unless $numSameMd5 > 1;
my $rsrcMd5;
my $dupe_counter=0;
foreach my $fileInfo (@sameMd5List) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $filepath = "$dirname/$filename";
if ( $dupe_counter == 0 ) {
if ( $debug ) {
print "KEEPING $filepath - MD5 $md5\n";
}
$dupe_counter++;
} else {
if ( $debug ) {
print "DELETING $filepath - MD5 $md5\n";
} else {
print "D";
}
unlink("$filepath");
}
}
if ( $debug) {
print "----------\n";
}
$numDupes += ($numSameMd5 - 1);
$numDupeBytes += ($size * ($numSameMd5 - 1));
}
}
print "----------\n";
my $numDupeMegabytes = sprintf("%.1f", $numDupeBytes / (1024 * 1024));
print "Number of duplicate files: $numDupes\n";
print "Estimated Mb Savings: $numDupeMegabytes\n";
}



, Mike




Monday, December 24, 2007

Mass URL Downloading Using Perl for Linux or Solaris

Hopefully everyone's Holidays are going well. For Christmas Eve (which many of you may not observe, but our family traditionally does), I've decided to put a script I'm currently working on at the gift table. Family heritage celebrates Christmas on the Eve, so this is actually on time. Since the other half of my family are traditionalists, we still open a few gifts on the regular Holiday. To that end, I've put together an interesting variation on this script (the same, but wholly different) for tomorrow's post.

I call the script below "dUrl" because I wrote it to download tons of URLs (Clever? No ;) I've actually used this script for business-related purposes, with sections ripped out, as it does have its place in that environment. For the most part, though, I use it to do massive downloading of pictures, video, audio, etc. - all the stuff that's totally unnecessary and, therefore, the most fun to pull down on a massive scale. It will basically take any page you feed it and rip all the content down, including those bothersome PHP redirects that everyone uses nowadays to discourage this sort of activity :)

Although I don't put any restrictions in the script (in the form of "requires" statements), this might not run on Perl versions lower than 5.6. Just a guess based on some other scripts I've had to upgrade that used a lot of the same elements. This script has also been personally tested on RedHat Linux 7.x up to AS5 and Sun Solaris 2.6 through 10. It should theoretically work on any machine that can make use of Perl, with some possible minor revisions to any system calls.

This script only requires that you have the LWP::Simple Perl module installed (and all of its required modules). I've found that the simplest way to set this all up (since the dependency tree can be humongous if you're working off of a plain-vanilla installation of Perl) is to use the CPAN module. (If you don't have this already, download it and install it - it comes standard with even the barest of Perl installations, as far as I know. In a future post, I'll go into the old-style method of downloading all the modules and building them using the Makefile.PL method.) Generally, this should do it (except the first time you use it, at which point it will ask you a bunch of questions it pretty much already knows the answers to ;) :

perl -MCPAN -e shell

> install LWP::Simple


and let it ride. You'll probably have to answer "y" a lot.

The script is fairly simple to use. It runs in two different modes: Single and Multi File mode. In Single File mode (and I define "file," in this context, as a URL), it's invoked like this:

host # ./dUrl http://host.com/webpage.html

In Multi File mode, it's invoked much the same way, except you pass the "-f" flag followed by a real file that contains a list of URLs, one per line, like so:

host # ./dUrl -f FILENAME
host # cat FILENAME
http://host.com/page1.html
http://host.com/page2.html
...


This is a project in progress (Working on getting my SourceForge/FreshMeat pages up) and there are several things to note. It currently:

- Does not support SSL. Only the http protocol will work with it.
- Has built in support for dealing with PHP redirects. These are a royal pain in the bottom and I'm probably going to make this an entirely separate script, or module, that this script will use. Currently, I'm dissecting and re-following links using temp file parsing. It works for most of the larger media-hosting providers, but is still too specific to work on "all" of them.
- This version only downloads images. Modify the $dl_item filter to get around this (see the sketch just after this list).
- Relies on "wget" for some of its functionality. This is due to the stage of infancy it's in. See the beta version number for a good laugh ;)
- Is longer than it needs to be, for my own ease of debugging. Eventually, the duplication of typed text will be rolled up into loops.
- Contains some possibly curious liner notes ;)
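
To expand on the images-only note above: the restriction lives in the extension filter inside the download loop, so widening it is a one-line change. Here's a sketch (untested, and the extension list is purely an illustration) of the loosened loop from the script below:

foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
# Loosened from /(jpg|jpeg|gif|png)/ to also keep some video and audio -
# this extension list is just an example; add whatever you're after
next if ( $dl_item !~ /(jpg|jpeg|gif|png|mpg|mpeg|avi|wmv|mp3|wav)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}

(The extension-guessing lines further down, which print the j/g/p progress characters, would want the same treatment.)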

And, finally, until I find a better solution (which may not be possible, because this one's pretty darned good), I incorporate a slightly modified version of the "findDupeFiles" script, written by Cameron Hayne (macdev@hayne.net). It is only modified to the degree that I needed to make it work as an add-on to my script. You can find the original version of the file at http://hayne.net/MacDev/Perl/findDupeFiles. Although he doesn't know it, Cameron's excellent MD5-based duplicate finding script has been invaluable in saving me lots of time. I originally just looked for duplicates and renamed them using a simple incrementing-number scheme (since I'm paranoid and assume that I might accidentally lose something unique, no matter how unlikely the case ;) I've included the modified version of Cameron's script in this post as well, giving him full attribution with regards to his script's original header comment section.

Note, also, that I've renamed his script dUpeDL, since it's not the pure version. Modify the lines in dUrl that contain that script name, as they indicate a fictitious location for it!

Happy Christmas Eve, everyone :) For those of you who don't observe the Holiday, again, just consider this a special post that showed up at an arbitrary time ;)


Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!/usr/local/bin/perl

#
# 2007 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001a
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#


use LWP::Simple;

if ( $#ARGV < 0 ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}

$debug=0;
$multi=0;
$counter=1;
# simple for now - better save-system later...
# also, we'll make the shared download system a function

if ( $ARGV[0] eq "-f" ) {
if ( ! -f $ARGV[1] ) {
print "Can't find URL file $ARGV[1]!\n";
exit(2);
}
$multi_file=$ARGV[1];
$multi=1;
chomp($download_dir="$ARGV[1]");
$download_dir =~ s/\//_/g;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
} else {
chomp($download_dir="$ARGV[0]");
if ( $download_dir !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
$download_dir =~ s/.*\/\/([^\/]*).*/$1/;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
}

if ( $multi == 0 ) {
@dl_list=();
$url="@ARGV";
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "No point in continuing...\n";
print "Moving $download_dir to ${download_dir}.empty...\n";
rename("$download_dir", "${download_dir}.empty");
exit(4);
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`pwd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
# Work that bastard extra hard if it's a PHP Trick-Link
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget -q $dl_req");
} else {
# We need wget because the Simple GET can't follow trails
system("wget -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
# PHP links are a pain -
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget -q $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("/export/home/users/dUpeDL");
chdir("$this_dir");
} elsif ( $multi == 1 ) {
open(MULTIFILE, "<$multi_file");
@multi_file = <MULTIFILE>;
close(MULTIFILE);
print "------------------- MULTIFILE MODE ------------------------\n";
foreach $multifile_entry (@multi_file) {
@dl_list=();
print "-------------------- FILE $counter ------------------------\n";
$url="$multifile_entry";
if ( $url !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "Trying next FILE\n";
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`pwd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget -q $dl_req");
} else {
system("wget -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget -v $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("/users/mgolvach/bin/dUpeDL");
chdir("$this_dir");
$counter++;
}
}

$|=1;

if ( $multi == 1 ) {
chdir("$this_dir");
rename("$multi_file", "${multi_file}.done");
system("tar cpf ${download_dir}.tar $download_dir");
}
exit(0);


---- dUpeDL - Based almost entirely on the findDupeFiles script by Cameron Hayne (macdev@hayne.net)

#!/usr/local/bin/perl

#
# dUpeDL - Based on the following script - only slightly modified to work with dURL
# Below: The original liner notes for full attribution to the original author.
#
# findDupeFiles:
# This script attempts to identify which files might be duplicates.
# It searches specified directories for files with a given suffix
# and reports on files that have the same MD5 digest.
# The suffix or suffixes to be searched for are specified by the first
# command-line argument - each suffix separated from the next by a vertical bar.
# The subsequent command-line arguments specify the directories to be searched.
# If no directories are specified on the command-line,
# it searches the current directory.
# Files whose names start with "._" are ignored.
#
# Cameron Hayne (macdev@hayne.net) January 2006 (revised March 2006)
#
#
# Examples of use:
# ----------------
# findDupeFiles '.aif|.aiff' AAA BBB CCC
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the directories AAA, BBB, and CCC
#
# findDupeFiles '.aif|.aiff'
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the current directory
#
# findDupeFiles '' AAA BBB CCC
# would look for duplicates among all the files (no matter what suffix)
# under the directories AAA, BBB, and CCC
#
# findDupeFiles
# would look for duplicates among all the files (no matter what suffix)
# under the current directory
# -----------------------------------------------------------------------------

use strict;
use warnings;

use File::Find;
use File::stat;
use Digest::MD5;
use Fcntl;

#REMOVE WHEN WE MERGE - UNNECESSARY
my $debug=0;

my $matchSomeSuffix;
if (defined($ARGV[0])) {
my @suffixes = split(/\|/, $ARGV[0]);
if (scalar(@suffixes) > 0) {
my $matchExpr = join('||', map {"m/\$suffixes[$_]\$/io"} 0..$#suffixes);
$matchSomeSuffix = eval "sub {$matchExpr}";
}
shift @ARGV;
}

my @searchDirs = @ARGV ? @ARGV : ".";
foreach my $dir (@searchDirs) {
die "\"$dir\" is not a directory\n" unless -d "$dir";
}
my %filesByDataLength;

sub calcMd5($) {

my ($filename) = @_;
if (-d $filename) {
return "unsupported";
}
sysopen(FILE, $filename, O_RDONLY) or die "Unable to open file \"$filename\": $!\n";
binmode(FILE);
my $md5 = Digest::MD5->new->addfile(*FILE)->hexdigest;
close(FILE);
return $md5;
}

sub hashByMd5($) {

my ($fileInfoListRef) = @_;
my %filesByMd5;
foreach my $fileInfo (@{$fileInfoListRef}) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $md5 = calcMd5("$dirname/$filename");
push(@{$filesByMd5{$md5}}, $fileInfo);
}
return \%filesByMd5;
}

sub checkFile() {

return unless -f $_;
my $filename = $_;
my $dirname = $File::Find::dir;
return if $filename =~ /^\._/;
if (defined($matchSomeSuffix)) {
return unless &$matchSomeSuffix;
}
my $statInfo = stat($filename) or warn "Can't stat file \"$dirname/$filename\": $!\n" and return;
my $size = $statInfo->size;
my $fileInfo = { 'dirname' => $dirname,
'filename' => $filename,
};
push(@{$filesByDataLength{$size}}, $fileInfo);
}

MAIN: {

find(\&checkFile, @searchDirs);
my $numDupes = 0;
my $numDupeBytes = 0;
if ( $debug ) {
print "Dupe Checking\n";
} else {
print "Dupe Checking - ";
}
foreach my $size (sort {$b<=>$a} keys %filesByDataLength) {
my $numSameSize = scalar(@{$filesByDataLength{$size}});
next unless $numSameSize > 1;
if ( $debug ) {
print "size: $size numSameSize: $numSameSize\n";
}
my $filesByMd5Ref = hashByMd5($filesByDataLength{$size});
my %filesByMd5 = %{$filesByMd5Ref};
foreach my $md5 (keys %filesByMd5) {
my @sameMd5List = @{$filesByMd5{$md5}};
my $numSameMd5 = scalar(@sameMd5List);
next unless $numSameMd5 > 1;
my $rsrcMd5;
my $dupe_counter=0;
foreach my $fileInfo (@sameMd5List) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $filepath = "$dirname/$filename";
if ( $dupe_counter == 0 ) {
if ( $debug ) {
print "KEEPING $filepath - MD5 $md5\n";
}
$dupe_counter++;
} else {
if ( $debug ) {
print "DELETING $filepath - MD5 $md5\n";
} else {
print "D";
}
unlink("$filepath");
}
}
if ( $debug) {
print "----------\n";
}
$numDupes += ($numSameMd5 - 1);
$numDupeBytes += ($size * ($numSameMd5 - 1));
}
}
print "----------\n";
my $numDupeMegabytes = sprintf("%.1f", $numDupeBytes / (1024 * 1024));
print "Number of duplicate files: $numDupes\n";
print "Estimated Mb Savings: $numDupeMegabytes\n";
}



, Mike