Thursday, April 23, 2009

Beginning Modifications To Our Internet Mass Downloader For Linux And Unix

Hey there,

Today, we've got a little update (in need of some more updating itself) for those of you who like to scrape the web for pictures using our mass URL downloading Perl script, or your own variation thereof. It should run on virtually any version or distro of Linux or Unix with Perl installed; if not, it should hopefully only need minor modification.

NOTE: This update only addresses about 80% of the problems we've encountered. We'll post another update as soon as we figure out how to get around the remaining ornery 20% ;)

It seems that the folks at imagevenue (and other multimedia holding tanks) have gotten around to changing the way they do their PHP redirects. Those are the annoying little scripts that open a new window when you click on a hyperlink and then redirect you to another location, which either contains the picture (or whatever) you want to download or, in some extreme cases, contains even more redirection. Of course, we don't blame them. What we're doing here, breaking through all that nonsense to automate it in a Perl script, isn't unethical, but we understand that it might be a pain in the arse ;) And, judging from what we see on download.com from time to time, we're probably in the minority of people putting the hurt on them (and God bless them for still sticking around :)
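Just so it's clear what we're untangling, here's a minimal sketch (the page layout is assumed, based roughly on what these sites were serving at the time, and the file name is hypothetical) of what "unraveling" one of these redirects amounts to: save the page the php link hands you, then fish the real target out of the markup:

#!/usr/local/bin/perl

# Minimal sketch only - assumes the saved redirect page contains
# a SRC="..." attribute pointing at the real image. The file name
# below is hypothetical.

$php_page = "redirect.php.html";
open(PHPFILE, "<$php_page") or die "Can't open $php_page: $!\n";
while (<PHPFILE>) {
    if ( /SRC=\"([^\"]*)\"/ ) {
        print "Real target: $1\n";
    }
}
close(PHPFILE);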

This update is being presented in the form of a patch, created using "diff -c". If you need help applying it, check out this old post on using patch the easy way; if you're already familiar with "patch," just follow the simple prompts below. We've also included the same "dUpeDL" script that the Perl script calls, based on the findDupeFiles script by Cameron Hayne (macdev@hayne.net), with full attribution and the original headers included at the top of that fantastic "MD5 checksum + size" duplicate checker.

In order to update your old version of "dUrl" (check the above link if you need to download the latest version of the source), just download the original version (also, check out this post for some ideas about how to creatively download scripts from this blog; they sometimes cut and paste out as one continuous line!) and do the following. We're assuming your original script is called "dUrl" and our patch is called "dUrl.patch":

host # cp dUrl dUrl.bak
host # wc -l *
325 dUrl
130 dUrl.patch
325 dUrl.bak
host # patch -p0 dUrl dUrl.patch
patching file dUrl

host # wc -l *
335 dUrl
325 dUrl.bak
130 dUrl.patch


Check the above link, also, for the easy way to back out the patch if you don't care for the mods. And, once you're done, be sure to change the "/home/mgolvach.." or "/users/..." paths that call the dUpeDL script so that they point to wherever you have that script located on your machine :)
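For instance, if your copy of dUpeDL lives in $HOME/bin (a hypothetical location - adjust to taste), something like this sed one-liner should fix up the path, and "patch -R" will back the mods out entirely if you change your mind (note that sed's -i flag is GNU sed; on other systems, redirect to a temp file instead):

host # sed -i 's|/export/home/users/dUpeDL|'"$HOME"'/bin/dUpeDL|' dUrl
host # patch -R -p0 dUrl dUrl.patch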

Cheers,

Mike


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.


Begin Patch

*** dUrl Wed Apr 22 20:12:33 2009
--- dUrl.new Wed Apr 22 20:16:17 2009
***************
*** 1,7 ****
#!/usr/local/bin/perl

#
! # 2007 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001a
#
# <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/">Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License</a>
#
--- 1,7 ----
#!/usr/local/bin/perl

#
! # 2009 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001b
#
# <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/">Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License</a>
#
***************
*** 189,194 ****
--- 189,196 ----
foreach $multifile_entry (@multi_file) {
@dl_list=();
print "-------------------- FILE $counter ------------------------\n";
+ $phpcounter=1;
+ $phpurl = $dl_req;
$url="$multifile_entry";
if ( $url !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
***************
*** 237,244 ****
}
}
if ( $dl_req =~ /php\?/ ) {
! $dl_req =~ s/\&/\\&/g;
! system("wget -q $dl_req");
} else {
system("wget -q $dl_req");
}
--- 239,255 ----
}
}
if ( $dl_req =~ /php\?/ ) {
! if ( $dl_req =~ /img.php/ ) {
! $phpender = $dl_req;
! $phpstarter = $dl_req;
! $phpstarter =~ s/^(http:\/\/[^\/]*\/).*$/$1/;
! $phpender =~ s/^.*image=(.*)$/$1/;
! $phpcontent = "${phpstarter}$phpender";
! system("wget -q $phpcontent");
! } else {
! $dl_req =~ s/\&/\\&/g;
! system("wget -q $dl_req");
! }
} else {
system("wget -q $dl_req");
}
***************
*** 251,268 ****
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
! print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
! if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
! $php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;
!
$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
--- 262,278 ----
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
! print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
! if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
! $php_tail =~ s/.*SRC=\"([^\"]*)\".*/$1/;
$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
***************
*** 276,282 ****
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
! $php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
--- 286,292 ----
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
! $php_real_url =~ s/.*url=(http.*?\.[jgp][pin][gf]).*/$1/;
}
}
}
***************
*** 309,315 ****
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
! system("/users/mgolvach/bin/dUpeDL");
chdir("$this_dir");
$counter++;
}
--- 319,325 ----
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
! system("/export/home/users/dUpeDL");
chdir("$this_dir");
$counter++;
}



End Patch
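In case you want to adapt the new img.php handling to another site, the gist of it is that, instead of fetching the php redirect page and hoping for the best, we now rebuild the direct URL from its two pieces: the site root and the value of the "image=" query parameter. Here's a rough standalone illustration using the same substitutions as the patch above (the URL below is made up):

#!/usr/local/bin/perl

# Rough illustration of the URL reconstruction in the patch above.
# The example URL is made up - real links will differ in detail.

$dl_req = "http://img.example.com/img.php?loc=loc1&image=12345_pic.jpg";
$phpstarter = $dl_req;
$phpender = $dl_req;
$phpstarter =~ s/^(http:\/\/[^\/]*\/).*$/$1/;   # keep just http://host/
$phpender =~ s/^.*image=(.*)$/$1/;              # keep just the image= value
$phpcontent = "${phpstarter}$phpender";
print "$phpcontent\n";   # prints http://img.example.com/12345_pic.jpg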

---- dUpeDL - Based almost entirely on the findDupeFiles script by Cameron Hayne (macdev@hayne.net)

#!/usr/local/bin/perl

#
# dUpeDL - Based on the following script - only slightly modified to work with dURL
# Below: The original liner notes for full attribution to the original author.
#
# findDupeFiles:
# This script attempts to identify which files might be duplicates.
# It searches specified directories for files with a given suffix
# and reports on files that have the same MD5 digest.
# The suffix or suffixes to be searched for are specified by the first
# command-line argument - each suffix separated from the next by a vertical bar.
# The subsequent command-line arguments specify the directories to be searched.
# If no directories are specified on the command-line,
# it searches the current directory.
# Files whose names start with "._" are ignored.
#
# Cameron Hayne (macdev@hayne.net) January 2006 (revised March 2006)
#
#
# Examples of use:
# ----------------
# findDupeFiles '.aif|.aiff' AAA BBB CCC
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the directories AAA, BBB, and CCC
#
# findDupeFiles '.aif|.aiff'
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the current directory
#
# findDupeFiles '' AAA BBB CCC
# would look for duplicates among all the files (no matter what suffix)
# under the directories AAA, BBB, and CCC
#
# findDupeFiles
# would look for duplicates among all the files (no matter what suffix)
# under the current directory
# -----------------------------------------------------------------------------

use strict;
use warnings;

use File::Find;
use File::stat;
use Digest::MD5;
use Fcntl;

#REMOVE WHEN WE MERGE - UNNECESSARY
my $debug=0;

my $matchSomeSuffix;
if (defined($ARGV[0])) {
    my @suffixes = split(/\|/, $ARGV[0]);
    if (scalar(@suffixes) > 0) {
        my $matchExpr = join('||', map {"m/\$suffixes[$_]\$/io"} 0..$#suffixes);
        $matchSomeSuffix = eval "sub {$matchExpr}";
    }
    shift @ARGV;
}

my @searchDirs = @ARGV ? @ARGV : ".";
foreach my $dir (@searchDirs) {
die "\"$dir\" is not a directory\n" unless -d "$dir";
}
my %filesByDataLength;

# Compute the MD5 hex digest of a file's contents.
sub calcMd5($) {
    my ($filename) = @_;
    if (-d $filename) {
        return "unsupported";
    }
    sysopen(FILE, $filename, O_RDONLY) or die "Unable to open file \"$filename\": $!\n";
    binmode(FILE);
    my $md5 = Digest::MD5->new->addfile(*FILE)->hexdigest;
    close(FILE);
    return $md5;
}

# Group a list of file-info records by the MD5 digest of their contents.
sub hashByMd5($) {
    my ($fileInfoListRef) = @_;
    my %filesByMd5;
    foreach my $fileInfo (@{$fileInfoListRef}) {
        my $dirname = $fileInfo->{dirname};
        my $filename = $fileInfo->{filename};
        my $md5 = calcMd5("$dirname/$filename");
        push(@{$filesByMd5{$md5}}, $fileInfo);
    }
    return \%filesByMd5;
}

# File::Find callback: record the size and location of each candidate file.
sub checkFile() {
    return unless -f $_;
    my $filename = $_;
    my $dirname = $File::Find::dir;
    return if $filename =~ /^\._/;
    if (defined($matchSomeSuffix)) {
        return unless &$matchSomeSuffix;
    }
    my $statInfo = stat($filename) or warn "Can't stat file \"$dirname/$filename\": $!\n" and return;
    my $size = $statInfo->size;
    my $fileInfo = { 'dirname'  => $dirname,
                     'filename' => $filename,
                   };
    push(@{$filesByDataLength{$size}}, $fileInfo);
}

MAIN: {
    find(\&checkFile, @searchDirs);
    my $numDupes = 0;
    my $numDupeBytes = 0;
    if ( $debug ) {
        print "Dupe Checking\n";
    } else {
        print "Dupe Checking - ";
    }
    foreach my $size (sort {$b<=>$a} keys %filesByDataLength) {
        my $numSameSize = scalar(@{$filesByDataLength{$size}});
        next unless $numSameSize > 1;
        if ( $debug ) {
            print "size: $size numSameSize: $numSameSize\n";
        }
        my $filesByMd5Ref = hashByMd5($filesByDataLength{$size});
        my %filesByMd5 = %{$filesByMd5Ref};
        foreach my $md5 (keys %filesByMd5) {
            my @sameMd5List = @{$filesByMd5{$md5}};
            my $numSameMd5 = scalar(@sameMd5List);
            next unless $numSameMd5 > 1;
            my $rsrcMd5;
            my $dupe_counter=0;
            foreach my $fileInfo (@sameMd5List) {
                my $dirname = $fileInfo->{dirname};
                my $filename = $fileInfo->{filename};
                my $filepath = "$dirname/$filename";
                if ( $dupe_counter == 0 ) {
                    if ( $debug ) {
                        print "KEEPING $filepath - MD5 $md5\n";
                    }
                    $dupe_counter++;
                } else {
                    if ( $debug ) {
                        print "DELETING $filepath - MD5 $md5\n";
                    } else {
                        print "D";
                    }
                    unlink("$filepath");
                }
            }
            if ( $debug ) {
                print "----------\n";
            }
            $numDupes += ($numSameMd5 - 1);
            $numDupeBytes += ($size * ($numSameMd5 - 1));
        }
    }
    print "----------\n";
    my $numDupeMegabytes = sprintf("%.1f", $numDupeBytes / (1024 * 1024));
    print "Number of duplicate files: $numDupes\n";
    print "Estimated Mb Savings: $numDupeMegabytes\n";
}
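dUrl calls dUpeDL with no arguments from inside the download directory, but you can also run it by hand, just like the original findDupeFiles: suffixes first (separated by vertical bars), then the directories to search. For example (the directory name below is hypothetical):

host # cd my_download_dir
host # dUpeDL '.jpg|.jpeg|.gif|.png'

It prints a "D" for every duplicate it deletes, followed by a count of the duplicates and an estimate of the space reclaimed; with no suffix argument at all, it checks every file under the current directory.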








