
Thursday, April 23, 2009

Beginning Modifications To Our Internet Mass Downloader For Linux And Unix

Hey there,

Today, we've got a little update (in need of some more updating) for those of you who like to scrape the web for pictures using our mass URL downloading Perl script, or your own variation thereof. It should run on virtually any version or distro of Linux or Unix with Perl installed; if it doesn't, it should only need minor modification.

NOTE: This update only addresses about 80% of the problems we've encountered. We'll post an update as soon as we figure out how to get around those ornery remaining 20% ;)

It seems that the folks at imagevenue (and other multimedia holding-tanks) have gotten around to changing the way they do their PHP redirects. Those are the annoying little scripts that open a new window when you click on a hyperlink and then redirect you to another location, which either contains the picture (or whatever) you want to download or, in some extreme cases, contains even more redirection. Of course, we don't blame them. What we're doing here, breaking through all that nonsense to automate it in a Perl script, isn't unethical, but we understand that it might be a pain in the arse ;) And, judging from what I see on download.com from time to time, we're probably in the minority of people putting the hurt on them (and God bless them for still sticking around :)
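To make the change concrete, here's a minimal sketch of the shortcut the patch below adds: when a link points at an img.php redirect, the direct image URL can often be rebuilt from the host portion of the link plus the value of its image= parameter. The sample URL and parameter layout here are imagevenue-style assumptions and may differ on other hosts.

#!/usr/bin/perl
# Sketch only - the sample URL below is made up for illustration.
my $dl_req = "http://img.example.com/img.php?loc=loc123&image=pics/photo.jpg";

if ( $dl_req =~ /img\.php/ ) {
    my $phpstarter = $dl_req;
    my $phpender   = $dl_req;
    $phpstarter =~ s/^(http:\/\/[^\/]*\/).*$/$1/;   # keep "http://host/"
    $phpender   =~ s/^.*image=(.*)$/$1/;            # keep everything after "image="
    my $phpcontent = "${phpstarter}${phpender}";    # direct URL to the picture
    print "Would fetch: $phpcontent\n";             # dUrl hands this to wget -q
}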

This update is being presented in the form of a patch, created using "diff -c". If you need help applying it, check out this old post on using patch the easy way; if you're already familiar with "patch," just follow the simple prompts below. We've also included the same "dUpeDL" script that the Perl script calls (based on the findDupeFiles script by Cameron Hayne (macdev@hayne.net) - with full attribution and the original headers included at the top of that fantastic "MD5 checksum + size" duplicate checker).

In order to update your old version of "dUrl" (check the above link if you need to download the latest version of the source), just download the original version (also, check out this post for some ideas about how to creatively download scripts from this blog; they sometimes cut and paste out as one continuous line!) and do the following. We're assuming your original script is called "dUrl" and our patch is called "dUrl.patch":

host # cp dUrl dUrl.bak
host # wc -l *
325 dUrl
130 dUrl.patch
325 dUrl.bak
host # patch -p0 dUrl dUrl.patch
patching file dUrl

host # wc -l *
335 dUrl
325 dUrl.bak
130 dUrl.patch


Check the above link, too, for the easy way to back out the patch if you don't care for the mods. Once you're done, be sure to change all the "/home/mgolvach.." or "/users/..." paths that call the dUpeDL script to wherever you have that script located on your machine :)
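For instance, backing out is just a matter of running patch in reverse, and (assuming the patched script points at /export/home/users/dUpeDL, as it does below) a quick in-place edit handles the path change - substitute your own location for the placeholder:

host # patch -R -p0 dUrl dUrl.patch
host # perl -pi -e 's|/export/home/users/dUpeDL|/path/to/your/dUpeDL|' dUrl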

Cheers,


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.


Begin Patch

*** dUrl Wed Apr 22 20:12:33 2009
--- dUrl.new Wed Apr 22 20:16:17 2009
***************
*** 1,7 ****
#!/usr/local/bin/perl

#
! # 2007 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001a
#
# <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/">Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License</a>
#
--- 1,7 ----
#!/usr/local/bin/perl

#
! # 2009 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001b
#
# <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/3.0/us/">Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License</a>
#
***************
*** 189,194 ****
--- 189,196 ----
foreach $multifile_entry (@multi_file) {
@dl_list=();
print "-------------------- FILE $counter ------------------------\n";
+ $phpcounter=1;
+ $phpurl = $dl_req;
$url="$multifile_entry";
if ( $url !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
***************
*** 237,244 ****
}
}
if ( $dl_req =~ /php\?/ ) {
! $dl_req =~ s/\&/\\&/g;
! system("wget -q $dl_req");
} else {
system("wget -q $dl_req");
}
--- 239,255 ----
}
}
if ( $dl_req =~ /php\?/ ) {
! if ( $dl_req =~ /img.php/ ) {
! $phpender = $dl_req;
! $phpstarter = $dl_req;
! $phpstarter =~ s/^(http:\/\/[^\/]*\/).*$/$1/;
! $phpender =~ s/^.*image=(.*)$/$1/;
! $phpcontent = "${phpstarter}$phpender";
! system("wget -q $phpcontent");
! } else {
! $dl_req =~ s/\&/\\&/g;
! system("wget -q $dl_req");
! }
} else {
system("wget -q $dl_req");
}
***************
*** 251,268 ****
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
! print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
! if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
! $php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;
!
$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
--- 262,278 ----
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
! print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
! if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
! $php_tail =~ s/.*SRC=\"([^\"]*)\".*/$1/;
$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
***************
*** 276,282 ****
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
! $php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
--- 286,292 ----
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
! $php_real_url =~ s/.*url=(http.*?\.[jgp][pin][gf]).*/$1/;
}
}
}
***************
*** 309,315 ****
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
! system("/users/mgolvach/bin/dUpeDL");
chdir("$this_dir");
$counter++;
}
--- 319,325 ----
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
! system("/export/home/users/dUpeDL");
chdir("$this_dir");
$counter++;
}



End Patch

---- dUpeDL - Based almost entirely on the findDupeFiles script by Cameron Hayne (macdev@hayne.net)

#!/usr/local/bin/perl

#
# dUpeDL - Based on the following script - only slightly modified to work with dURL
# Below: The original liner notes for full attribution to the original author.
#
# findDupeFiles:
# This script attempts to identify which files might be duplicates.
# It searches specified directories for files with a given suffix
# and reports on files that have the same MD5 digest.
# The suffix or suffixes to be searched for are specified by the first
# command-line argument - each suffix separated from the next by a vertical bar.
# The subsequent command-line arguments specify the directories to be searched.
# If no directories are specified on the command-line,
# it searches the current directory.
# Files whose names start with "._" are ignored.
#
# Cameron Hayne (macdev@hayne.net) January 2006 (revised March 2006)
#
#
# Examples of use:
# ----------------
# findDupeFiles '.aif|.aiff' AAA BBB CCC
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the directories AAA, BBB, and CCC
#
# findDupeFiles '.aif|.aiff'
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the current directory
#
# findDupeFiles '' AAA BBB CCC
# would look for duplicates among all the files (no matter what suffix)
# under the directories AAA, BBB, and CCC
#
# findDupeFiles
# would look for duplicates among all the files (no matter what suffix)
# under the current directory
# -----------------------------------------------------------------------------

use strict;
use warnings;

use File::Find;
use File::stat;
use Digest::MD5;
use Fcntl;

#REMOVE WHEN WE MERGE - UNNECESSARY
my $debug=0;

my $matchSomeSuffix;
if (defined($ARGV[0])) {
my @suffixes = split(/\|/, $ARGV[0]);
if (scalar(@suffixes) > 0) {
my $matchExpr = join('||', map {"m/\$suffixes[$_]\$/io"} 0..$#suffixes);
$matchSomeSuffix = eval "sub {$matchExpr}";
}
shift @ARGV;
}

my @searchDirs = @ARGV ? @ARGV : ".";
foreach my $dir (@searchDirs) {
die "\"$dir\" is not a directory\n" unless -d "$dir";
}
my %filesByDataLength;

sub calcMd5($) {

my ($filename) = @_;
if (-d $filename) {
return "unsupported";
}
sysopen(FILE, $filename, O_RDONLY) or die "Unable to open file \"$filename\": $!\n";
binmode(FILE);
my $md5 = Digest::MD5->new->addfile(*FILE)->hexdigest;
close(FILE);
return $md5;
}

sub hashByMd5($) {

my ($fileInfoListRef) = @_;
my %filesByMd5;
foreach my $fileInfo (@{$fileInfoListRef}) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $md5 = calcMd5("$dirname/$filename");
push(@{$filesByMd5{$md5}}, $fileInfo);
}
return \%filesByMd5;
}

sub checkFile() {

return unless -f $_;
my $filename = $_;
my $dirname = $File::Find::dir;
return if $filename =~ /^\._/;
if (defined($matchSomeSuffix)) {
return unless &$matchSomeSuffix;
}
my $statInfo = stat($filename) or warn "Can't stat file \"$dirname/$filename\": $!\n" and return;
my $size = $statInfo->size;
my $fileInfo = { 'dirname' => $dirname,
'filename' => $filename,
};
push(@{$filesByDataLength{$size}}, $fileInfo);
}

MAIN: {

find(\&checkFile, @searchDirs);
my $numDupes = 0;
my $numDupeBytes = 0;
if ( $debug ) {
print "Dupe Checking\n";
} else {
print "Dupe Checking - ";
}
foreach my $size (sort {$b<=>$a} keys %filesByDataLength) {
my $numSameSize = scalar(@{$filesByDataLength{$size}});
next unless $numSameSize > 1;
if ( $debug ) {
print "size: $size numSameSize: $numSameSize\n";
}
my $filesByMd5Ref = hashByMd5($filesByDataLength{$size});
my %filesByMd5 = %{$filesByMd5Ref};
foreach my $md5 (keys %filesByMd5) {
my @sameMd5List = @{$filesByMd5{$md5}};
my $numSameMd5 = scalar(@sameMd5List);
next unless $numSameMd5 > 1;
my $rsrcMd5;
my $dupe_counter=0;
foreach my $fileInfo (@sameMd5List) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $filepath = "$dirname/$filename";
if ( $dupe_counter == 0 ) {
if ( $debug ) {
print "KEEPING $filepath - MD5 $md5\n";
}
$dupe_counter++;
} else {
if ( $debug ) {
print "DELETING $filepath - MD5 $md5\n";
} else {
print "D";
}
unlink("$filepath");
}
}
if ( $debug) {
print "----------\n";
}
$numDupes += ($numSameMd5 - 1);
$numDupeBytes += ($size * ($numSameMd5 - 1));
}
}
print "----------\n";
my $numDupeMegabytes = sprintf("%.1f", $numDupeBytes / (1024 * 1024));
print "Number of duplicate files: $numDupes\n";
print "Estimated Mb Savings: $numDupeMegabytes\n";
}


, Mike








Tuesday, December 25, 2007

Website and URL Downloading Using ActiveState Perl for Windows

Merry Christmas :) Again, my apologies to those of you who don't observe the holiday.

Today's post (inlined scripts aside) will be nice and short. I'll leave it to you to refer to yesterday's post if you want more information, or commentary, regarding the basics of this script and how to use it, as it's almost exactly the same.

These are special versions of the dUrl and dUpeDL scripts that I've ported to use ActiveState's Perl for Windows. This comes in handy when you want to do some massive downloading and don't have access to a Linux or Solaris box.

Note that the one requirement of the original script, "wget," is also needed for this version to work. You can download it for free at http://www.christopherlewis.com/WGet/default.htm. Again, be sure to modify the lines in dUrl that contain the dUpeDL script reference, as they point to a fictitious location!
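For what it's worth, the lines in question are the two system() calls near the bottom of the Windows dUrl that look like the first line below; repoint them at wherever you actually saved dUpeDL.pl (the c:\scripts path is only an example):

system("c:\\docume~1\\user\\desktop\\dUpeDL.pl");
# becomes, for example:
system("c:\\scripts\\dUpeDL.pl");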

Enjoy the day off of work, the company of your family and/or friends and, hopefully, some much deserved peace and rest.

Cheers!


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

#!c:\perl\bin\perl

#
# 2007 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001a
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

use LWP::Simple;

if ( $#ARGV < 0 ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}

$debug=0;
$multi=0;
$counter=1;
# simple for now - better save-system later, maybe...
# also, we'll make the shared download system a function

if ( $ARGV[0] eq "-f" ) {
if ( ! -f $ARGV[1] ) {
print "Can't find URL file $ARGV[1]!\n";
exit(2);
}
$multi_file=$ARGV[1];
$multi=1;
chomp($download_dir="$ARGV[1]");
$download_dir =~ s/\//_/g;
$download_dir =~ s/\\/_/g;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
} else {
chomp($download_dir="$ARGV[0]");
if ( $download_dir !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
$download_dir =~ s/.*\/\/([^\/]*).*/$1/;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
}

if ( $multi == 0 ) {
@dl_list=();
$url="@ARGV";
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "No point in continuing...\n";
print "Moving $download_dir to ${download_dir}.empty...\n";
rename("$download_dir", "${download_dir}.empty");
exit(4);
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`cd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
# Work that bastard extra hard if it's a PHP Trick-Link
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget.exe -q $dl_req");
} else {
# We need wget.exe because the Simple GET can't follow trails
system("wget.exe -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
# PHP links are a pain -
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`dir /B *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget.exe -q $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("c:\\docume~1\\user\\desktop\\dUpeDL.pl");
chdir("$this_dir");
} elsif ( $multi == 1 ) {
open(MULTIFILE, "<$multi_file");
@multi_file = <MULTIFILE>;
close(MULTIFILE);
print "------------------- MULTIFILE MODE ------------------------\n";
foreach $multifile_entry (@multi_file) {
@dl_list=();
print "-------------------- FILE $counter ------------------------\n";
$url="$multifile_entry";
if ( $url !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "Trying next FILE\n";
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`cd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget.exe -q $dl_req");
} else {
system("wget.exe -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`dir /B *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget.exe -v $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("c:\\docume~1\\user\\desktop\\dUpeDL.pl");
chdir("$this_dir");
$counter++;
}
}

$|=1;

if ( $multi == 1 ) {
chdir("$this_dir");
rename("$multi_file", "${multi_file}.done");
}
exit(0);


---- dUpeDL - Based almost entirely on the findDupeFiles script by Cameron Hayne (macdev@hayne.net) - modified for win32

#!c:\perl\bin\perl

#
# dUpeDL - Based on the following script - only slightly modified to work with
# dUrl and Windows Perl.
# Below: The original liner notes for full attribution to the original author.
# Note that the attribution was taken verbatim from the Linux/Unix script and may not
# be entirely accurate due to the fact that this script is a win32 port.
#
# findDupeFiles:
# This script attempts to identify which files might be duplicates.
# It searches specified directories for files with a given suffix
# and reports on files that have the same MD5 digest.
# The suffix or suffixes to be searched for are specified by the first
# command-line argument - each suffix separated from the next by a vertical bar.
# The subsequent command-line arguments specify the directories to be searched.
# If no directories are specified on the command-line,
# it searches the current directory.
# Files whose names start with "._" are ignored.
#
# Cameron Hayne (macdev@hayne.net) January 2006 (revised March 2006)
#
#
# Examples of use:
# ----------------
# findDupeFiles '.aif|.aiff' AAA BBB CCC
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the directories AAA, BBB, and CCC
#
# findDupeFiles '.aif|.aiff'
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the current directory
#
# findDupeFiles '' AAA BBB CCC
# would look for duplicates among all the files (no matter what suffix)
# under the directories AAA, BBB, and CCC
#
# findDupeFiles
# would look for duplicates among all the files (no matter what suffix)
# under the current directory
# -----------------------------------------------------------------------------

use strict;
use warnings;

use File::Find;
use File::stat;
use Digest::MD5;
use Fcntl;

#REMOVE WHEN WE MERGE - UNNECESSARY
my $debug=0;

my $matchSomeSuffix;
if (defined($ARGV[0])) {
my @suffixes = split(/\|/, $ARGV[0]);
if (scalar(@suffixes) > 0) {
my $matchExpr = join('||', map {"m/\$suffixes[$_]\$/io"} 0..$#suffixes);
$matchSomeSuffix = eval "sub {$matchExpr}";
}
shift @ARGV;
}

my @searchDirs = @ARGV ? @ARGV : ".";
foreach my $dir (@searchDirs) {
die "\"$dir\" is not a directory\n" unless -d "$dir";
}
my %filesByDataLength;

sub calcMd5($) {

my ($filename) = @_;
if (-d $filename) {
return "unsupported";
}
sysopen(FILE, $filename, O_RDONLY) or die "Unable to open file \"$filename\": $!\n";
binmode(FILE);
my $md5 = Digest::MD5->new->addfile(*FILE)->hexdigest;
close(FILE);
return $md5;
}

sub hashByMd5($) {

my ($fileInfoListRef) = @_;
my %filesByMd5;
foreach my $fileInfo (@{$fileInfoListRef}) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $md5 = calcMd5("$dirname/$filename");
push(@{$filesByMd5{$md5}}, $fileInfo);
}
return \%filesByMd5;
}

sub checkFile() {

return unless -f $_;
my $filename = $_;
my $dirname = $File::Find::dir;
return if $filename =~ /^\._/;
if (defined($matchSomeSuffix)) {
return unless &$matchSomeSuffix;
}
my $statInfo = stat($filename) or warn "Can't stat file \"$dirname/$filename\": $!\n" and return;
my $size = $statInfo->size;
my $fileInfo = { 'dirname' => $dirname,
'filename' => $filename,
};
push(@{$filesByDataLength{$size}}, $fileInfo);
}

MAIN: {

find(\&checkFile, @searchDirs);
my $numDupes = 0;
my $numDupeBytes = 0;
if ( $debug ) {
print "Dupe Checking\n";
} else {
print "Dupe Checking - ";
}
foreach my $size (sort {$b<=>$a} keys %filesByDataLength) {
my $numSameSize = scalar(@{$filesByDataLength{$size}});
next unless $numSameSize > 1;
if ( $debug ) {
print "size: $size numSameSize: $numSameSize\n";
}
my $filesByMd5Ref = hashByMd5($filesByDataLength{$size});
my %filesByMd5 = %{$filesByMd5Ref};
foreach my $md5 (keys %filesByMd5) {
my @sameMd5List = @{$filesByMd5{$md5}};
my $numSameMd5 = scalar(@sameMd5List);
next unless $numSameMd5 > 1;
my $rsrcMd5;
my $dupe_counter=0;
foreach my $fileInfo (@sameMd5List) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $filepath = "$dirname/$filename";
if ( $dupe_counter == 0 ) {
if ( $debug ) {
print "KEEPING $filepath - MD5 $md5\n";
}
$dupe_counter++;
} else {
if ( $debug ) {
print "DELETING $filepath - MD5 $md5\n";
} else {
print "D";
}
unlink("$filepath");
}
}
if ( $debug) {
print "----------\n";
}
$numDupes += ($numSameMd5 - 1);
$numDupeBytes += ($size * ($numSameMd5 - 1));
}
}
print "----------\n";
my $numDupeMegabytes = sprintf("%.1f", $numDupeBytes / (1024 * 1024));
print "Number of duplicate files: $numDupes\n";
print "Estimated Mb Savings: $numDupeMegabytes\n";
}



, Mike




Monday, December 24, 2007

Mass URL Downloading Using Perl for Linux or Solaris

Hopefully everyone's Holidays are going well. For Christmas Eve (which many of you may not observe, but our family traditionally does), I've decided to put a script I'm currently working on at the gift table. Family heritage celebrates Christmas on the Eve, so this is actually on time. Since the other half of my family are traditionalists, we still open a few gifts on the regular Holiday. To that end, I've put together an interesting variation on this script (the same, but wholly different) for tomorrow's post.

I call the script below "dUrl" because I wrote it to download tons of URLs (Clever? No ;) I've actually used this script for business-related purposes, with sections ripped out, as it does have its place in that environment. For the most part, though, I use it to do massive downloading of pictures, video, audio, etc. - all the stuff that's totally unnecessary and, therefore, the most fun to pull down on a massive scale. It will basically take any page you feed it and rip all the content down, including those bothersome PHP redirects that everyone uses nowadays to discourage this sort of activity :)

Although I don't put any restrictions in the script (in the form of "requires" statements), this might not run on Perl versions lower than 5.6. Just a guess based on some other scripts I've had to upgrade that used a lot of the same elements. This script has also been personally tested on RedHat Linux 7.x up to AS5 and Sun Solaris 2.6 through 10. It should theoretically work on any machine that can make use of Perl, with some possible minor revisions to any system calls.

This script only requires that you have the LWP::Simple Perl module installed (along with all of its required modules). I've found that the simplest way to set this up (since the dependency tree can be humongous if you're working off of a plain-vanilla installation of Perl) is to use the CPAN module. If you don't have it already, download and install it; it comes standard with even the barest of Perl installations, as far as I know (in a future post I'll go into the old-style method of downloading all the modules and building them with the Makefile.PL method). Generally, the following should do it, except the first time you use it, at which point it will ask you a bunch of questions it pretty much already knows the answers to ;) :

perl -MCPAN -e shell

cpan> install LWP::Simple


and let it ride. You'll probably have to answer "y" a lot.
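If you'd rather not babysit the interactive shell, the same install can, as far as I know, be kicked off non-interactively with a one-liner:

perl -MCPAN -e 'install LWP::Simple'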

The script is fairly simple to use. It runs in two different modes: Single and Multi File mode. In Single File mode (and I define "file," in this context, as a URL), it's invoked like this:

host # ./dUrl http://host.com/webpage.html

In Multi File mode, it's invoked much the same way, except the "-f" flag is given and the argument is a real file that contains a list of URLs, one per line, like so:

host # ./dUrl -f FILENAME
host # cat FILENAME
http://host.com/page1.html
http://host.com/page2.html
...


This is a project in progress (Working on getting my SourceForge/FreshMeat pages up) and there are several things to note. It currently:

- Does not support SSL. Only the http protocol will work with it.
- Has built-in support for dealing with PHP redirects. These are a royal pain in the bottom, and I'll probably break this out into an entirely separate script, or module, that this script will use. Currently, I dissect and re-follow the links using temp-file parsing. It works for most of the larger media-hosting providers, but is still too specific to work on "all" of them.
- Only downloads images. Modify the $dl_item filter to get around this (see the sketch just after this list).
- Relies on "wget" for some of its functionality. This is due to the stage of infancy it's in. See the beta version number for a good laugh ;)
- Is longer than it needs to be, for my own ease of debugging. Eventually, the duplication of typed text will be rolled up into loops.
- Contains some possibly curious liner notes ;)
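As a for-instance on that image-only limitation, the line in dUrl doing the filtering is the first one below; widening the extension list (the extra types here are just examples) is all it takes to pull down other kinds of files as well:

next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
# could become, for example:
next if ( $dl_item !~ /(jpg|jpeg|gif|png|mpg|mpeg|avi|mp3)/ );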

And, finally, until I find a better solution (which may not be possible, because this one's pretty darned good), I incorporate a slightly modified version of the "findDupeFiles" script, written by Cameron Hayne (macdev@hayne.net). It is only modified to the degree that I needed to make it work as an add-on to my script. You can find the original version of the file at http://hayne.net/MacDev/Perl/findDupeFiles. Although he doesn't know it, Cameron's excellent MD5-based duplicate finding script has been invaluable in saving me lots of time. I originally just looked for duplicates and renamed them using a simple incrementing-number scheme (since I'm paranoid and assume that I might accidentally lose something unique, no matter how unlikely the case ;) I've included the modified version of Cameron's script in this post as well, giving him full attribution with regards to his script's original header comment section.

Note, also, that I've renamed his script dUpeDL, since it's not the pure version. Modify the lines in dUrl that contain that script name, as they indicate a fictitious location for it!

Happy Christmas Eve, everyone :) For those of you who don't observe the Holiday, again, just consider this a special post that showed up at an arbitrary time ;)


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

#!/usr/local/bin/perl

#
# 2007 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001a
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#


use LWP::Simple;

if ( $#ARGV < 0 ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}

$debug=0;
$multi=0;
$counter=1;
# simple for now - better save-system later...
# also, we'll make the shared download system a function

if ( $ARGV[0] eq "-f" ) {
if ( ! -f $ARGV[1] ) {
print "Can't find URL file $ARGV[1]!\n";
exit(2);
}
$multi_file=$ARGV[1];
$multi=1;
chomp($download_dir="$ARGV[1]");
$download_dir =~ s/\//_/g;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
} else {
chomp($download_dir="$ARGV[0]");
if ( $download_dir !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
$download_dir =~ s/.*\/\/([^\/]*).*/$1/;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
}

if ( $multi == 0 ) {
@dl_list=();
$url="@ARGV";
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "No point in continuing...\n";
print "Moving $download_dir to ${download_dir}.empty...\n";
rename("$download_dir", "${download_dir}.empty");
exit(4);
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`pwd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
# Work that bastard extra hard if it's a PHP Trick-Link
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget -q $dl_req");
} else {
# We need wget because the Simple GET can't follow trails
system("wget -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
# PHP links are a pain -
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget -q $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("/export/home/users/dUpeDL");
chdir("$this_dir");
} elsif ( $multi == 1 ) {
open(MULTIFILE, "<$multi_file");
@multi_file = <MULTIFILE>;
close(MULTIFILE);
print "------------------- MULTIFILE MODE ------------------------\n";
foreach $multifile_entry (@multi_file) {
@dl_list=();
print "-------------------- FILE $counter ------------------------\n";
$url="$multifile_entry";
if ( $url !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "Trying next FILE\n";
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`pwd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget -q $dl_req");
} else {
system("wget -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget -v $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("/users/mgolvach/bin/dUpeDL");
chdir("$this_dir");
$counter++;
}
}

$|=1;

if ( $multi == 1 ) {
chdir("$this_dir");
rename("$multi_file", "${multi_file}.done");
system("tar cpf ${download_dir}.tar $download_dir");
}
exit(0);


---- dUpeDL - Based almost entirely on the findDupeFiles script by Cameron Hayne (macdev@hayne.net)

#!/usr/local/bin/perl

#
# dUpeDL - Based on the following script - only slightly modified to work with dURL
# Below: The original liner notes for full attribution to the original author.
#
# findDupeFiles:
# This script attempts to identify which files might be duplicates.
# It searches specified directories for files with a given suffix
# and reports on files that have the same MD5 digest.
# The suffix or suffixes to be searched for are specified by the first
# command-line argument - each suffix separated from the next by a vertical bar.
# The subsequent command-line arguments specify the directories to be searched.
# If no directories are specified on the command-line,
# it searches the current directory.
# Files whose names start with "._" are ignored.
#
# Cameron Hayne (macdev@hayne.net) January 2006 (revised March 2006)
#
#
# Examples of use:
# ----------------
# findDupeFiles '.aif|.aiff' AAA BBB CCC
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the directories AAA, BBB, and CCC
#
# findDupeFiles '.aif|.aiff'
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the current directory
#
# findDupeFiles '' AAA BBB CCC
# would look for duplicates among all the files (no matter what suffix)
# under the directories AAA, BBB, and CCC
#
# findDupeFiles
# would look for duplicates among all the files (no matter what suffix)
# under the current directory
# -----------------------------------------------------------------------------

use strict;
use warnings;

use File::Find;
use File::stat;
use Digest::MD5;
use Fcntl;

#REMOVE WHEN WE MERGE - UNNECESSARY
my $debug=0;

my $matchSomeSuffix;
if (defined($ARGV[0])) {
my @suffixes = split(/\|/, $ARGV[0]);
if (scalar(@suffixes) > 0) {
my $matchExpr = join('||', map {"m/\$suffixes[$_]\$/io"} 0..$#suffixes);
$matchSomeSuffix = eval "sub {$matchExpr}";
}
shift @ARGV;
}

my @searchDirs = @ARGV ? @ARGV : ".";
foreach my $dir (@searchDirs) {
die "\"$dir\" is not a directory\n" unless -d "$dir";
}
my %filesByDataLength;

sub calcMd5($) {

my ($filename) = @_;
if (-d $filename) {
return "unsupported";
}
sysopen(FILE, $filename, O_RDONLY) or die "Unable to open file \"$filename\": $!\n";
binmode(FILE);
my $md5 = Digest::MD5->new->addfile(*FILE)->hexdigest;
close(FILE);
return $md5;
}

sub hashByMd5($) {

my ($fileInfoListRef) = @_;
my %filesByMd5;
foreach my $fileInfo (@{$fileInfoListRef}) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $md5 = calcMd5("$dirname/$filename");
push(@{$filesByMd5{$md5}}, $fileInfo);
}
return \%filesByMd5;
}

sub checkFile() {

return unless -f $_;
my $filename = $_;
my $dirname = $File::Find::dir;
return if $filename =~ /^\._/;
if (defined($matchSomeSuffix)) {
return unless &$matchSomeSuffix;
}
my $statInfo = stat($filename) or warn "Can't stat file \"$dirname/$filename\": $!\n" and return;
my $size = $statInfo->size;
my $fileInfo = { 'dirname' => $dirname,
'filename' => $filename,
};
push(@{$filesByDataLength{$size}}, $fileInfo);
}

MAIN: {

find(\&checkFile, @searchDirs);
my $numDupes = 0;
my $numDupeBytes = 0;
if ( $debug ) {
print "Dupe Checking\n";
} else {
print "Dupe Checking - ";
}
foreach my $size (sort {$b<=>$a} keys %filesByDataLength) {
my $numSameSize = scalar(@{$filesByDataLength{$size}});
next unless $numSameSize > 1;
if ( $debug ) {
print "size: $size numSameSize: $numSameSize\n";
}
my $filesByMd5Ref = hashByMd5($filesByDataLength{$size});
my %filesByMd5 = %{$filesByMd5Ref};
foreach my $md5 (keys %filesByMd5) {
my @sameMd5List = @{$filesByMd5{$md5}};
my $numSameMd5 = scalar(@sameMd5List);
next unless $numSameMd5 > 1;
my $rsrcMd5;
my $dupe_counter=0;
foreach my $fileInfo (@sameMd5List) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $filepath = "$dirname/$filename";
if ( $dupe_counter == 0 ) {
if ( $debug ) {
print "KEEPING $filepath - MD5 $md5\n";
}
$dupe_counter++;
} else {
if ( $debug ) {
print "DELETING $filepath - MD5 $md5\n";
} else {
print "D";
}
unlink("$filepath");
}
}
if ( $debug) {
print "----------\n";
}
$numDupes += ($numSameMd5 - 1);
$numDupeBytes += ($size * ($numSameMd5 - 1));
}
}
print "----------\n";
my $numDupeMegabytes = sprintf("%.1f", $numDupeBytes / (1024 * 1024));
print "Number of duplicate files: $numDupes\n";
print "Estimated Mb Savings: $numDupeMegabytes\n";
}



, Mike