Monday, December 31, 2007

Network Port Querying Script

Hey there,

The script I've put together here was originally written to meet a certain demand. That demand was actually my own, but that's beside the point ;)

This script should come in handy if you ever need to query a port and find out what's going on with it (like who's using it and/or what process ID is associated with it). It's simple to invoke (taking only the port number as its argument) and produces information that can be a great aid in troubleshooting network connection issues.
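
Under the hood, the manual version of this check amounts to a couple of commands like these (the lsof path may vary on your system, and port 22 is just an example):

host # /bin/netstat -a | grep -w 22
host # /usr/local/bin/lsof -i :22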

If you refer back to this previous post, you can check out a small walkthrough regarding how to query a port using lsof and/or the proc commands. This script also uses lsof, but combines it with netstat to produce output in an easy-to-read format, while grabbing a little more information in the process. Assuming we call it portquery, it can be invoked like this:

host # ./portquery 22 <--- Let's just see what's going on with SSH

and it will produce output for you like the following. Note that it produces a formatted output block for every single process connected to a port. On a high-traffic machine, checking SSH might produce a few pages of output. This is what it looks like when it's run:

Port 22 Information :
  Service = sshd
  PID = 469
  User = root
  Protocol = TCP
  Status = LISTEN
Port 22 Information :
  Service = sshd
  PID = 469
  User = jimmy88
  Protocol = TCP
  Status = LISTEN


...and the list goes on to print out information blocks for every PID attached to that port. This script has been a great help for me not only in that it makes a manual process automatic, but also in that it's easy for other non-admins to read.

Here's hoping you have some use for it :)

Best Wishes,


Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!/bin/ksh

#
# 2007 - Mike Golvach - eggi@comcast.net
#
# Usage: portquery [port number]
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

# Signal 9 (SIGKILL) can't actually be caught, so we only trap the catchable ones
trap 'exit' 1 2 3 15

if [ $# -ne 1 ]
then
    echo "Usage: $0 portNumber"
    exit 1
fi

portnumber=$1

# Bail out early if nothing on the box is using the port at all
/bin/netstat -a | grep -w "$portnumber" >/dev/null 2>&1

if [ $? -ne 0 ]
then
    echo "Nothing's listening on - or using - port $portnumber"
    exit 1
fi

# Walk every lsof line that mentions the port and break out the fields
# we care about (the port variable is quoted to keep odd input in one piece)
/usr/local/bin/lsof 2>&1 | grep -v "^lsof:" | grep -w "$portnumber" | while read x
do
    portinfo=`echo $x | awk '{print $1 " " $2 " " $3 " " $4 " " $5 " " $6 " " $7 " " $8 " " $9 " " $10}'`
    echo "Port $portnumber Information :"
    echo "  Service = `echo $portinfo | awk '{print $1}'`"
    echo "  PID = `echo $portinfo | awk '{print $2}'`"
    echo "  User = `echo $portinfo | awk '{print $3}'`"
    echo "  Protocol = `echo $portinfo | awk '{print $8}'`"
    echo "  Status = `echo $portinfo | awk '{print $10}' | sed 's/[()]//g'`"
done



, Mike




Sunday, December 30, 2007

Checking For Valid Email Addresses In Your CGI Forms

This weekend's obligatory code post is a little something I slapped together during a security remediation on a completely different script (it was actually written in ksh).

I actually enjoy working with the security department on issues like this, because it gets me thinking about the potential flaws in the stuff I write and how I can improve on what I've already done. I don't think I could work as a security specialist, though; at least not stuck parsing scripts all day looking for minor errors (or major ones). It's not in my blood. I need to be able to get up and wander off to a server room, or someone else's desk, and give my eyes a break.

Since I was writing this in Perl, I put in a little extra functionality at the end to actually try to resolve the domains once they've made it past the cursory "grammar check." Hopefully you'll be able to integrate some or all of this into any CGI forms or scripts you've written or are in the process of writing.

This is not a complete CGI script; it's more of a function that you can use and add in. If you do decide to do so, and use this for any forms processing, remember to add a section to strip out illegal characters. This script was written to very exacting specifications from the requestor. Check out the last part of this earlier post, which goes into more detail about taking care of that end of the business.
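
Since the script this came out of was originally written in ksh, here's a minimal sketch of that character-stripping step in shell terms (the file name and the character class are just examples; tune the set of allowed characters to your own form's requirements):

host # tr -cd 'A-Za-z0-9@._|+ \n-' < form_input.txt > form_input.clean

Everything outside the allowed set simply gets deleted before the records ever reach the validation logic.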

I think I covered all the bases. At this point, I probably still wouldn't want to hand this over to the security department ;) It has its flaws, for sure, but I like to put out works-in-progress if only to help kick-start some creative thinking on my (and/or your) part. In the end, it makes for better quality of work.
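
One usage note before the code: since the script select()'s STDERR for its complaints and only prints records that pass to STDOUT, you can run it as a straight filter. A hypothetical invocation (the script and file names are just for illustration) might look like:

host # ./emailcheck.pl < customer_records.txt > good_records.txt 2> rejects.log

Anything that fails a check lands in rejects.log along with the reason; the clean, pipe-delimited records come out the other side untouched.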

Best Wishes,


Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!/usr/local/bin/perl

#
# 2007 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

use Net::DNS;

$res = new Net::DNS::Resolver;

select(STDERR);

while (<STDIN>) {
    chomp; # In certain cases you may need to change this line to chop;
    $_ =~ s/\r$//;
    $_ =~ s/ *$//;
    ($cid,$email,$rest) = split(/\|/,$_,3);
    $cid =~ s/ //g;
    if ($email !~ /\@/) {
        print "No \@ symbol in address: $cid, $email\n";
    } else {
        ($name,$dom) = split(/\@/,$email,2);
        if ($name eq "") {
            print "No name on left of \@ symbol: $cid, $email\n";
        } elsif ($dom eq "") {
            print "No domain on right of \@ symbol: $cid, $email\n";
        } elsif ($dom =~ /\@/) {
            print "Too many \@ symbols: $cid, $email\n";
        } elsif ($email =~ / /) {
            print "Space character in email address: $cid, $email\n";
        } elsif ($dom =~ /[,]/) {
            print "Comma right of \@ symbol: $cid, $email\n";
        } else {
            $dom =~ tr/A-Z/a-z/;
            $dom =~ s/\.$//;
            $good = 0;
            if ($dcache{$dom} != 1) {
                # Try a straight host lookup first, then fall back to an MX record lookup
                ($nm,$aliases,$addrtype,$length,@addrs) = gethostbyname("$dom");
                if ($nm ne "") {
                    $good = 1;
                } else {
                    @mx = mx($res, $dom);
                    if ($#mx >= 0) {
                        $good = 1;
                    } else {
                        print "Bad domain right of \@ symbol: $cid, $email\n";
                    }
                }
                if ($good) {
                    $dcache{$dom} = 1;
                } else {
                    next;
                }
            }
            print STDOUT "$cid\|$name\@$dom\|$rest\n";
        }
    }
}


, Mike




Saturday, December 29, 2007

Securing All Of A Host's Network Programs At Once Using Extended ACLs

At most shops where I've worked, a lot of time and effort is put into securing this and that network service from certain users. Mostly, it's done on a per-process basis, like segregating Telnet or RSH, since those are considered insecure. But, for the most part, what I find security folks really want to do when locking down a box is to prevent regular users from running any network commands at all, and only allow certain people in certain groups to avail themselves of those services. This can be likened to most firewall setups: block everything and allow what you need, as opposed to allowing everything and blocking what you find out is dangerous (which can be disastrous).

This is actually very simple to accomplish on both Linux and Solaris by using simple group permissions and file ACLs (or FACLs). In this post we'll walk, step by step, through setting up a box with network services totally locked down for any users we don't explicitly want to use them. Here we go:

First, you'll want to work on the group angle. All users are already members of a group (or groups), and these groups are used to limit (or augment) their access to certain areas of the OS and certain programs. For our purposes here, we'll set up an additional group. We'll call it noudptcp for kicks (since we're not going to let anyone who has this group as their primary group use any UDP or TCP-based programs). We'll set it up just like any other group, like so:

bash@host # groupadd -g 999 noudptcp

Next, we'll visit the two critical files that we'll need to change in order to pull this all together in the next few steps: /dev/udp and /dev/tcp. They should look like this by default:

bash@host # ls -lL /dev/udp /dev/tcp
crw-rw-rw- 1 root sys 41, 0 Oct 2 2006 /dev/udp
crw-rw-rw- 1 root sys 42, 0 Oct 2 2006 /dev/tcp


Note that both of these character special files (device drivers) are readable and writable by everyone. This is one of the reasons any user can use Telnet, RSH, SSH and many other network-based programs on your system. Here, we'll run the getfacl command to view the ACL for the /dev/tcp device driver file and take a look at it, like so:

bash@host # getfacl /dev/tcp

# file: /dev/tcp
# owner: root
# group: sys
user::rw-
group::rw- #effective:rw-
mask:rw-
other:rw-


Then, we'll augment the ACL on both files (they're essentially the same) my favorite way. I like to just take the output of getfacl, modify it in a file, and then feed that file to setfacl (Note that I'm doing this on Solaris and that I will point out the differences for Linux along the way - for instance, the Linux getfacl output will look slightly different than this, but you can still dump it to a file and modify it), like so:

bash@host # cat FACL_FILE <--- We'll assume I already edited it, because that's so hard to emulate using a simple keyboard ;) -- Note also that I've removed the comments, as they don't get applied by setfacl and may be confusing.

user::rw-
group::rw- #effective:rw-
group:noudptcp:--- #effective:---
mask:rw-
other:rw-


Now, we feed this file to setfacl (Note that we only added the one extra line to set permissions for the noudptcp group) and the ACL will be updated so that members of the noudptcp group won't have any access to either /dev/tcp or /dev/udp (In Linux, use -M instead of -f):

bash@host # setfacl -f FACL_FILE /dev/tcp
bash@host # setfacl -f FACL_FILE /dev/udp
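
On Linux, you can also skip the intermediate file entirely and pass the entry directly with -m (a sketch, assuming your distribution exposes these device files the same way):

bash@host # setfacl -m group:noudptcp:--- /dev/tcp
bash@host # setfacl -m group:noudptcp:--- /dev/udp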


And now we can use getfacl to show the new permissions. In the "ls -lL" output, you'll notice the "+" at the end of the permissions listing, indicating that the file's ACL has been extended beyond the standard permission bits, and the output of the getfacl command will be the same as our FACL_FILE, with the comments added back by default:

bash@host # ls -lL /dev/udp /dev/tcp
crw-rw-rw-+ 1 root sys 41, 0 Oct 2 2006 /dev/udp
crw-rw-rw-+ 1 root sys 42, 0 Oct 2 2006 /dev/tcp
bash@host # getfacl /dev/tcp

# file: /dev/tcp
# owner: root
# group: sys
user::rw-
group::rw- #effective:rw-
group:noudptcp:--- #effective:---
mask:rw-
other:rw-


Finally, all we have to do is add any users we deem necessary to the noudptcp group, and they will not be able to access any network services! Note that this will not prevent them from connecting "to" the box; it will just prevent them from connecting to another box from your box, or using ping, or any other program that needs to access /dev/tcp or /dev/udp.
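
For example, moving an existing user into the group (assuming a user named jimmy88, and that you want noudptcp to be his primary group) is a one-liner; the change takes effect at his next login:

bash@host # usermod -g noudptcp jimmy88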

For instance, a few select lines of truss output show what happens when a user in the noudptcp group attempts to SSH off of the box:

open("/dev/udp", O_RDONLY) Err#13 EACCES
so_socket(2, 2, 0, "", 1) Err#13 EACCES
write(2, 0xFFBEF058, 27) = 27
s o c k e t : P e r m i s s i o n d e n i e d\r\n


Congratulations! Now everyone you don't want to be able to use network services on your box is no longer a threat. At least, not in that way ;) And, even better, they can still use everything else that they would normally be allowed to!

Cheers,

, Mike




Friday, December 28, 2007

Using Your Shell To Generate RANDOMness For Security Software

Random number generation, or entropy gathering, is something I've always found interesting, because it's so rarely looked into (to the extent that is possible), even as the basic underlying principle becomes so heavily depended upon. As we enter an age where encryption is becoming not only the standard for network security but, at times, the only option, this seems insane. I've literally walked into dozens of security software "situations" where it had been agreed upon, all around, that an impasse had been reached, because prngd, egd or the /dev/random and/or /dev/urandom character special devices were no longer functioning correctly. There was simply nothing left to do but wait for vendor support to figure out the problem or, at the very best, wait while we re-installed such and such an OS or software package.

This is a huge problem when your security department has decided that, not only is Telnet no longer a safe connection method, but it's so unsafe that it shouldn't even be on the OS as a backup method for connection. You're left with three options: A secure connection, direct console connection or nothing!

Of course, in a situation like the one I'm describing above, when the security software (Let's say SSH) stops functioning (and no one has a lingering session), you'll need to do your work via the system console. But you should, theoretically, be able to get SSH back up on its feet and running in just slightly more time than it would take you to restart it, once you've logged into the console (hopefully from home :)

The main reason that I see SSH go down is a problem with whatever random number generator it was compiled to use. Of the few I mentioned above, the /dev/random and /dev/urandom character special devices are the most commonly built-in "randomness" generators for SSH. On some older setups (or for those traditionalists), this is left open and/or set to read random information from a straight file (like /etc/random, or whatever you decide to call it).

On the older versions (where the program reads from a file to "generate randomness"), you almost never see this problem, because there's nothing convoluted about the process. We're assuming SSH itself is fine in all of these examples and, unless you don't have read access to the file or the filesystem it's on is corrupted, there's almost no way straight-file reads used to create randomness can go wrong.

When you're reading from a domain socket or a character special device (like /dev/urandom) is when you may end up having an issue. However, you can get around this in your shell using a common variable called RANDOM (not very cleverly named, but a good indicator of its actual function - available in ksh, bash and almost all shells). This variable produces a random number between 0 and 32767 (actual results may vary depending on the OS, bit strength of the OS, etc.) and its value changes every time it's referenced. For instance:

host # a=0;while [ $a -lt 5 ];do echo $RANDOM;let a=$a+1;done
17417
6453
11016
3054
8647


Now, to proceed. We're going to assume that the default file your SSH uses to generate randomness and transfer it to /dev/urandom is missing. Otherwise, you'd just have to fix that and this would be no fun at all ;) In order to incorporate this functionality into your downed SSH server, you'll need to, in effect, create your own /dev/urandom and move the other one out of the way (It's cooked anyway, right? ;)

host # mv /dev/random /dev/norandom
host # mv /dev/urandom /dev/nourandom


Then proceed to create your simple script (we're just going to script on the command line here) which, while it won't be nearly as secure as the "real" /dev/random was, will keep you afloat until you get the fix in. Theoretically, you could make it as complicated and as secure as you like, as long as you follow the general outline, like this:

First recreate the two necessary random files:

host # mknod /dev/random c 1 8
host # mknod /dev/urandom c 1 9
host # chown root:root /dev/random /dev/urandom
host # chmod 0644 /dev/random /dev/urandom


Then create the random text file, like so (We only need 512 bytes):

host # a=0;while [ $a -lt 92160 ];do echo $RANDOM >>BING;let a=$a+1;done
host # cat BING >/dev/urandom


Note that the above command line (which dumps its output into the file named BING) could be a full-fledged shell script. It could be highly convoluted and as hard to comprehend as you prefer. The script doesn't even need to work, as long as it dumps out enough error data and you're redirecting STDERR to STDOUT (e.g. 2>&1).

If you want, you can save this output for later, but, hopefully, you won't be needing it. Just dd (disk to disk copy) it back to itself, like this:

host # dd if=/dev/urandom of=BING count=1 >/dev/null 2>&1
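
If you want to satisfy yourself that reads from the new device actually work before restarting anything, a quick eyeball test like this (od is just one convenient way to look at the bytes) should do:

host # dd if=/dev/urandom bs=16 count=1 2>/dev/null | od -x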

Now you should be able to startup your SSH Daemon and finally get back to work ;)

Cheers,

, Mike




Thursday, December 27, 2007

Securing SUID Programs Using A Simple C Wrapper

This is an issue that comes up almost constantly, given the very nature of the Linux and Unix security model and the environments in which most of us admins work. More often than not, application users on machines will "need" to run scripts that require root privilege to function correctly. One such example would be the "ping" command. Although this seems like a harmless, and openly available, network troubleshooting tool, the only reason regular users can run it is because it's SUID-root. This, as simply as possible, means that the command runs as the user "root" no matter who actually invokes it. The setuid bit (the "s" in -r-sr-xr-x) is a permission that indicates that the given command will run as the userid that owns it. root owns the ping command; therefore, when users run ping, they're running it as the user root.
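
If you're curious which other programs on your own box run this way, a find for setuid-root files (run as root; narrow the path down if / is too broad for your taste) will show you the whole list:

host # find / -user root -perm -4000 -type f 2>/dev/null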

Now, the ping command has been around for quite a while and, as with almost all setuid programs, it's been the subject of many security compromises and exploits over the years. In general, because of the fact that "ping" is such an integral part of the OS, you don't need to worry about wrapping it (or other programs like it) in order to protect yourself against harm (Your vendor - or the development community - should be trying their hardest to do that for you :)

Instances do exist where regular users require that an uncommon command be run on a frequent basis, in order for them to do their jobs. That program (we'll just call it "PROGRAM" for no particular reason ;) needs to be run as another user for it to function correctly and it has to be run frequently enough that it becomes an inconvenience to "not" allow the users to run the command themselves. SUID (or setuid) wrapper scripts can be most effectively used in these sorts of situations.

A wrapper script/program is (for all intents and purposes) just another layer of protection for the admins, users and operating system. If created properly and used judiciously, it can help minimize the risk associated with allowing regular users to run commands as user IDs other than their own.

Optimally, you would want to limit SUID script/program execution to another generic user (if possible). So, for instance, if an application user needs a program to be run as the user oracle, setting them up with a shell wrapper to run that command as the oracle user shouldn't be cause for too much concern. The greatest security risk (no matter the relative security-weight of different accounts on your system) is when you need to wrap a script or program to be run as root.

Below, I've put together a simple wrapper written in C (check out Wietse Zweitze Venema's website, and work, for a really, really, really secure wrapper). I write the majority of my scripts in bash, ksh and Perl, but the SUID wrapper really requires that it be compiled in order to serve its purpose most effectively. If people can easily read your code, it'll be easier for them to figure out ways around whatever steps you're taking to secure your servers. I'm not saying that, just because they could read your code, they could break it; but it would certainly make it easier for them. At the other extreme, if anyone got write access to a SUID script (assuming root privilege, since almost every OS now resets the setuid bits if a setuid script is modified by a regular user), they could (easily) change it a little and stand a good chance that no one would notice that they'd created a backdoor for themselves. If you modify a compiled C binary, it probably won't run anymore (which is the best security there is ;)

We'll dive into the sea of C in a future post, since it can be complicated and is rarely necessary to know in order to administer, or use, a system.

For the script below, just substitute the PROGRAM you want to wrap, the arguments, if any (this script assumes only one, "-v" - if you have more, add them as comma-separated entries just like the first, and before the NULL entry specification), and the group ID check (comment this out if you don't want to use it as an extra level of access checking). We also make sure to change the real and effective uid and gid to "root" (make this any ID you want) only after performing the access checks! Extra care is taken to drop to the regular user's real and effective uid and gid even before all that.

Note also that we use the strncmp function instead of strcmp (for string comparison) to check the command line arguments. The reason we use it is that strncmp requires a number as its final argument and will not read past that many characters (I use start and stop as my only two options, and you can see those lengths in the strncmp arguments accordingly). This helps prevent a malicious user from attempting a buffer overflow that might allow them to crack your wrapper from the command line!

This C code can be compiled with gcc versions at least all the way back to 2.8.1. It can be compiled very simply, like so:

gcc -o PROGRAM wrapper.c (with the argument to the "-o" option being whatever you want to call the compiled PROGRAM, and wrapper.c being the C source code below)
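
One step the compile line doesn't cover: the wrapper only does its job if it's actually installed setuid. Assuming you want it owned by, and running as, root (per the uid 0 calls in the code), installation would look something like this:

host # chown root:root PROGRAM
host # chmod 4755 PROGRAM

The leading 4 in the mode is the setuid bit itself - without it, the execl just runs as the invoking user.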

Enjoy!


Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#include <stdio.h>
#include <stdlib.h> /* for exit() */
#include <sys/types.h>
#include <unistd.h>
#include <signal.h>
#include <string.h> /* for strncmp() */

/********************************************
* Wrapper - Secure Yourself *
* *
* 2007 - Mike Golvach - eggi@comcast.net *
* *
* Usage: COMMAND [start|stop] *
* *
********************************************/

/* Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License */

/* Define global variables */

int gid;

/* main(int argc, char **argv) - main process loop */

int main(int argc, char **argv)
{

    /* Set euid and egid to actual user */

    gid = getgid();
    setegid(getgid());
    seteuid(getuid());

    /* Confirm user is in GROUP(999) group */

    if ( gid != 999 ) {
        printf("User Not Authorized! Exiting...\n");
        exit(1);
    }

    /* Check argc count only at this point */

    if ( argc != 2 ) {
        printf("Usage: COMMAND [start|stop]\n");
        exit(1);
    }

    /* Set uid, gid, euid and egid to root */

    setegid(0);
    seteuid(0);
    setgid(0);
    setuid(0);

    /* Check argv for proper arguments and run
     * the corresponding script, if invoked.
     */

    if ( strncmp(argv[1], "start", 5) == 0 ) {
        if (execl("/usr/local/bin/COMMAND", "COMMAND", "-v", NULL) < 0) {
            perror("Execl:");
        }
    } else if ( strncmp(argv[1], "stop", 4) == 0 ) {
        if (execl("/usr/local/bin/COMMAND", "COMMAND", "-v", NULL) < 0) {
            perror("Execl:");
        }
    } else {
        printf("Usage: COMMAND [start|stop]\n");
        exit(1);
    }
    exit(0);
}


, Mike




Wednesday, December 26, 2007

Simple Factorial Generation - Perl versus Bash

Hey there,

I've seen this floating around the boards, so I thought I'd add my 2 cents. Lots of folks (more homework? When will it end?) are looking for scripts to help them find the factorial of any given number.

For those of you who may not know, the factorial of a number is the number itself multiplied by all the numbers from 1 up to that number. So, the factorial of 3 is: 1 times 2 times 3 = 6

Some of the scripts I see are severely convoluted, so I thought I'd put this up here as a little homework help. It can be solved with Perl in 10 lines (Could be less if I wasn't so hung up on formatting ;)

Interestingly enough, it can be done in the same number of lines in Linux's bash shell, like so (assuming a recursive function). Or, as I wrote in a previous post, you "could" do it in 1 ;)

factorial () {

    # echo the result rather than using "return" - a shell function's
    # return code only goes up to 255, which factorials outgrow at 6!
    local number=$1
    if [ "$number" -eq 0 ]
    then
        echo 1
    else
        local next=$((number - 1))
        echo $(( number * $(factorial $next) ))
    fi
}
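
Once the function is defined (or sourced) in your shell, using it is as simple as:

host # factorial 5
120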



Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License


#!/usr/bin/perl

#
# 2007 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

print "factorial of: ";
chomp($factorial = <STDIN>);
$number = $factorial;
if ( $factorial == 0 ) {
    $factorial = 1;
}
for ( $factor = $factorial - 1; $factor >= 1; --$factor ) {
    $factorial *= $factor;
}
printf("Factorial of %d is : %g\n", $number, $factorial);


Enjoy,

, Mike






Tuesday, December 25, 2007

Corrected Some Blogspot Code Auto-Rewrites in Previous Two Posts

Just a quick shout.

I noticed a few errors caused by Blogspot's tag interpretation that made the code I posted for the last 2 days unworkable. I've fixed these errors.

If you notice any others (they almost always have to do with the < and > characters), please feel free to email me and let me know, and I'll be happy to revise whatever didn't make it to the post page as I intended.

You can check out this post regarding some of the problems posting code to blogspot. Maybe it will help you out a bit, too :)

, Mike




Website and URL Downloading Using ActiveState Perl for Windows

Merry Christmas :) Again, my apologies to those of you who don't observe the holiday.

Today's post (inlined scripts aside) will be nice and short. I'll leave it to you to refer to yesterday's post if you want more information, or commentary, regarding the basics of this script and how to use it, as it's almost exactly the same.

These are special versions of the dUrl and dUpeDL scripts that I've ported to use ActiveState's Perl for Windows. This comes in handy when you want to do some massive downloading and don't have access to a Linux or Solaris box.

Note that the one requirement of the original script, "wget," is also needed for this version of the script to work. You can download that for free at http://www.christopherlewis.com/WGet/default.htm. Also, again, note that you should modify the lines in dUrl that contain the dUpeDL script reference, as they indicate a fictitious location for it!

Enjoy the day off of work, the company of your family and/or friends and, hopefully, some much deserved peace and rest.

Cheers!


Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!c:\perl\bin\perl

#
# 2007 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001a
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#

use LWP::Simple;

if ( $#ARGV < 0 ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}

$debug=0;
$multi=0;
$counter=1;
# simple for now - better save-system later, maybe...
# also, we'll make the shared download system a function

if ( $ARGV[0] eq "-f" ) {
if ( ! -f $ARGV[1] ) {
print "Can't find URL file $ARGV[1]!\n";
exit(2);
}
$multi_file=$ARGV[1];
$multi=1;
chomp($download_dir="$ARGV[1]");
$download_dir =~ s/\//_/g;
$download_dir =~ s/\\/_/g;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
} else {
chomp($download_dir="$ARGV[0]");
if ( $download_dir !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
$download_dir =~ s/.*\/\/([^\/]*).*/$1/;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
}

if ( $multi == 0 ) {
@dl_list=();
$url="@ARGV";
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "No point in continuing...\n";
print "Moving $download_dir to ${download_dir}.empty...\n";
rename("$download_dir", "${download_dir}.empty");
exit(4);
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`cd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
# Work that bastard extra hard if it's a PHP Trick-Link
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget.exe -q $dl_req");
} else {
# We need wget.exe because the Simple GET can't follow trails
system("wget.exe -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
# PHP links are a pain -
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`dir /B *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget.exe -q $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("c:\\docume~1\\user\\desktop\\dUpeDL.pl");
chdir("$this_dir");
} elsif ( $multi == 1 ) {
open(MULTIFILE, "<$multi_file");
@multi_file = <MULTIFILE>;
close(MULTIFILE);
print "------------------- MULTIFILE MODE ------------------------\n";
foreach $multifile_entry (@multi_file) {
@dl_list=();
print "-------------------- FILE $counter ------------------------\n";
$url="$multifile_entry";
if ( $url !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "Trying next FILE\n";
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`cd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget.exe -q $dl_req");
} else {
system("wget.exe -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`dir /B *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget.exe -v $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("c:\\docume~1\\user\\desktop\\dUpeDL.pl");
chdir("$this_dir");
$counter++;
}
}

$|=1;

if ( $multi == 1 ) {
chdir("$this_dir");
rename("$multi_file", "${multi_file}.done");
}
exit(0);


---- dUpeDL - Based almost entirely on the findDupeFiles script by Cameron Hayne (macdev@hayne.net) - modified for win32

#!c:\perl\bin\perl

#
# dUpeDL - Based on the following script - only slightly modified to work with
# dUrl and Windows Perl.
# Below: The original liner notes for full attribution to the original author.
# Note that the attribution was taken verbatim from the Linux/Unix script and may not
# be entirely accurate due to the fact that this script is a win32 port.
#
# findDupeFiles:
# This script attempts to identify which files might be duplicates.
# It searches specified directories for files with a given suffix
# and reports on files that have the same MD5 digest.
# The suffix or suffixes to be searched for are specified by the first
# command-line argument - each suffix separated from the next by a vertical bar.
# The subsequent command-line arguments specify the directories to be searched.
# If no directories are specified on the command-line,
# it searches the current directory.
# Files whose names start with "._" are ignored.
#
# Cameron Hayne (macdev@hayne.net) January 2006 (revised March 2006)
#
#
# Examples of use:
# ----------------
# findDupeFiles '.aif|.aiff' AAA BBB CCC
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the directories AAA, BBB, and CCC
#
# findDupeFiles '.aif|.aiff'
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the current directory
#
# findDupeFiles '' AAA BBB CCC
# would look for duplicates among all the files (no matter what suffix)
# under the directories AAA, BBB, and CCC
#
# findDupeFiles
# would look for duplicates among all the files (no matter what suffix)
# under the current directory
# -----------------------------------------------------------------------------

use strict;
use warnings;

use File::Find;
use File::stat;
use Digest::MD5;
use Fcntl;

#REMOVE WHEN WE MERGE - UNNECESSARY
my $debug=0;

my $matchSomeSuffix;
if (defined($ARGV[0])) {
my @suffixes = split(/\|/, $ARGV[0]);
if (scalar(@suffixes) > 0) {
my $matchExpr = join('||', map {"m/\$suffixes[$_]\$/io"} 0..$#suffixes);
$matchSomeSuffix = eval "sub {$matchExpr}";
}
shift @ARGV;
}

my @searchDirs = @ARGV ? @ARGV : ".";
foreach my $dir (@searchDirs) {
die "\"$dir\" is not a directory\n" unless -d "$dir";
}
my %filesByDataLength;

sub calcMd5($) {

my ($filename) = @_;
if (-d $filename) {
return "unsupported";
}
sysopen(FILE, $filename, O_RDONLY) or die "Unable to open file \"$filename\": $!\n";
binmode(FILE);
my $md5 = Digest::MD5->new->addfile(*FILE)->hexdigest;
close(FILE);
return $md5;
}

sub hashByMd5($) {

my ($fileInfoListRef) = @_;
my %filesByMd5;
foreach my $fileInfo (@{$fileInfoListRef}) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $md5 = calcMd5("$dirname/$filename");
push(@{$filesByMd5{$md5}}, $fileInfo);
}
return \%filesByMd5;
}

sub checkFile() {

return unless -f $_;
my $filename = $_;
my $dirname = $File::Find::dir;
return if $filename =~ /^\._/;
if (defined($matchSomeSuffix)) {
return unless &$matchSomeSuffix;
}
my $statInfo = stat($filename) or warn "Can't stat file \"$dirname/$filename\": $!\n" and return;
my $size = $statInfo->size;
my $fileInfo = { 'dirname' => $dirname,
'filename' => $filename,
};
push(@{$filesByDataLength{$size}}, $fileInfo);
}

MAIN: {

find(\&checkFile, @searchDirs);
my $numDupes = 0;
my $numDupeBytes = 0;
if ( $debug ) {
print "Dupe Checking\n";
} else {
print "Dupe Checking - ";
}
foreach my $size (sort {$b<=>$a} keys %filesByDataLength) {
my $numSameSize = scalar(@{$filesByDataLength{$size}});
next unless $numSameSize > 1;
if ( $debug ) {
print "size: $size numSameSize: $numSameSize\n";
}
my $filesByMd5Ref = hashByMd5($filesByDataLength{$size});
my %filesByMd5 = %{$filesByMd5Ref};
foreach my $md5 (keys %filesByMd5) {
my @sameMd5List = @{$filesByMd5{$md5}};
my $numSameMd5 = scalar(@sameMd5List);
next unless $numSameMd5 > 1;
my $rsrcMd5;
my $dupe_counter=0;
foreach my $fileInfo (@sameMd5List) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $filepath = "$dirname/$filename";
if ( $dupe_counter == 0 ) {
if ( $debug ) {
print "KEEPING $filepath - MD5 $md5\n";
}
$dupe_counter++;
} else {
if ( $debug ) {
print "DELETING $filepath - MD5 $md5\n";
} else {
print "D";
}
unlink("$filepath");
}
}
if ( $debug) {
print "----------\n";
}
$numDupes += ($numSameMd5 - 1);
$numDupeBytes += ($size * ($numSameMd5 - 1));
}
}
print "----------\n";
my $numDupeMegabytes = sprintf("%.1f", $numDupeBytes / (1024 * 1024));
print "Number of duplicate files: $numDupes\n";
print "Estimated Mb Savings: $numDupeMegabytes\n";
}



, Mike




Monday, December 24, 2007

Mass URL Downloading Using Perl for Linux or Solaris

Hopefully everyone's Holidays are going well. For Christmas Eve (which many of you may not observe, but our family traditionally does), I've decided to put a script I'm currently working on at the gift table. Our family heritage celebrates Christmas on the Eve, so this is actually on time. Since the other half of my family are traditionalists, we still open a few gifts on the regular Holiday. To that end, I've put together an interesting variation on this script (the same, but wholly different) for tomorrow's post.

I call the script below "dUrl" because I wrote it to download tons of URLs (Clever? No ;) I've actually used this script for business-related purposes, with sections ripped out, as it does have its place in that environment. For the most part, though, I use it to do massive downloading of pictures, video, audio, etc. All the stuff that's totally unnecessary and, therefore, the most fun to pull down on a massive scale. It will basically take any page you feed it and rip all the content down, including those bothersome PHP redirects that everyone uses nowadays to discourage this sort of activity :)

Although I don't put any restrictions in the script (in the form of "requires" statements), this might not run on Perl versions lower than 5.6. Just a guess based on some other scripts I've had to upgrade that used a lot of the same elements. This script has also been personally tested on RedHat Linux 7.x up to AS5 and Sun Solaris 2.6 through 10. It should theoretically work on any machine that can make use of Perl, with some possible minor revisions to any system calls.

This script only requires that you have the LWP::Simple Perl module installed (along with all of its required modules). I've found that the simplest way to set this all up (since the dependency tree can be humongous if you're working off of a plain-vanilla installation of Perl) is to use the CPAN module. If you don't have that already, download and install it; it comes standard with even the barest of Perl installations, as far as I know. In a future post I'll go into the old-style method of downloading all the modules and building them using the Makefile.PL method. Generally, this should do it (except the first time you use it, at which point it will ask you a bunch of questions it pretty much already knows the answers to ;) :

perl -MCPAN -e shell

> install LWP::Simple


and let it ride. You'll probably have to answer "y" a lot.

The script is fairly simple to use. It runs in two different modes: Single and Multi File mode. In Single File mode (and I define "file," in this context, as a URL), it's invoked like this:

host # ./dUrl http://host.com/webpage.html

In Multi File mode, it's invoked much the same way, except the argument is a real file that contains a list of URLs, one per line, like so:

host # ./dUrl FILENAME
host # cat FILENAME
http://host.com/page1.html
http://host.com/page2.html
...


This is a project in progress (Working on getting my SourceForge/FreshMeat pages up) and there are several things to note. It currently:

- Does not support SSL. Only the http protocol will work with it.
- Has built in support for dealing with PHP redirects. These are a royal pain in the bottom and I'm probably going to make this an entirely separate script, or module, that this script will use. Currently, I'm dissecting and re-following links using temp file parsing. It works for most of the larger media-hosting providers, but is still too specific to work on "all" of them.
- This version only downloads images. Modify the $dl_item variable to get around this.
- Relies on "wget" for some of its functionality. This is due to the stage of infancy its in. See the beta version number for a good laugh ;)
- Is longer than it needs to be, for my own ease of debugging. Eventually, the duplication of typed text will be rolled up into loops.
- Contains some possibly curious liner notes ;)

And, finally, until I find a better solution (which may not be possible, because this one's pretty darned good), I incorporate a slightly modified version of the "findDupeFiles" script, written by Cameron Hayne (macdev@hayne.net). It is only modified to the degree that I needed to make it work as an add-on to my script. You can find the original version of the file at http://hayne.net/MacDev/Perl/findDupeFiles. Although he doesn't know it, Cameron's excellent MD5-based duplicate finding script has been invaluable in saving me lots of time. I originally just looked for duplicates and renamed them using a simple incrementing-number scheme (since I'm paranoid and assume that I might accidentally lose something unique, no matter how unlikely the case ;) I've included the modified version of Cameron's script in this post as well, giving him full attribution with regards to his script's original header comment section.

Note, also, that I've renamed his script dUpeDL, since it's not the pure version. Modify the lines in dUrl that contain that script name, as they indicate a fictitious location for it!

Happy Christmas Eve, everyone :) For those of you who don't observe the Holiday, again, just consider this a special post that showed up at an arbitrary time ;)


Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

#!/usr/local/bin/perl

#
# 2007 - Mike Golvach - eggi@comcast.net - beta v.000000000000000001a
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#


use LWP::Simple;

if ( $#ARGV < 0 ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}

$debug=0;
$multi=0;
$counter=1;
# simple for now - better save-system later...
# also, we'll make the shared download system a function

if ( $ARGV[0] eq "-f" ) {
if ( ! -f $ARGV[1] ) {
print "Can't find URL file $ARGV[1]!\n";
exit(2);
}
$multi_file=$ARGV[1];
$multi=1;
chomp($download_dir="$ARGV[1]");
$download_dir =~ s/\//_/g;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
} else {
chomp($download_dir="$ARGV[0]");
if ( $download_dir !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
$download_dir =~ s/.*\/\/([^\/]*).*/$1/;
if ( ! -d $download_dir ) {
system("mkdir $download_dir");
}
if ( ! -d $download_dir ) {
print "Can't make Download Directory ${download_dir}!\n";
print "Exiting...\n";
exit(2);
}
}

if ( $multi == 0 ) {
@dl_list=();
$url="@ARGV";
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "No point in continuing...\n";
print "Moving $download_dir to ${download_dir}.empty...\n";
rename("$download_dir", "${download_dir}.empty");
exit(4);
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`pwd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
# Work that bastard extra hard if it's a PHP Trick-Link
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget -q $dl_req");
} else {
# We need wget because the Simple GET can't follow trails
system("wget -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
# PHP links are a pain -
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget -q $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("/export/home/users/dUpeDL");
chdir("$this_dir");
} elsif ( $multi == 1 ) {
open(MULTIFILE, "<$multi_file");
@multi_file = <MULTIFILE>;
close(MULTIFILE);
print "------------------- MULTIFILE MODE ------------------------\n";
foreach $multifile_entry (@multi_file) {
@dl_list=();
print "-------------------- FILE $counter ------------------------\n";
$url="$multifile_entry";
if ( $url !~ /^http:\/\//i ) {
print "Usage: $0 [URL|-f URL file]\n";
print "URL must be in http:// format - no https yet\n";
exit(1);
}
chomp($url);
print "Parsing URL $url...\n";
$dl = get("$url");
@dl = split(/[><]/, $dl);
print "Feeding $url To The Machine...\n";
foreach $dl_item (@dl) {
next if ( $dl_item !~ /(href|img)/ );
next if ( $dl_item !~ /http:\/\// );
next if ( $dl_item !~ /(jpg|jpeg|gif|png)/ );
$dl_item =~ s/(a href|img src)=('|")//;
$dl_item =~ s/('|").*//;
push(@dl_list, $dl_item);
}
$is_it_worth_it = @dl_list;
if ( $is_it_worth_it == 0 ) {
print "No Image References found!\n";
print "Trying next FILE\n";
}
print "Churning Out URL Requests...\n";
if ( $debug == 0 ) {
print "j=jpg g=gif p=png ?=guess\n";
}
chomp($this_dir=`pwd`);
chdir("$download_dir");
$start_time=(time);
foreach $dl_req (@dl_list) {
$tmp_dl="";
$req_filename = $dl_req;
$req_filename =~ s/.*\///;
if ( $debug ) {
print "Grabbing $req_filename\n";
} else {
$file_ext = $req_filename;
$file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$file_ext =~ s/^(\w).*/$1/;
print "$file_ext";
}
}
if ( $dl_req =~ /php\?/ ) {
$dl_req =~ s/\&/\\&/g;
system("wget -q $dl_req");
} else {
system("wget -q $dl_req");
}
}
$end_time=(time);
$seconds = sprintf("%d", $end_time - $start_time);
print "...DONE in $seconds seconds!\n";
print "Looking for PHP Trick-Links...\n";
chdir("$download_dir");
@file_list=`ls -1d *php*`;
$file_list=@file_list;
if ( $file_list ) {
print "PHP Trick-Links Found. Attempting To Unravel...\n";
foreach $php_file (@file_list) {
chomp($php_file);
open(PHPFILE, "<$php_file");
@php_file = <PHPFILE>;
if ( $php_file =~ /img.php/ ) {
print "IMG - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /SRC=/ ) {
$php_tail = $php_seg;
$php_tail =~ s/.*SRC=\"(.*?)\">.*/$1/;

$php_real_url = $php_root . $php_tail;
} elsif ( $php_seg =~ /HREF=http/ ) {
$php_root = $php_seg;
$php_root =~ s/.*=(http:\/\/[^\/]*\/).*/$1/;
chomp($php_root);
}
$php_real_url = $php_root . $php_tail;
}
} else {
print "REGULAR - ";
foreach $php_seg (@php_file) {
if ( $php_seg =~ /url=http/ ) {
$php_real_url=$php_seg;
$php_real_url =~ s/.*url=(http.*?)&.*/$1/;
}
}
}
close(PHPFILE);
if ( $debug ) {
print "Deleting Bogus Download: $php_file\n";
} else {
print "X=";
}
unlink("$php_file");
if ( $debug ) {
print "Downloading Real URL : $php_real_url";
} else {
$php_file_ext = $php_real_url;
$php_file_ext =~ s/.*(jpg|gif|png).*/$1/;
if ( $php_file_ext !~ /(jpg|gif|png)$/ ) {
print "\?";
} else {
$php_file_ext =~ s/^(\w).*/$1/;
chomp($php_file_ext);
print "$php_file_ext ";
}
}
system("wget -v $php_real_url");
}
print "...Done!\n";
} else {
print "No PHP Trick-Links To Unravel... Good\n";
}
chdir("$download_dir");
# Trying more sophisticated MD5 duplicate checking
print "Checking for exact duplicates MD5-Sum+Size\n";
system("/users/mgolvach/bin/dUpeDL");
chdir("$this_dir");
$counter++;
}
}

$|=1;

if ( $multi == 1 ) {
chdir("$this_dir");
rename("$multi_file", "${multi_file}.done");
system("tar cpf ${download_dir}.tar $download_dir");
}
exit(0);


---- dUpeDL - Based almost entirely on the findDupeFiles script by Cameron Hayne (macdev@hayne.net)

#!/usr/local/bin/perl

#
# dUpeDL - Based on the following script - only slightly modified to work with dURL
# Below: The original liner notes for full attribution to the original author.
#
# findDupeFiles:
# This script attempts to identify which files might be duplicates.
# It searches specified directories for files with a given suffix
# and reports on files that have the same MD5 digest.
# The suffix or suffixes to be searched for are specified by the first
# command-line argument - each suffix separated from the next by a vertical bar.
# The subsequent command-line arguments specify the directories to be searched.
# If no directories are specified on the command-line,
# it searches the current directory.
# Files whose names start with "._" are ignored.
#
# Cameron Hayne (macdev@hayne.net) January 2006 (revised March 2006)
#
#
# Examples of use:
# ----------------
# findDupeFiles '.aif|.aiff' AAA BBB CCC
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the directories AAA, BBB, and CCC
#
# findDupeFiles '.aif|.aiff'
# would look for duplicates among all the files with ".aif" or ".aiff" suffixes
# under the current directory
#
# findDupeFiles '' AAA BBB CCC
# would look for duplicates among all the files (no matter what suffix)
# under the directories AAA, BBB, and CCC
#
# findDupeFiles
# would look for duplicates among all the files (no matter what suffix)
# under the current directory
# -----------------------------------------------------------------------------

use strict;
use warnings;

use File::Find;
use File::stat;
use Digest::MD5;
use Fcntl;

#REMOVE WHEN WE MERGE - UNNECESSARY
my $debug=0;

my $matchSomeSuffix;
if (defined($ARGV[0])) {
my @suffixes = split(/\|/, $ARGV[0]);
if (scalar(@suffixes) > 0) {
my $matchExpr = join('||', map {"m/\$suffixes[$_]\$/io"} 0..$#suffixes);
$matchSomeSuffix = eval "sub {$matchExpr}";
}
shift @ARGV;
}

my @searchDirs = @ARGV ? @ARGV : ".";
foreach my $dir (@searchDirs) {
die "\"$dir\" is not a directory\n" unless -d "$dir";
}
my %filesByDataLength;

sub calcMd5($) {

my ($filename) = @_;
if (-d $filename) {
return "unsupported";
}
sysopen(FILE, $filename, O_RDONLY) or die "Unable to open file \"$filename\": $!\n";
binmode(FILE);
my $md5 = Digest::MD5->new->addfile(*FILE)->hexdigest;
close(FILE);
return $md5;
}

sub hashByMd5($) {

my ($fileInfoListRef) = @_;
my %filesByMd5;
foreach my $fileInfo (@{$fileInfoListRef}) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $md5 = calcMd5("$dirname/$filename");
push(@{$filesByMd5{$md5}}, $fileInfo);
}
return \%filesByMd5;
}

sub checkFile() {

return unless -f $_;
my $filename = $_;
my $dirname = $File::Find::dir;
return if $filename =~ /^\._/;
if (defined($matchSomeSuffix)) {
return unless &$matchSomeSuffix;
}
my $statInfo = stat($filename) or warn "Can't stat file \"$dirname/$filename\": $!\n" and return;
my $size = $statInfo->size;
my $fileInfo = { 'dirname' => $dirname,
'filename' => $filename,
};
push(@{$filesByDataLength{$size}}, $fileInfo);
}

MAIN: {

find(\&checkFile, @searchDirs);
my $numDupes = 0;
my $numDupeBytes = 0;
if ( $debug ) {
print "Dupe Checking\n";
} else {
print "Dupe Checking - ";
}
foreach my $size (sort {$b<=>$a} keys %filesByDataLength) {
my $numSameSize = scalar(@{$filesByDataLength{$size}});
next unless $numSameSize > 1;
if ( $debug ) {
print "size: $size numSameSize: $numSameSize\n";
}
my $filesByMd5Ref = hashByMd5($filesByDataLength{$size});
my %filesByMd5 = %{$filesByMd5Ref};
foreach my $md5 (keys %filesByMd5) {
my @sameMd5List = @{$filesByMd5{$md5}};
my $numSameMd5 = scalar(@sameMd5List);
next unless $numSameMd5 > 1;
my $rsrcMd5;
my $dupe_counter=0;
foreach my $fileInfo (@sameMd5List) {
my $dirname = $fileInfo->{dirname};
my $filename = $fileInfo->{filename};
my $filepath = "$dirname/$filename";
if ( $dupe_counter == 0 ) {
if ( $debug ) {
print "KEEPING $filepath - MD5 $md5\n";
}
$dupe_counter++;
} else {
if ( $debug ) {
print "DELETING $filepath - MD5 $md5\n";
} else {
print "D";
}
unlink("$filepath");
}
}
if ( $debug) {
print "----------\n";
}
$numDupes += ($numSameMd5 - 1);
$numDupeBytes += ($size * ($numSameMd5 - 1));
}
}
print "----------\n";
my $numDupeMegabytes = sprintf("%.1f", $numDupeBytes / (1024 * 1024));
print "Number of duplicate files: $numDupes\n";
print "Estimated Mb Savings: $numDupeMegabytes\n";
}



, Mike




Sunday, December 23, 2007

Using Linux strace To Debug Application Issues

Today's post is a little bit of a walkthrough of using RedHat Linux's strace to debug (and find the root cause of) a system issue. It's also a little bit of a caution about how much information you should really share with application project managers if you don't want to be stuck supporting a hacked solution for longer than anyone should be. As far as job security goes, it can't hurt, but certain situations really do require that folks upgrade their software to the vendor-supported version, no matter how creative your solution ends up being.

Note: If you're looking for an in-depth examination of strace, this post isn't it. Not to turn any readers off - we'll certainly be exploring that in a future post - but this is more of a walkthrough of a problem resolution involving strace than a dissertation on the command itself.

In this particular instance, we were working with a product (hereafter referred to as the Product, or just Product) that, to my knowledge, had just begun having "issues." That generally means to me that somebody did something they weren't supposed to, or was trying to. I'm cynical, but I try to have the common courtesy to keep it to myself ;)

The Product was a client/server application running on the Java platform that had suddenly begun dropping connections between the application server and the backend database. I did a quick check of the /proc/****/status file (and, of course, gave a quick nod to "top" and "netstat -an"), which confirmed that it was dropping quite a few connections and piling up threads in the process, like so:

host # cat /proc/14653/status | grep Threads
Threads: 311

host # top
<-- Truncated to just show the top process, which was the "failing" one.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14653 user 16 0 1634m 1.1g 40m S 120 14.1 296:46.54 java


I found the "total time used" (listed under the TIME+ column) somewhat alarming, since the process had only been up for about half an hour! It couldn't possibly be correct.

The next thing I did, since I like to go straight to the guts when I get brought in on an emergency issue involving a Product I've never used before, was to shut it down and restart it under strace, teeing the information off to a file in /tmp (since I like to watch the characters fly by), like so:

host # strace -vF ./product arguments 2>&1|tee /tmp/OUTPUT
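
As an aside, once the trace is tee'd off to a file, grep makes it easy to count how often any particular call or error turns up - which will come in handy in a minute. For instance (the file name just matches the tee above):

host # grep -c FUTEX_WAIT /tmp/OUTPUT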

Note that the tracing could also have been accomplished by starting the process normally, getting its PID from the ps table and adding the -p PID flag to the strace command in order to attach to the process after starting it. I prefer to start the process under strace, if I can, in case something critical happens during the initial load-up.
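
For reference, attaching after the fact would look something like this (using the PID from the top output above; -o writes the trace to a file instead of stderr, and -f follows any children the process forks):

host # strace -f -o /tmp/OUTPUT -p 14653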

Long story short (It is Sunday - Our day of rest - after all ;), I found that the Product process was dumping tons of FUTEX_WAIT errors. At this point, I checked the version of the OS with uname for the only part I really cared about at that point:

x86-64

The machine was running RedHat Advanced Server 5 - 64 bit. The FUTEX_WAIT errors have to do with the implementation of Fast Userspace Mutex locking implemented in the stable version of the RedHat Linux kernel 2.6 and up. This raised a red flag, and caused me to ask the inevitable question: When did this start happening?

As it turns out, the Product had previously been hosted on another machine running RedHat Advanced Server 4, 32 bit. This made perfect sense: the kernel-level mutex locking mechanism on the host it "used" to run on was completely different, and the Product had been built against that old-style locking, which wasn't compatible with the newer kernel's futexes.

Futile attempts were made to add lame fixes, like:

LD_ASSUME_KERNEL=2.4.whatever (Apologies; I don't remember what it was off-hand since it didn't end up helping at all ;)

to the Product's startup scripts, but this didn't resolve the problem. The "actual" (and correct) resolution would have been to upgrade to the vendor-supported version of the Product. The manufacturer of the Product actually had a more recent version available that was built specifically to address this issue and run on RedHat AS 5 - 64 bit.
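
Backing up a half-step, just for illustration's sake (and with a completely made-up version string, since the real one is long forgotten), that sort of band-aid usually gets wedged into the top of the startup script like so:

LD_ASSUME_KERNEL=2.4.19 <--- hypothetical version number
export LD_ASSUME_KERNEL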

And, here's the kicker (as per the cautionary message at the beginning of this post): As some of you may know, a bug in strace (it's actually more of a "feature," since it almost "has" to happen) slows a traced process down slightly, due to all the extra work strace does intercepting and parsing all the system calls, opens, reads, writes, errors, library loads, etc.

In our case, strace slowed the process down just enough that the FUTEX_WAIT errors stopped occurring and the Product began functioning normally again. We ended up adding strace to the init scripts after much debate over whether to just spend a few bucks on the upgrade and avoid total confusion in the future.

In retrospect, I'm glad I was able to help identify the issue and find the root cause. On the other hand, I wish there had been some way to identify the problem without handing the end user a cheap and quick "out." I'm sure that, in the future, someone else will have to work on this Product (maybe a problem with the version of Java, or something more insidious) and they'll be given about as much information as I was. Documentation was laid down, but I don't expect anyone will read it. Sometimes the solution is as bad as the problem, in a way or two ;)

Enjoy your Sunday :)

, Mike





Saturday, December 22, 2007

Working With Linux RPMs

This post is a continuation, of sorts, of my last one, and is aimed at a more general audience - most experienced admins will know this stuff already. Like I mentioned previously, I try to write this blog with an appreciation for what it was like when I first started out in the business. I owe my success to a great many patient and helpful people.

In this post, I wanted to hit on the basics of working with RPMs in Linux (RPM stands for RPM Package Manager - originally the Red Hat Package Manager; basically, RPMs are the software packages that make up your system). In later posts we'll go into some neat tricks, but for now we'll stick with the basics. Knowing the basics in any field of interest is invaluable in growing and mastering that skillset, just like knowing your ABC's can really help if you ever intend to read or write :)

Check the bottom for a recap of all the RPM options we're going to use and their literal meanings:

1. To display the basic information for any RPM, just type:

host # rpm -qi RPM_NAME - like:

host # rpm -qi bash
Name : bash Relocations: /usr
Version : 2.05 Vendor: Red Hat, Inc.
Release : 8.2 Build Date: Mon 28 Jun 2004 10:33:55 AM CDT
Install date: Thu 12 Jan 2006 01:25:27 PM CST Build Host: host.redhat.com
Group : System Environment/Shells Source RPM: bash-2.05-8.2.src.rpm
... EDITED OUT FOR BREVITY'S SAKE!


2. If you're not sure where to start with the above command, just have RPM spit out all the packages it knows about and pipe that to more, like so:

host # rpm -qa|more
redhat-logos-1.1.3-1
glibc-2.2.4-32.18
cracklib-2.7-12
dosfstools-2.7-1
gdbm-1.8.0-11
...
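
If you already have a rough idea of the name you're after, piping that same listing through grep gets you there quicker:

host # rpm -qa|grep -i bash
bash-2.05-8.2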


3. Now that you've figured out which package you want to inspect (note that you don't have to include the full name to get the information from RPM - the redhat-logos-1.1.3-1 package can be referred to simply as redhat-logos) and have gotten some basic information about it, you can list out all the files associated with the package like this:

host # rpm -ql bash
/bin/bash
/bin/bash2
/bin/sh
/etc/skel/.bash_logout
...


4. Here's one that doesn't require a lot of sample output, since it's somewhat of a re-explanation: you can add the -p flag to the examples in points 1 and 3 if you're querying an RPM package file, rather than the RPM database!

host # rpm -qip bash-2.05-8.2.i386.rpm <--- Listing out information for the RPM package itself.
host # rpm -qlp bash-2.05-8.2.i386.rpm <--- Listing out files associated with the RPM package itself.

5. Of course, you may find a file and want to know what RPM package it belongs to. You can get that by typing:

host # rpm -qif /etc
Name : filesystem Relocations: (not relocateable)
Version : 2.1.6 Vendor: Red Hat, Inc.
Release : 2 Build Date: Mon 20 Aug 2001 03:34:02 PM CDT
Install date: Thu 12 Jan 2006 01:24:41 PM CST Build Host: host.redhat.com
Group : System Environment/Base Source RPM: filesystem-2.1.6-2.src.rpm
... (Just as long as the description in point 1)


6. If you want to install a new RPM, you'll need the package file, and would run RPM like this:

host # rpm -i bash-2.05-8.2.i386.rpm

This isn't very interesting (which may be what you want - I don't care to look at verbose output "all" the time). You can spice it up by adding the -v and/or -h flags, like so:

host # rpm -ivh bash-2.05-8.2.i386.rpm

7. If you want to uninstall an RPM, you'll just need to know the abbreviated name, like I mentioned in point 3. You can also make this as verbose and visually entertaining as the system will allow with -v and/or -h:

host # rpm -e bash

Note that this command would return an error if you had multiple instances of the bash RPM installed. In that case, you could still abbreviate, but would have to include the version number. So you'd type

host # rpm -e bash-2.05-8.2

instead of just bash.
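
If you're not sure what's actually installed (and, therefore, what to feed rpm -e), a plain query will tell you:

host # rpm -q bash
bash-2.05-8.2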

So, to recap (and possibly explain anything I may have glossed over), these basic commands should get you started working with the RPM package management facility on Linux. The translations of the flags we've covered are as follows:

Major flags (usually the one immediately following the dash, although you can arrange the flags in whatever order you choose - just be careful - see the note on the minor i flag below):

q = query
i = install
e = remove/uninstall

Minor flags:

i = information (not the same as the major flag i - and, of course, you'll probably never use -ii or -ei, as those combinations would be redundant and contradictory, respectively)
a = all
l = list
p = RPM package file (e.g. whatever.rpm)
f = file
v = verbose
h = hash (prints lots of # symbols while it completes your request :)
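
Just to drive home the point about combining and re-ordering flags, this single query gets you points 1 and 3 in one shot:

host # rpm -qil bash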

Enjoy getting started working with RPM packages. They're one of the foundations of the Linux operating system. In fact, a combination of certain packages actually "is" the operating system. Knowing how to manipulate them and have them work for you can make it easier to explore many other things (like new software you've always wanted to install and try out :)

Best wishes,

, Mike





A Few Linux Networking Tips

Here's a little something for those of us who use Linux (the place I work uses RedHat primarily) on a day-to-day basis. Lots of shops these days are switching from the more expensive solutions, like those offered by Sun and HP, to cut the cost of deployment and maintenance, which opens up a great pathway into the Linux administration field for folks who are eager to learn. I can tell you, after working all angles of the *nix arena for a decade or so, that there's nothing more grating than having to fill out a form and wait for someone else to do something you're supposedly being paid to be good at. I'm not going to sneeze at my paycheck, of course, but I do worry that, if I work at too many big corporations with big contracts, my mind will atrophy and I'll die a slow, miserable death long before my body gives out ;)

In this post, I just wanted to touch on some Linux networking basics. I try to write this blog for users of all skill levels and I think some of my posts assume a lot of pre-knowledge. This isn't a "for dummies" site, by any means (I've never understood how they marketed that series of books so well. I've tried flat-out telling people they're idiots for most of my natural life and it's never ended well ;). That being said, I'm hoping this blog attracts folks with a wealth of experience and is also accessible to those new to the field.

So, today, I'm going to touch on some Linux networking tips at a basic level. As with everything, over time, I'll dive into these subjects in greater detail. But for now, we'll get to the meat of the post. A lot of times, I find, it makes more sense to know what to do and understand it later, rather than the opposite. To that end (This is all pretty much RedHat specific - please don't try this on a Sun box without a healthy ego - the error messages can be blunt ;) here we go:

1. Adding a default router:

You can generally do this one of two different ways:

If the NIC is already configured and UP, all you need to do is use the route command, like so (note that 127.0.0.1 is just a stand-in here - substitute your actual default router's IP):

host # route add default gw 127.0.0.1 dev eth0
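
You can double-check your handiwork immediately afterward with either of these old standbys:

host # route -n
host # netstat -rn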

If you want to add a default route permanently, you'll just need to add this line to the existing /etc/sysconfig/network configuration file:

Note that the NETWORKING and HOSTNAME variables should already be in there (If not, assign them values of yes and "whatever your hostname is", respectively. Also, your network may not be up ;)

GATEWAY=127.0.0.1

Of course, if you prefer, you can always use the /etc/init.d/network script to bounce your NIC's and routes (You set these up in the /etc/sysconfig/network and /etc/sysconfig/network-scripts/ifcfg-eth0, etc, scripts). You can also use the service command to bring up your network, bring it down or restart it (which will, again, re-read your configuration).
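
For example, on a stock RedHat box, any of these should do the trick:

host # service network stop
host # service network start
host # service network restart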

2. Displaying your NIC's device driver settings:

This is most commonly done with a command called ethtool. To get your NIC's settings you could do the following:

host # ethtool -i eth0
driver: tg3
version: 3.10u6
firmware-version:
bus-info: 05:01.0


You can then use this info to help with problem solving. For instance, if eth0 isn't coming up correctly, perhaps eth0 doesn't have a proper alias set up in /etc/modules.conf, like:

alias eth0 tg3 <--- This line tells us that eth0 is actually an alias referring to the tg3 device driver.

If you look in there and find this instead:

alias eth0 e1000

you've found the problem right there!

3. Display your NIC's Speed, Duplex and Negotiation settings (also with ethtool):

This one is just as simple as the command above (ethtool is much nicer than using old-style ifconfig, netstat and/or kstat - although they all have their virtues and are necessary depending on how old your Linux distro is)

host # ethtool eth0
Settings for eth0:
Cannot get device settings: Resource temporarily unavailable
Supports Wake-on: g
Wake-on: d
Current message level: 0x000000ff (255)
Link detected: no


Looks like the link's not up for that one! It's just as easy to spot if all's well (You can sometimes tell from a good distance away ;):

host # ethtool eth1
Settings for eth1:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 1000baseT/Half 1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: d
Current message level: 0x000000ff (255)


4. Now to configure, or initialize, those settings, we'll use ethtool as well:

Assuming we found out that the device associated with eth0 didn't have the cable connected, and we've got that all set up (along with correcting the alias in /etc/modules.conf, if that was wrong), we could fix it all by doing something like this:

host # ethtool -s eth0 speed 1000 duplex full autoneg off

You can add as many option/value pairs as you need (e.g. speed 1000) to get the job done. Don't forget to update the /etc/sysconfig/network and /etc/sysconfig/network-scripts/ifcfg-eth0, etc, files with your settings. If you don't, you'll have to do this every time the machine reboots!
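
On the RedHat releases I've worked with, the ifcfg file is where a line like the following would go (a sketch only - support for this variable varies a bit from version to version, so check your distro's docs):

ETHTOOL_OPTS="speed 1000 duplex full autoneg off"

With that in place, the settings get re-applied every time the network scripts bring the interface up.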

Hope this has helped you out some or, at least, helped you get started taking on the Linux world :)

Cheers,

, Mike