Saturday, May 17, 2008

Ignoring All Standard Characters Using Perl In Linux Or Unix

Hey there,

Every once in a while, you have to reverse the order of your thinking. For today's example of code to do cherry-picking of values, we'll demonstrate just that. In most scripts, you're either looking for a specific set of values (or a relatively specific set), or you're trying to ignore them. Unix and/or Linux will attempt to print any character it finds in a script if you ask it to (even if it's, technically, unrepresentable, like a control-character sequence), which can make for some interesting output. Here, we're going to look for everything that we don't want to find (??? ;)

Basically, our script attached to today's post is going to look for every character in a given file that counts as a "special" character. And, by special, I mean goofy :) Since we have no idea what kind of insane characters we might not want to see, we have to begin by defining everything that we know and excluding all of that, so that the only things left to match are the things we don't know about.
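To make that "exclude what you know" idea concrete, here's a minimal sketch of a negated character class. The known set is deliberately tiny (letters, digits and whitespace only), and has_unknown is just a name made up for this illustration, not something from the attached script:

```perl
use strict;
use warnings;

# Hypothetical helper: true when the line holds at least one character
# outside the known set. [^...] negates the class, so the regex matches
# anything that is NOT a letter, a digit or whitespace.
sub has_unknown {
    my ($line) = @_;
    return $line =~ /[^A-Za-z0-9\s]/ ? 1 : 0;
}

# \xFF, \xBE and \xE5 stand in for the odd characters in a "goofy" line
for my $line ("hey there\n", "little \xFF\xBE\xE5b\xFF\n") {
    printf "%-9s%s", (has_unknown($line) ? "unknown:" : "clean:"), $line;
}
```

Everything you add inside the brackets shrinks the pool of characters that can still trigger a match, which is exactly the reversal described above.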

This starts out simple, of course. We know we want to ignore the alphabet (upper and lower case) and the regular set of numerals. The next step is relatively simple as well: We know we want to ignore all the other "normal" characters. If you recall from our post on generating all possible passwords using Perl, there are 94 regular characters (including the alphabet and numbers, noted already) that we need to ignore, plus simple stuff like spaces, tabs, newlines, carriage returns and bells. There may be more... we'll never know until we don't ignore it ;)
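If you want to double-check that count of 94, here's a quick sketch. It assumes the "regular" characters are the printable ASCII range from code point 33 ('!') through 126 ('~'), which is my reading of the number rather than anything spelled out above:

```perl
use strict;
use warnings;

# Assumption: "regular" means printable ASCII minus the space,
# i.e. code points 33 ('!') through 126 ('~')
my @regular = map { chr($_) } 33 .. 126;

# The extra whitespace and control characters we also want to ignore
my %extras = (
    'space'           => " ",
    'tab'             => "\t",
    'newline'         => "\n",
    'carriage return' => "\r",
    'bell'            => "\a",
);

printf "%d regular characters: %s\n", scalar(@regular), join('', @regular);
printf "%d extras: %s\n", scalar(keys %extras), join(', ', sort keys %extras);
```

The 94 break down into 52 letters, 10 digits and 32 punctuation characters, and that last group is where all the escaping trouble comes from.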

The trickiest part of script-work like this is the Hell of backslash-escaping that you'll inevitably get caught up in. Hopefully, the script we've attached today will help you out in that regard. If you feed this script any file, it should print out only the lines with "bizarre" characters in them.
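One way to dodge most of that escaping, sketched here as an alternative and not as what the attached script does, is to let Perl's built-in quotemeta function add the backslashes for you. The known set below (printable ASCII plus tab, newline, carriage return and bell) is an assumption, and is_strange is a made-up name for the example:

```perl
use strict;
use warnings;

# Assumption: the known set is printable ASCII (space through ~)
# plus tab, newline, carriage return and bell
my @known = map { chr($_) } 32 .. 126;
push @known, "\t", "\n", "\r", "\a";

# quotemeta() puts a backslash in front of every non-word character,
# so nothing has to be escaped by hand inside the character class
my $class      = join '', map { quotemeta($_) } @known;
my $unknown_re = qr/[^$class]/;

# Hypothetical helper: true when the line has a character outside the known set
sub is_strange {
    my ($line) = @_;
    return $line =~ $unknown_re ? 1 : 0;
}

for my $line ("hey there\n", "little \xFF\xBE\xE5b\xFF\n") {
    print $line if is_strange($line);
}
```

The trade-off is that the regex is no longer visible at a glance, so the hand-written class is still the better teaching tool even if this version is harder to get wrong.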

For example, given a file with these contents:

hey there
little ÿ¾åbÿ

Running this Perl script will produce the following result (printing only lines with strange characters in them):

host # ./ FILE
little ÿ¾åbÿ

Here's to finding out what you don't know :)


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License


#!/usr/bin/perl

# - only print out lines with unknown chars
# 2008 - Mike Golvach -
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License

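if ( $#ARGV != 0 ) {
    print "Usage: $0 FileName\n";
    exit(1);
}

$filename = $ARGV[0];

# Bail out right away if the file can't be read
open(FILE, "<", $filename) or die "Can't open $filename: $!\n";
while (<FILE>) {
    # Match any character outside the known set. \s already covers spaces,
    # tabs, newlines and carriage returns, and \a is the bell. The $, % and @
    # are escaped so that Perl can't treat them as variables inside the regex.
    if ( $_ =~ /[^A-Za-z0-9\s\a`\-=\[\]\\;\',\.\/~!\@#\$\%^&\*\(\)_+\{\}\|:\"<>\?]/ ) {
        print $_;
    }
}
close(FILE);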

, Mike