Tuesday, October 23, 2007

Only Printing Out Matches With Perl Regular Expressions

Hey there,

This post covers something every shell/Perl scripter runs into from time to time: how do I extract "whatever" from this file and output just my match?

We'll assume you have a logfile that consists of lines like this:

<td class="prompt" height="15" width="30%"><span><td class="prompt" height="15" width="30%">Text-Text_Text</td>
<td class="tablecontent" width="70%">09/23/08 </td></span>

and you just want to get the date.

It's relatively easy to do (and this case isn't much of a challenge, since the date format is so different from everything else on the lines).

First you need to figure out what you're not going to print out. So, after you read the file into an array and are iterating through it (or however you prefer to process your files with Perl), you'll first want to put this line in your loop.

This will make sure that you don't bother printing out any lines that don't contain your match:

next if $match !~ /\d\d\/\d\d\/\d\d/;

This says: skip to the next line without doing anything if the pattern \d\d\/\d\d\/\d\d isn't on the line. Each \d matches any single digit, and each \/ is a literal forward slash (backslashed because the forward slash is also the delimiter of the regular expression).
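Here's a minimal sketch of that filter in a loop, just to show it in context. The @lines and @kept arrays are names made up for this example, and the two sample strings are taken from the logfile excerpt above. It also writes the pattern with m{...} delimiters, which is one way to avoid backslashing the forward slashes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines from the logfile excerpt (hypothetical stand-in for a real file)
my @lines = (
    '<td class="prompt" height="15" width="30%">Text-Text_Text</td>',
    '<td class="tablecontent" width="70%">09/23/08 </td></span>',
);

my @kept;
foreach my $match (@lines) {
    # Skip any line that doesn't contain an NN/NN/NN date.
    # With m{...} as the delimiter, "/" needs no backslash.
    next if $match !~ m{\d\d/\d\d/\d\d};
    push @kept, $match;
}

print scalar(@kept), " line(s) kept\n";
```

Only the second sample line survives the filter, since it's the only one with a date on it.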

Then you'll want to print out only the date for all the lines that contain it, with the following line:

$match =~ s/^.*?(\d\d\/\d\d\/\d\d).*$/$1/; print $match;

Note that the $1 on the right-hand side of the substitution holds whatever was captured by the parentheses on the left-hand side. We've basically replaced the entire line with the date :) And just so we see some output, we print the contents of $match (which is now just the date).
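To see both steps working together, here's a rough sketch wrapped up as a small subroutine. The name extract_dates and the @sample array are made up for this example and aren't part of the original script. The pattern here has no literal spaces around the capture group, since in the sample line the date directly follows a ">":

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_dates is a hypothetical helper name used only for this sketch.
# It takes a list of lines and returns the dates found on them.
sub extract_dates {
    my @dates;
    foreach my $line (@_) {
        # Work on a copy so the caller's data isn't modified
        chomp(my $match = $line);

        # Step 1: skip lines that have no NN/NN/NN date
        next if $match !~ /\d\d\/\d\d\/\d\d/;

        # Step 2: replace the whole line with the captured date
        $match =~ s/^.*?(\d\d\/\d\d\/\d\d).*$/$1/;
        push @dates, $match;
    }
    return @dates;
}

# Sample lines from the logfile excerpt
my @sample = (
    '<td class="prompt" height="15" width="30%">Text-Text_Text</td>',
    '<td class="tablecontent" width="70%">09/23/08 </td></span>',
);

print "$_\n" for extract_dates(@sample);   # prints 09/23/08
```

Returning the dates in a list instead of printing them inside the loop is just a design choice for the sketch; it makes it easy to sort or count them later if you need to.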

Simple enough :)

, Mike