Tuesday, September 9, 2008

Finding Overlapping Matches Using Perl's Lookahead Assertion Matching On Linux and Unix

Hey there,

Here's another topic that relates to our larger series on
number pools and guaranteed combinations within fixed lists
while still being worthy of having its own post. Regular expression matching with Perl for Linux or Unix is fairly simple at its most basic (as are most implementations of regular expression matching). Taken to its most remedial level, you can use a regular expression as a poor man's "grep" (and, oddly enough, the "re" in grep actually stands for regular expression). Although this use of regular expression pattern matching does have its place, it doesn't really merit use outside of tools designed to apply it in that manner and will suck all the joy right out of creating your own regular expressions. In many instances, a well thought-out regular expression can convince most non-technical people in the room that you're a computer genius who's brain possesses more synapses, forming more bridges and firing more rapidly than anyone's ever should. If you show someone something like "/word/" they're far less likely to be impressed ( unless they consider putting letters in between delimiters a Herculean challenge ;)

If you just up your game a little, you begin to see the power of regular expressions in Perl, and - every once in a while - find yourself in a state of utter confusion. Ironically, this is because most regular expression engines give you more than enough rope to get the job done (or is it just enough to hang yourself with? That's for later. We'll try to keep this post positive :)

Regular expression matching within Perl makes it very simple to find out if a certain pattern might appear within a certain expression. For instance, given the expression:

123456abc123456abc

You could construct a Perl regular expression to find out how many times the pattern "123456" occurs in that expression and then print all of those occurrences out in a very short script that you could just type at the command line with the -e switch (and the ending "g" option to your /REGEXP/ definition, which indicates that Perl should match the REGEXP as many times as it possibly can within the given expression - with some limitations, as we'll see below ):

host # perl -e '$variable = "123456abc123456abc"; @variable = $variable =~ /123456/g;print "@variable\n";'
123456 123456


And you can do the same thing to match all contiguous 2 digit numerals in your expression. This is where you'll get a result that you, perhaps, wouldn't expect or don't want:

host # perl -e '$variable = "123456abc123456abc"; @ variable = $variable =~ /\d\d/g;print "@variable\n";'
12 34 56 12 34 56


While this is correct, per Perl's (and most other) regular expression parsing rules, you'll note that your expression only returns six sets of double digits (In order: 12, 34, 56, 12, 34, 56). This simple match functionality matches two digits and then moves on to the next two; effectively forgetting the 2 digits that bridge the first 2 digits to the second 2 digits. This is totally normal behaviour. Once the first 2 digits are matched, they are removed from the match pool and the process repeats on the, now, shorter string.

However, if you want (or need) to know how many contiguous 2 digit numerals appear in your expression, you would want to see output like this:

12 23 34 45 56 12 23 34 45 56

demonstrating that there are actually 10 matches on a regular expression looking for contiguous double digits within your expression. And, this can be done!

Using what is commonly referred to as a lookahead expression (or lookahead assertion), you can match all contiguous 2 digit numbers in your expression "correctly" (by which we mean that the overlapping matches won't get swallowed up or lost in the matching process). This command line will do just that for you:

host # perl -e '$variable = "123456abc123456abc"; @ variable = $variable =~ /(?=(\d\d))/g;print "@variable\n";'
12 23 34 45 56 12 23 34 45 56


This sort of match is also called a zero-width assertion, meaning that the parentheses surrounding your regular expression (i.e. (?=YourRegularExpression)) don't do any capturing. In fact the "(?=)" part of the regular expression is built into Perl and defined as a zero width lookahead assertion. This means you can use it to match a pattern based on what comes after it. For instance the zero width lookahead assertion:

(?=\d)

would match a numeric character. But, it would specifically match when it was "following" something else. So if you had a string "abcdef3", this pattern - "(\w(?=\d))" - would match any alpha character followed by a numeric:

host # perl -e '$variable = "abcdef3"; @ variable = $variable =~ /(\w(?=\d))/;print "@variable\n";'
f

and, yes, "f" is the only character in our string followed by a number.

So, basically, our expression "(?=(\d\d))" is matching any zero width (starting from zero and incrementing by one as many times as possible - because of the "g" we tag on to the end of our matches to make them "greedy" ;) pattern followed by 2 digits or numeric characters. You can prove that the lack of a character to "lookahead" from (as in our expression "(?=(\d\d))" ) will default to zero, by seeing what happens if you tell the pattern that you're looking for two digits following a single digit with "\d(?=(\d\d))", like so:

host # perl -e '$variable = "123456abc123456abc"; @ variable = $variable =~ /\d(?=(\d\d))/g;print "@variable\n";'
23 34 45 56 23 34 45 56


You only get a return of 8 sets, instead of 10, on this match because you've stated that the first double digit match must follow a single digit. Knowing what you know now (without getting into negative lookahead assertions or lookbehind/negative-lookbehind assertions), it should be fairly simple for you to determine the maximum amount of sets you can generate from a given pool without worrying about normal regular expression loss. You can say, with certainty, that the string "123456abc123456abc" contains 12 single digits, 10 sets of double digits, 8 sets of triple digits, 6 sets of quadruple digits, 4 sets of quintuple digits and 2 sets of sextuple digits.

Below is a quick table showing you those results. Note that rather than just adding another "\d" to each line, which would make it spill over the edges of the page (if it hasn't already), we use the "{n}" notation, where "n" equals both the highest and lowest number of occurrences of your pattern (you can use {n,m} notation to indicate lowest (n) and highest (m), but using {n} is the equivalent of using {n,m} with both "n" and "m" being identical and is, therefore, more economical ...quite unlike this sentence ;)

host # perl -e '$variable = "123456abc123456abc"; @ variable = $variable =~ /(?=(\d{1}))/g;print "@variable\n";' >-- The {1} isn't necessary, but we left it in there for consistency's sake.
1 2 3 4 5 6 1 2 3 4 5 6
host # perl -e '$variable = "123456abc123456abc"; @ variable = $variable =~ /(?=(\d{2}))/g;print "@variable\n";'
12 23 34 45 56 12 23 34 45 56
host # perl -e '$variable = "123456abc123456abc"; @ variable = $variable =~ /(?=(\d{3}))/g;print "@variable\n";'
123 234 345 456 123 234 345 456
host # perl -e '$variable = "123456abc123456abc"; @ variable = $variable =~ /(?=(\d{4}))/g;print "@variable\n";'
1234 2345 3456 1234 2345 3456
host # perl -e '$variable = "123456abc123456abc"; @ variable = $variable =~ /(?=(\d{5}))/g;print "@variable\n";'
12345 23456 12345 23456
host # perl -e '$variable = "123456abc123456abc"; @ variable = $variable =~ /(?=(\d{6}))/g;print "@variable\n";'
123456 123456


If you've checked out yesterday's post on determining maximum pool sets and combine it with this primitive "overlap matching" method of weighted distribution accounting, you can see that our Objective is beginning to come into sight :)

Cheers,

, Mike




Please note that this blog accepts comments via email only. See our Mission And Policy Statement for further details.