Friday, September 12, 2008

Maximizing Set Match Probability Using Perl On Linux Or Unix

Hey There,

As promised (I think ;), we're back today with some code (not the entire script that encompasses the last 2 weeks' worth of posts, of course) to grind out the final concept in this string of surprisingly wide-ranging topics, all centered around the concepts of number pools and guaranteed matches. We've covered a lot of different aspects of mathematics, regular expression matching, etc., just moving toward our Objective. So, aside from the link to our initial introductory number pools post, we'll just suffice it to say that every post since then (excluding the weekend humour) has been directly related to this mini-series' stated Objective. I'd love to label this paragraph with a whole slew of hyperlinks, but I'm going to pretend you're like me and don't want to get zapped to some other page every 5 seconds just because your finger twitches while you're reading this column ;) You should be able to access any of the older posts via the side menu or by just clicking on the "older posts" link at the bottom of the post itself.

BTW, thanks to an anonymous reader who pointed out that I accidentally goofed when I was manually typing out the distribution of number sets in yesterday's post. The line, in the first grid, that read:

12478 124 127 128 147 148 178 247 248 258 358
should have read:

12478 124 127 128 147 148 178 247 248 278 478

I got sloppy while I was doing the manual typing of the 3-digit strings and accidentally re-typed the last 2 3-digit sets from the previous line. Hopefully, I've fixed it by the time you read this ;)

As I worked out this last part (the brain teaser being directly related here), I found that, although I could squeeze all 3-digit combinations of our 8 numbers into 8 5-digit lists of numbers, the ratio of match to payoff was fairly low (depends on what you're looking for, I suppose) as laid out in the statistics below:

FORMAT: List of 8 5-Digit Sets

12346 12358 12478 13678 14567 23457 25678 34568

FORMAT: Approx % of set pool == # of times matched == # of sets == list of 3-digit matches
43% == 2 == 24 == 123 124 128 136 138 146 147 167 178 234 235 247 257 258 278 345 346 358 368 456 457 567 568 678
57% == 1 == 32 == 125 126 127 134 135 137 145 148 156 157 158 168 236 237 238 245 246 248 256 267 268 347 348 356 357 367 378 458 467 468 478 578


Of course, the idea here is to get the maximum number of unique matches in a 5-digit set, but it would be nice to have more than a single or double return on each hit (and a greater weighted percentage on the 3-digit sets that match more than once in different lines on the grid), so that the statistical difference is more in your favour. (Note: This post is not about the Lottery. We neither recommend nor discourage you from gambling. It's not our place. If you must gamble, only bet what you can afford to lose. That's just good advice all around ;) This is pretty bad, with a 43%/57% ratio of above-single-match-hits vs. single-match-hits.
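
For anyone who wants to check a grid like this by machine rather than by hand, here's a rough sketch (not the script from this series). It assumes a 3-digit set "matches" a 5-digit set whenever all three of its digits show up in it; contains_all, %times_matched and %by_count are names made up purely for the illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The eight 5-digit sets from the grid above.
my @five_digit_sets = qw(12346 12358 12478 13678 14567 23457 25678 34568);

# Build every 3-digit combination of the digits 1..8 (56 in all).
my @tri_set_pool;
for my $x (1 .. 6) {
    for my $y ($x + 1 .. 7) {
        for my $z ($y + 1 .. 8) {
            push @tri_set_pool, "$x$y$z";
        }
    }
}

# Hypothetical helper: true if every digit of $triple appears in $set.
# ASSUMPTION: this is what "matched" means for the tally.
sub contains_all {
    my ($set, $triple) = @_;
    for my $digit (split //, $triple) {
        return 0 if index($set, $digit) < 0;
    }
    return 1;
}

# Count how many of the 5-digit sets contain each 3-digit set.
my %times_matched;
for my $triple (@tri_set_pool) {
    $times_matched{$triple} = grep { contains_all($_, $triple) } @five_digit_sets;
}

# Group the 3-digit sets by match count, highest count first.
my %by_count;
push @{ $by_count{ $times_matched{$_} } }, $_ for @tri_set_pool;

for my $count (sort { $b <=> $a } keys %by_count) {
    my @triples = @{ $by_count{$count} };
    printf "%.0f%% == %d == %d == %s\n",
        100 * scalar(@triples) / scalar(@tri_set_pool),
        $count, scalar(@triples), "@triples";
}
```

It prints one line per match count, in the same "% == times matched == number of sets == list" layout used above. Swap in any other list of 5-digit sets to tally that grid instead.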

The following Perl code (runnable on any Linux or Unix distro that supports Perl 5.6 and up -- the behaviour of hashes changed slightly around the 5.6 release) iterates through the 56 5-digit lists we started getting into in our posts on sorting Perl lists and finding the maximum number of pool sets using binomial coefficient theory. The 5-digit list foreach-loop is wrapped in another foreach-loop that iterates through all the possible 3-digit sets (since we want to match 3 digits minimum) and populates a hash with keys and values to determine which lists (in our 56-member 5-digit set list) are viable for matching all possible 3-digit matches and returning the highest probable yield. You'll notice that the output is a little skewed toward the lower numbers, because each 3-digit set is credited to the first 5-digit list that contains it and the inner loop then bails out:

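# @tri_set_pool and @five_digit_pool are assumed to be populated already
$|=1;    # unbuffered output
# Credit each 3-digit set to the first 5-digit list that contains all 3 digits
foreach $smaller (@tri_set_pool) {
    @smaller = split("",$smaller);
    foreach $larger (@five_digit_pool) {
        if ( $larger =~ $smaller[0] && $larger =~ $smaller[1] && $larger =~ $smaller[2] ) {
            $total_pool{$larger} .= $smaller;
            last;    # stop at the first hit; this is what skews results low
        }
    }
}
# Print every 5-digit list that was credited with at least one 3-digit set
$counter = 1;
foreach $key (sort keys %total_pool ) {
    print "Game $counter: $key\n";
    $counter++;
}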
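
The loops above take @tri_set_pool and @five_digit_pool as given. If you'd rather generate them than type them out, something along these lines should do it. This is only a sketch: combinations() is a hypothetical helper, not something from the script in this series, and it assumes the pool is the digits 1 through 8:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every $k-element combination of @items,
# each as an array ref, keeping the items in the order they were given.
sub combinations {
    my ($k, @items) = @_;
    return ([]) if $k == 0;          # exactly one way to pick nothing
    return ()   if @items < $k;      # not enough items left to pick from
    my ($first, @rest) = @items;
    return (
        # combinations that include $first ...
        (map { [ $first, @$_ ] } combinations($k - 1, @rest)),
        # ... followed by combinations that skip it
        combinations($k, @rest),
    );
}

# ASSUMPTION: the number pool is the digits 1..8.
my @digits = (1 .. 8);
my @five_digit_pool = map { join '', @$_ } combinations(5, @digits);
my @tri_set_pool    = map { join '', @$_ } combinations(3, @digits);

printf "%d five-digit sets, %d three-digit sets\n",
    scalar(@five_digit_pool), scalar(@tri_set_pool);
```

With 8 digits both pools come out at 56 members, since choosing 5 of 8 and choosing 3 of 8 give the same binomial coefficient.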


Note that this code gives us 20 5-digit match strings (up from our original 8), but a disproportionately better chance of a larger return (closer to the beginning, as noted above, which could be counted as a flaw ;)

FORMAT: List of 20 5-Digit Sets
12345 12346 12347 12348 12356 12357 12358 12367 12368 12378 12456 12457 12458 12467 12468 12478 12567 12568 12578 12678

FORMAT: Approx % of set pool == # of times matched == # of sets == list of 3-digit matches
1% == 11 == 1 == 125
6% == 10 == 3 == 123 124 128
1% == 8 == 1 == 127
1% == 5 == 1 == 126
50% == 4 == 28 == 134 135 136 137 138 145 146 147 148 156 157 158 178 234 235 236 237 238 245 246 247 248 256 257 258 267 268 278
4% == 3 == 2 == 167 168
37% == 1 == 20 == 345 346 347 348 356 357 358 367 368 378 456 457 458 467 468 478 567 568 578 678


So, that's obviously a much better distribution of multiple 3-digit match occurrences (i.e., if you matched 123, you'd match it 10 times!). This tips things back to good, with a 63%/37% ratio of above-single-match-hits vs. single-match-hits.

Finally, one simple addition to the code, a reversed 5-digit list array, lets us attack the 3-digit matches against both arrays and then strip off all 5-digit numbers that picked up too few matches (in the code below, two 3-digit sets or fewer; note that this does not mean that no single match occurrences were found for any given 3-digit list!). This gives us the best (as in most even) odds so far, although possibly not the best return on a win (statistics below the additional code following; pardon the cheap editor tricks ;) This is another 20-list return, but with favourable-match odds that make it worth going over the modest 8-list list we started out with. Note, also, that this revision takes out the humungous skew between beginning and end (from 11 matches to 1 match) that our last revision created:

@five_digit_pool = qw(12345 12346 12347 12348 12356 12357 12358 12367 12368 12378 12456 12457 12458 12467 12468 12478 12567 12568 12578 12678 13456 13457 13458 13467 13468 13478 13567 13568 13578 13678 14567 14568 14578 14678 15678 23456 23457 23458 23467 23468 23478 23567 23568 23578 23678 24567 24568 24578 24678 25678 34567 34568 34578 34678 35678 45678);
@rev_five_digit_pool = qw(45678 35678 34678 34578 34568 34567 25678 24678 24578 24568 24567 23678 23578 23568 23567 23478 23468 23467 23458 23457 23456 15678 14678 14578 14568 14567 13678 13578 13568 13567 13478 13468 13467 13458 13457 13456 12678 12578 12568 12567 12478 12468 12467 12458 12457 12456 12378 12368 12367 12358 12357 12356 12348 12347 12346 12345);
@tri_set_pool = qw(123 124 125 126 127 128 134 135 136 137 138 145 146 147 148 156 157 158 167 168 178 234 235 236 237 238 245 246 247 248 256 257 258 267 268 278 345 346 347 348 356 357 358 367 368 378 456 457 458 467 468 478 567 568 578 678);

$|=1;    # unbuffered output
# Pass 1: credit each 3-digit set to the first 5-digit list, lowest first, that contains it
foreach $smaller (@tri_set_pool) {
    @smaller = split("",$smaller);
    foreach $larger (@five_digit_pool) {
        # Uncomment to trace every comparison:
        # print "$larger =~ $smaller[0] && $larger =~ $smaller[1] && $larger =~ $smaller[2]\n";
        if ( $larger =~ $smaller[0] && $larger =~ $smaller[1] && $larger =~ $smaller[2] ) {
            $total_pool{$larger} .= $smaller;
            last;
        }
    }
}
# Pass 2: same again, walking the 5-digit lists from highest to lowest
foreach $smaller (@tri_set_pool) {
    @smaller = split("",$smaller);
    foreach $larger (@rev_five_digit_pool) {
        # print "$larger =~ $smaller[0] && $larger =~ $smaller[1] && $larger =~ $smaller[2]\n";
        if ( $larger =~ $smaller[0] && $larger =~ $smaller[1] && $larger =~ $smaller[2] ) {
            $total_pool{$larger} .= $smaller;
            last;
        }
    }
}
# Each credited 3-digit set adds 3 characters to the hash value, so a length
# of 6 or less means the 5-digit list picked up two sets at most: skip those
$counter = 1;
foreach $key (sort keys %total_pool ) {
    next if length($total_pool{$key}) <= 6;
    print "$counter: $key -- $total_pool{$key}\n";
    $counter++;
}
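
Since the second pass is just the first pass run against a reversed list, both can go through one routine, with Perl's built-in reverse standing in for the hand-typed @rev_five_digit_pool. What follows is a sketch under assumptions, not the script from this series: assign_first_match is a made-up name, the pools are rebuilt with a regex filter only to keep the example self-contained, and the hash holds arrays of 3-digit sets rather than one concatenated string, so the filter counts elements instead of characters:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Any string of strictly increasing digits drawn from 1..8 matches this
# pattern, so filtering a numeric range with it yields the combinations
# in sorted order.
my $increasing = qr/^1?2?3?4?5?6?7?8?$/;
my @five_digit_pool = grep { /$increasing/ } 12345 .. 45678;
my @tri_set_pool    = grep { /$increasing/ } 123 .. 678;

# Hypothetical helper: for each 3-digit set, credit the first 5-digit set
# (in the order given) that contains all three of its digits.
sub assign_first_match {
    my ($pool, $fives, $triples) = @_;
    for my $triple (@$triples) {
        my @digits = split //, $triple;
        for my $five (@$fives) {
            # Skip this 5-digit set if any digit is missing from it.
            next if grep { index($five, $_) < 0 } @digits;
            push @{ $pool->{$five} }, $triple;
            last;
        }
    }
}

my %total_pool;
assign_first_match(\%total_pool, \@five_digit_pool, \@tri_set_pool);
assign_first_match(\%total_pool, [ reverse @five_digit_pool ], \@tri_set_pool);

# Keep only the 5-digit sets credited with more than two 3-digit sets,
# mirroring the "length <= 6" test in the listing above.
my @kept = grep { @{ $total_pool{$_} } > 2 } sort keys %total_pool;

my $counter = 1;
for my $key (@kept) {
    print $counter++, ": $key -- @{ $total_pool{$key} }\n";
}
```

Keeping the 3-digit sets in an array means the cut-off reads as "more than two sets" rather than "longer than 6 characters", and the reversed list can never drift out of step with the forward one.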


FORMAT: List of 20 5-Digit Sets
12345 12346 12347 12348 12356 12357 12358 12367 12368 12378 12678 13678 14678 15678 23678 24678 25678 34678 35678 45678

FORMAT: Approx % of set pool == # of times matched == # of sets == list of 3-digit matches
64% == 2 == 36 == 123 126 127 137 138 146 147 148 156 157 158 167 168 178 236 237 238 246 247 248 256 257 258 267 268 278 346 347 348 356 357 358 367 368 378 678
36% == 1 == 20 == 124 125 128 134 135 136 145 234 235 245 345 456 457 458 467 468 478 567 568 578


So, with this 64%/36% ratio of above-single-match-hits vs. single-match-hits, we've got about the same statistical odds as our last rendition. This version should get you above a single-match more often (since the other leaves you in single-match territory on all sets starting with 3 and up), but it does dampen your possible return (you'll never hit a 3-digit match that comes up more than twice).

Depending upon how you're using this sort of probabilistic determination, one version may appeal more to you than the other. (And, again, if you use this to spread lottery picks, you still stand a really good chance of not having any of the numbers in your set match what they pick out of the air. I look at the lottery as straight 50% odds. They'll either pick my numbers or they won't, no matter how "overdue" any combinations may be. Believe me, if you live in Chicago and follow the Cubs, you can witness the fallacy of the "overdue win" in real life every year ;) It's all random and the future can't be inferred from past events ...unless your local lottery is crooked ;)

Anyway, I hope you all have a great weekend, and I'll be sure to post the cleaned-up (and all put together) version of the script, which encompasses the last 2 weeks' worth of post topics, as soon as I can next week. Here's hoping I can find you some more funny stuff for the weekend. I promise: No number jokes ;)

Cheers,

, Mike



