So my Perl script basically takes a string and then tries to clean it up by doing multiple search and replaces on it, like so:
$text =~ s/<[^>]+>/ /g;
$text =~ s/\s+/ /g;
$text =~ s/[\(\{\[]\d+[\)\}\]]/ /g;
$text =~ s/\s+[<>]+\s+/\. /g;
$text =~ s/\s+/ /g;
$text =~ s/\.*\s*[\*|\#]+\s*([A-Z\"])/\. $1/g; # replace . **** Begin or . #### Begin or ) *The
$text =~ s/\.\s*\([^\)]*\) ([A-Z])/\. $1/g; # . (blah blah) S... => . S...
As you can see, I’m dealing with nasty html and have to beat it into submission.
I’m hoping there is a simpler, aesthetically appealing way to do this. I have about 50 lines that look just like what is above.
I have solved one version of this problem by using a hash where the key is the comment and the value is the regular expression, like so:
%rxcheck = (
'time of day'=>'\d+:\d+',
'starts with capital letters then a capital word'=>'^([A-Z]+\s)+[A-Z][a-z]',
'ends with a single capital letter'=>'\b[A-Z]\.'
);
And this is how I use it:
foreach my $key (keys %rxcheck) {
if ($snippet =~ /$rxcheck{$key}/) { blah blah }
}
The problem comes up when I try my hand at a hash where the key is the expression and the value is what I want to replace it with… and there is a $1 or $2 in it.
%rxcheck2 = (
'(\w) \"'=>'$1\"'
);
The above is to do this:
$snippet =~ s/(\w) \"/$1\"/g;
But I can’t seem to pass the “$1” part into the regex literally (I think that’s the right word… it seems the $1 is being interpreted even though I used ‘ marks.) So this results in:
if($snippet =~ /$key/$rxcheck2{ $key }/g){ }
And that doesn’t work.
So 2 questions:
Easy: How do I handle large numbers of regexes in an easily editable way, so I can change and add them without just copying and pasting the line before?
Harder: How do I handle them using a hash (or array if I have, say, multiple pieces I want to include, like 1) part to search, 2) replacement 3) comment, 4) global/case insensitive modifiers), if that is in fact the easiest way to do this?
Thanks for your help –
Problem #1
As there doesn’t appear to be much structure shared by the individual regexes, there’s not really a simpler or clearer way than just listing the commands as you have done. One common approach to decreasing repetition in code like this is to move $text into $_, so that you no longer have to repeat $text =~ in front of every substitution: a bare s/// operates on $_ by default.
A common idiom for doing this is to use a degenerate for() loop as a topicalizer: wrapping the substitutions in for ($text) { ... } aliases $_ to $text for the duration of the block. The scope of this block will preserve any preexisting value of $_, so there’s no need to explicitly localize $_.
At this point, you’ve eliminated almost every non-boilerplate character. How much shorter can it get, even in theory? Unless what you really want (as your problem #2 suggests) is improved modularity, e.g., the ability to iterate over, report on, count etc. all regexes.
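The topicalizer idiom described above can be sketched roughly as follows. The sample string and the particular substitutions are made up for illustration (they are simplified versions of the ones in the question), so treat this as a sketch rather than a drop-in replacement:

```perl
use strict;
use warnings;

# Hypothetical sample input; any HTML-ish string would do.
my $text = '<p>Hello   <b>world</b></p> [12] done';

# for() aliases $_ to $text, so each bare s/// edits $text in place.
for ($text) {
    s/<[^>]+>/ /g;     # replace tags with a space
    s/\[\d+\]/ /g;     # drop bracketed numbers such as [12]
    s/\s+/ /g;         # collapse runs of whitespace
    s/^\s+|\s+$//g;    # trim both ends
}

print "$text\n";
```

Because $_ is only an alias inside the block, $text itself holds the cleaned-up string afterwards.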
Problem #2
You can use the qr// syntax to quote the “search” part of the substitution. However, I don’t know of a way of quoting the “replacement” part adequately. I had hoped that qr// would work for this too, but it doesn’t. There are two alternatives worth considering:
1. Use eval() in your foreach loop. This would enable you to keep your current %rxcheck2 hash. Downside: you should always be concerned about safety with string eval()s.
2. Use an array of anonymous subroutines, each of which performs one substitution on $_.
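For the second alternative, a minimal sketch might look like this. The names @fixes and $fix and the sample string are hypothetical; the first substitution is the one from the question. Since each replacement is ordinary Perl code inside a sub, $1 needs no special quoting:

```perl
use strict;
use warnings;

# Each element performs one substitution on whatever $_ currently is.
my @fixes = (
    sub { s/(\w) "/$1"/g },    # remove the space before a double quote
    sub { s/\s+/ /g },         # collapse runs of whitespace
);

my $snippet = 'a 12 "   wide    board';    # hypothetical sample input

# A named loop variable keeps $_ free, so the statement-modifier
# for() can alias $_ to $snippet while each sub runs.
for my $fix (@fixes) {
    $fix->() for $snippet;
}

print "$snippet\n";
```

Note the named loop variable: writing $_->() for @fixes would make $_ the code reference itself inside each sub, which is not what you want.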
You could of course use a hash instead, keyed on something more useful, and/or you could use multivalued elements (or hash values) including comments or other information.
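As one possible shape for such multivalued elements, here is a sketch using an array of hash references. The field names note, find and with are invented for this example, as are the sample string and the %hits counter. Flags such as /i travel inside the qr// object, while /g is applied where the substitution is performed:

```perl
use strict;
use warnings;

# Hypothetical rule table: a comment, a qr// pattern, and a replacement sub.
my @rules = (
    { note => 'no space before a double quote',
      find => qr/(\w) "/,
      with => sub { my ($word) = @_; qq{$word"} } },
    { note => 'collapse whitespace',
      find => qr/\s+/,
      with => sub { ' ' } },
    { note => 'fix the case of "perl"',
      find => qr/\bperl\b/i,
      with => sub { 'Perl' } },
);

my $snippet = 'PERL says a 12 "   board';    # hypothetical sample input
my %hits;                                    # substitutions made, per rule

for my $rule (@rules) {
    my ($find, $with) = @{$rule}{qw(find with)};
    # /e runs the replacement as code; the first capture is passed in
    # explicitly, so the sub does not depend on $1 directly.
    my $count = ($snippet =~ s/$find/$with->($1)/ge);
    $hits{ $rule->{note} } = $count || 0;
}

print "$snippet\n";
print "$_: $hits{$_}\n" for sort keys %hits;
```

An array keeps the rules in the order you wrote them, which matters when later substitutions depend on earlier ones; a plain hash would not guarantee that order.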