Given the code below, I want to match the first form occurrence. I found out that negative lookahead ?! may be used to achieve that but it doesn’t work. What’s wrong with my regex?
#test
$test = "<form abc> foo </form> <form gg> bar </form>";
$test =~ m/<form[^>]*abc[^>]*>(?!.*form>.*)form>/s;
print $&;
First, before explaining the regex: Use a module like HTML::TreeBuilder to create a document tree, then fetch your information from there. Parsing HTML with regexes is too error-prone to use in the real world.
The Problem with your regex
Here is your string: <form abc> foo </form> <form gg> bar </form>
And your regex (written expanded for readability, as with the /x flag), piece by piece:
- <form anchors the match where this literal character sequence is found.
- [^>]* searches for a run of non-> characters. Initially it matches " abc".
- abc matches the literal character sequence abc. But because the regex engine is currently looking at a >, it has to backtrack until [^>]* matches only the leading space.
- [^>]* will match nothing, as the engine sees a >.
- > matches the literal > that closes the tag.
- The negative lookahead matches when the expression .*form>.* would not match:
  - .* would consume all characters until the end of the string.
  - form> causes the engine to backtrack until the .* matches " foo </form> <form gg> bar </".
  - The final .* matches nothing, but that is okay.
So the lookahead succeeds, but it is a negative lookahead, so the assertion fails. The last part of the regex will not even be executed.
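The walkthrough above can be checked piece by piece. The following is only a sketch using the string from the question; the variable names ($tag_ok, $rest, $lookahead_ok, $full_ok) are made up for illustration.

```perl
use strict;
use warnings;

my $test = "<form abc> foo </form> <form gg> bar </form>";

# The opening-tag part of the pattern matches on its own.
my $tag_ok = $test =~ m/<form[^>]*abc[^>]*>/s ? 1 : 0;

# This is what follows the first opening tag. ".*form>.*" can still match
# here, so the negative lookahead rejects this position.
my $rest = " foo </form> <form gg> bar </form>";
my $lookahead_ok = $rest =~ m/\A(?!.*form>.*)/s ? 1 : 0;

# Consequently the full pattern from the question finds nothing.
my $full_ok = $test =~ m/<form[^>]*abc[^>]*>(?!.*form>.*)form>/s ? 1 : 0;

print "tag: $tag_ok, lookahead: $lookahead_ok, full: $full_ok\n";
# prints "tag: 1, lookahead: 0, full: 0"
```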
Strategies
The .* consumes too many characters in our case. This is called greedy matching.
Non-greedy matching is written with a trailing ? like .*?. This version consumes zero characters initially and first checks the next part of the pattern. If that doesn’t work, it consumes one more character at a time until there is a match.
A better Regex
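A minimal sketch of such a pattern, assuming the test string from the question. The exact pattern and the variable name $first_form are illustrative choices, not the only way to write it.

```perl
use strict;
use warnings;

my $test = "<form abc> foo </form> <form gg> bar </form>";

# [^>]* keeps the match inside the opening tag; .*? is non-greedy, so the
# match stops at the first </form>. The /s flag lets . match newlines too.
my $first_form = '';
if ($test =~ m{(<form[^>]*abc[^>]*>.*?</form>)}s) {
    $first_form = $1;
}
print "$first_form\n";   # prints "<form abc> foo </form>"
```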
Inside the opening tag, only non-> characters are allowed. Between the tags, any character is allowed. We do non-greedy matching, so the first end tag matches and ends the regex.
However, this solution is a bit problematic. A tolerant HTML parser would not choke on an attr="val<u>e". We will. Also, the first </form> is matched, which is undesirable in the event that we have nested forms. While unproblematic in this use case, this regex is totally useless when matching <div>s or the like.
Regexp Grammars
Perl regexes are incredibly powerful and allow you to declare recursive grammars. The built-in syntax is a bit awkward, but I recommend the Regexp::Grammars module to do that easily. Better yet, simply use a fully-fledged HTML parser already lying around.
Fetching the match
The use of $& (and $` and $') is discouraged, as it makes perl incredibly inefficient. This won’t manifest itself in a small script, but it’s bad style anyway. Rather, enclose your whole regexp in parens to capture the match, and then use $1.
The perlretut tutorial may be a good introduction to understand Perl regexes.
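As a sketch of the capturing advice, a match in list context returns its capture groups, so the result can be assigned directly without touching $& or even $1. The variable name $match is illustrative, and the pattern is the non-greedy one discussed earlier.

```perl
use strict;
use warnings;

my $test = "<form abc> foo </form> <form gg> bar </form>";

# In list context, m// returns the captured groups; the parentheses on the
# left-hand side force list context, so $match receives the first capture.
my ($match) = $test =~ m{(<form[^>]*abc[^>]*>.*?</form>)}s;

print defined $match ? "$match\n" : "no match\n";
```

If the pattern does not match, $match stays undefined, which is why the print is guarded with defined.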