Given the code below, I want to match the first form occurrence. I found

Asked: June 9, 2026 (2026-06-09T20:32:32+00:00)

Given the code below, I want to match the first form occurrence. I found out that negative lookahead ?! may be used to achieve that but it doesn’t work. What’s wrong with my regex?

#test
$test = "<form abc> foo </form> <form gg> bar </form>";
$test =~ m/<form[^>]*abc[^>]*>(?!.*form>.*)form>/s;
print $&;

1 Answer

Editorial Team · Answer 1 · 2026-06-09T20:32:34+00:00

First, before explaining the regex: use a module like HTML::TreeBuilder to create a document tree, then fetch your information from there. Parsing HTML with regexes is too error-prone to use in the real world.

The Problem with your regex

Here is your string:

"<form abc> foo </form> <form gg> bar </form>"

And your regex (written expanded for readability, as with the /x flag):

<form [^>]* abc [^>]* > (?! .* form> .* ) form>

<form anchors the match where this literal character sequence is found.
[^>]* matches any number of non-> characters. Initially it matches the space and abc.
abc has to match the literal character sequence abc. But at that point the regexp engine sees a >, so it has to backtrack until [^>]* matches only the single space. Then abc matches.
[^>]* will match nothing, as the engine sees a >.
> matches the >.
The negative lookahead succeeds only when the expression .* form> .* would not match at this position.
- The first .* initially consumes all characters until the end of the string.
- form> causes the engine to backtrack until the .* matches foo </form> <form gg> bar </, which leaves the final form> for the literal.
- The second .* matches nothing, but that is okay.

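So the pattern inside the lookahead matches, and because it is a negative lookahead, the assertion fails. There is no other way to match the opening tag, so the whole match fails and the last part of the regex (form>) is never even tried.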
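
The walkthrough above can be checked with a small script. This is only a sketch: the variable names ($rest, $original_matches, $inner_matches) are made up for illustration, and $rest is simply the text that follows the opening tag, copied by hand.

```perl
use strict;
use warnings;

my $test = "<form abc> foo </form> <form gg> bar </form>";

# The original pattern from the question.
my $original_matches = $test =~ m/<form[^>]*abc[^>]*>(?!.*form>.*)form>/s ? 1 : 0;

# Everything that follows the opening tag "<form abc>" (copied by hand).
my $rest = " foo </form> <form gg> bar </form>";

# The body of the lookahead, tried on its own at that position.
my $inner_matches = $rest =~ m/\A.*form>.*/s ? 1 : 0;

print "original pattern matches: $original_matches\n";   # prints 0
print "lookahead body matches:   $inner_matches\n";      # prints 1
```

Because the body of the lookahead can match, the negated assertion cannot hold, which is why the original pattern as a whole finds nothing.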

Strategies

The .* consumes too many characters in our case. This is called greedy matching.

Non-greedy matching is written with a trailing ?, as in .*?. This version consumes zero characters initially and first checks the next part of the pattern. If that doesn’t work, it consumes one more character and tries again, repeating until there is a match.
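
To see the difference, here is a minimal comparison on the string from the question. The variable names $greedy and $lazy are only illustrative.

```perl
use strict;
use warnings;

my $test = "<form abc> foo </form> <form gg> bar </form>";

# Greedy: .* runs to the end of the string and backtracks to the LAST </form>.
my ($greedy) = $test =~ m/(<form.*<\/form>)/s;

# Non-greedy: .*? grows one character at a time and stops at the FIRST </form>.
my ($lazy) = $test =~ m/(<form.*?<\/form>)/s;

print "greedy: $greedy\n";   # <form abc> foo </form> <form gg> bar </form>
print "lazy:   $lazy\n";     # <form abc> foo </form>
```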

A better Regex

<form [^>]* > .*? </form>

Inside the opening tag, only non-> characters are allowed. Between the tags, any character is allowed. We do non-greedy matching, so the first end tag matches and ends the regex.

However, this solution is a bit problematic. A tolerant HTML parser would not choke on an attribute such as attr="val<u>e". Our regex will, because [^>]* stops at the > inside the quoted value. Also, the first </form> is matched, which is undesirable if we have nested forms. While unproblematic in this use case, this regex is totally useless when matching <div>s or the like.
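
Both weaknesses can be reproduced with made-up input. The two sample strings below are hypothetical and not from the question; the first pattern adds a capture group around the form body only to make the damage visible.

```perl
use strict;
use warnings;

# A quoted attribute value that contains a ">".
my $quoted = '<form action="a>b"> foo </form>';
my ($body) = $quoted =~ m{ <form [^>]* > (.*?) </form> }xs;

# Nested forms: the non-greedy .*? stops at the inner end tag.
my $nested = '<form a> <form b> x </form> y </form>';
my ($whole) = $nested =~ m{ ( <form [^>]* > .*? </form> ) }xs;

print "body:  [$body]\n";    # [b"> foo ] - the tag was cut off inside the attribute
print "whole: [$whole]\n";   # [<form a> <form b> x </form>] - outer form is truncated
```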

Regexp Grammars

Perl regexes are incredibly powerful and allow you to declare recursive grammars. The built-in syntax is a bit awkward, but I recommend the Regexp::Grammars module to do that easily. Better yet, simply use a full-fledged HTML parser that already exists.
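
For illustration, here is roughly what the built-in recursive syntax looks like (Perl 5.10 or later), using (?1) to recurse into capture group 1. This is a sketch on a made-up input string, not a robust HTML matcher: it still assumes no > inside attribute values, and the names $html, $form and $outer are hypothetical.

```perl
use strict;
use warnings;

my $html = '<form a> x <form b> y </form> z </form> <form c> w </form>';

my $form = qr{
    (                            # group 1: one complete form element
        <form \b [^>]* >         # opening tag
        (?:
            (?1)                 # either a nested form: recurse into group 1
          |
            (?! </? form \b ) .  # or one character that does not start a form tag
        )*
        </form>                  # the end tag that belongs to this level
    )
}xs;

my ($outer) = $html =~ $form;
print "$outer\n";   # <form a> x <form b> y </form> z </form>
```

Unlike the non-greedy pattern, this version keeps the nested form inside the outer match, since every opening tag consumed inside the loop must be paired with its own end tag.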

Fetching the match

The use of $& (and $` and $') is discouraged, as it makes perl noticeably less efficient (at least in versions before 5.20). This won’t manifest itself in a small script, but it’s bad style anyway. Rather, enclose your whole regex in parens to capture the match

m{ ( <form [^>]* > .*? </form> ) }xs

and then use $1.
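
Put together with the string from the question, that might look like the following sketch. The /x flag permits the whitespace inside the pattern and /s lets . match newlines; $first_form is an illustrative name. $1 is read only after the match is known to have succeeded, because otherwise it would still hold a value from an earlier match.

```perl
use strict;
use warnings;

my $test = "<form abc> foo </form> <form gg> bar </form>";

my $first_form;
if ($test =~ m{ ( <form [^>]* > .*? </form> ) }xs) {
    $first_form = $1;          # the text captured by the outer parens
    print "$first_form\n";     # <form abc> foo </form>
}
else {
    print "no form found\n";
}
```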

The perlretut Tutorial may be a good introduction to understand Perl regexes.
