Not a complete newbie, but I still don’t understand everything about regular expressions. I was trying to use a regex to strip out <p> tags, and my first attempt
<p\s*.*>
was so greedy it caught the whole line
<p someAttributes='example'>SomeText</p>
I got it to work with
((.|\s)*?)
This seems like it should be just as greedy. Can anyone help me understand why it isn’t?
Trying to make this question as language non-specific as possible, but I was doing this with ColdFusion’s reReplaceNoCase if it makes a lot of difference.
The key difference is the *? part, which creates a reluctant quantifier, so it tries to match as little as possible. The standard quantifier * is a greedy quantifier and tries to match as much as possible. See e.g. Greedy vs. Reluctant vs. Possessive Quantifiers
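To make the difference concrete, here is a minimal Python sketch that runs the greedy pattern from the question and a reluctant variant against the example line. Python is used only for illustration (ColdFusion's engine may differ in details), and the reluctant pattern <p\s*.*?> is an illustrative variant, not the exact expression from the question.

```python
import re

line = "<p someAttributes='example'>SomeText</p>"

# Greedy: .* first consumes the rest of the line, then backtracks only
# as far as the LAST '>', so the match spans the whole line.
greedy = re.search(r"<p\s*.*>", line).group(0)

# Reluctant: .*? grows one character at a time and stops at the
# FIRST '>', so only the opening tag is matched.
reluctant = re.search(r"<p\s*.*?>", line).group(0)

print(greedy)     # <p someAttributes='example'>SomeText</p>
print(reluctant)  # <p someAttributes='example'>
```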
As Seth Robertson noted, you might want to use a regex that does not depend on the greedy/reluctant behaviour. Indeed, you can write a possessive regex for best performance:
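A pattern consistent with the description that follows would be <p\s*+[^>]*+> (this exact form is an assumption). The Python sketch below tries it on the example line from the question. The name OPEN_P is hypothetical, and because Python's re module only accepts possessive quantifiers from version 3.11, the sketch falls back to the plain greedy form on older versions, which matches the same text here.

```python
import re
import sys

# Possessive quantifiers (*+) need Python 3.11+; older versions use the
# greedy equivalent, which finds the same matches for this input.
if sys.version_info >= (3, 11):
    OPEN_P = re.compile(r"<p\s*+[^>]*+>", re.IGNORECASE)
else:
    OPEN_P = re.compile(r"<p\s*[^>]*>", re.IGNORECASE)

line = "<p someAttributes='example'>SomeText</p>"

# [^>] can never run past the first '>', so the result no longer depends
# on greedy versus reluctant matching.
opening = OPEN_P.search(line).group(0)
stripped = OPEN_P.sub("", line)

print(opening)   # <p someAttributes='example'>
print(stripped)  # SomeText</p>
```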
Here, \s*+ matches any number of whitespace characters, while [^>]*+ matches any number of characters except >. Neither quantifier backtracks on a mismatch, which improves runtime when the match fails and, for some regex implementations, also when it succeeds (because internal backtracking data can be omitted).
Note that if there are other tags starting with <p (I haven’t written HTML directly for a long time), you will match these too. If you don’t want that, use a regex like this:
This makes the whole section between <p and > optional.
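One way to get that behaviour, sketched in Python under the assumption that the attribute section must start with whitespace, is a pattern of the form <p(?:\s++[^>]*+)?>. The pattern, the name STRICT_P, and the sample tags are illustrative assumptions rather than the exact expression intended above. As before, the possessive form needs Python 3.11 or later, so older versions fall back to the greedy equivalent.

```python
import re
import sys

# Either '>' follows '<p' directly, or there is whitespace plus optional
# attributes first. Tags such as <pre> or <param> satisfy neither case.
if sys.version_info >= (3, 11):
    STRICT_P = re.compile(r"<p(?:\s++[^>]*+)?>", re.IGNORECASE)
else:
    STRICT_P = re.compile(r"<p(?:\s+[^>]*)?>", re.IGNORECASE)

samples = ["<p>", "<p class='x'>", "<pre>", "<param name='a'>"]

# Keep only the samples that are matched in full by the pattern.
matched = [s for s in samples if STRICT_P.fullmatch(s)]

print(matched)  # ['<p>', "<p class='x'>"]
```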