This arose from a discussion on formalizing regular expression syntax. I’ve seen this behavior with several regular expression parsers, hence I tagged it language-agnostic.
Take the following expression (adjust it for your favorite language):
replace("input", "(.*)*", "$1")
It returns an empty string. Why?
Even more curiously, the expression replace("input", "(.*)*", "A$1B") returns the string ABAB. Why the double empty match?
Disclaimer: I know about backtracking and greedy matches, but the rules laid out by Jeffrey Friedl seem to dictate that .* matches everything and that no further backtracking or matching is done. Then why is $1 empty?
Note: compare with (.+)*, which returns the input string. However, http://regexhero.com shows that there are still two matches, which seems odd for the same reasons as above.
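For anyone who wants to reproduce the observations above, here is a minimal sketch in Python. It assumes Python 3.7 or later, since earlier versions treat an empty match that follows a non-empty one differently. Python's re.sub stands in for the generic replace, \1 stands in for $1, and the variable names are only illustrative.

```python
import re

text = "input"

# Python writes the backreference as \1 where the question writes $1.
emptied = re.sub(r"(.*)*", r"\1", text)
wrapped = re.sub(r"(.*)*", r"A\1B", text)
kept = re.sub(r"(.+)*", r"\1", text)

# Each tuple is the (start, end) span of one match found while scanning the input.
spans_star = [m.span() for m in re.finditer(r"(.*)*", text)]
spans_plus = [m.span() for m in re.finditer(r"(.+)*", text)]

print(repr(emptied), repr(wrapped), repr(kept))  # '' 'ABAB' 'input'
print(spans_star)  # [(0, 5), (5, 5)]
print(spans_plus)  # [(0, 5), (5, 5)]
```

Both patterns report two matches: one spanning the whole input and one empty match at the very end. Other engines may number or report these differently, so treat the spans as what this particular engine does.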
Let’s see what happens:
1. (.*) matches "input".
2. "input" is captured into group 1.
3. Because (.*) is repeated, another match attempt is made:
4. (.*) matches the empty string after "input".
5. The empty string is captured into group 1, overwriting "input".
6. $1 now contains the empty string.

A good question from the comments: why does replace("input", "(input)*", "A$1B") return "AinputBAB"?

1. (input)* matches "input". It is replaced by "AinputB".
2. (input)* matches the empty string. It is replaced by "AB" ($1 is empty because it didn’t participate in the match).
3. Result: "AinputBAB"
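The overwriting described above can be observed directly by asking the engine where group 1 ended up. This is a rough sketch in Python (3.7 or later assumed), not a statement about every engine. re.match, re.finditer and re.sub are standard library calls, while the variable names are made up for the example.

```python
import re

text = "input"

# First match of (.*)*: the overall match covers "input", but group 1 only
# remembers the last iteration, which matched the empty string at the end.
first = re.match(r"(.*)*", text)
whole_span = first.span(0)
group_span = first.span(1)

# (input)* from the comments: the second match is empty, and group 1 did not
# participate in it at all, so Python reports it as None.
groups = [m.group(1) for m in re.finditer(r"(input)*", text)]
result = re.sub(r"(input)*", r"A\1B", text)

print(whole_span, group_span)  # (0, 5) (5, 5)
print(groups)                  # ['input', None]
print(result)                  # AinputBAB
```

Note the difference between the two cases: with (.*)* group 1 did participate and captured an empty string, whereas with (input)* the group is simply unset in the second match. Both end up as an empty $1 in the replacement.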