From my understanding,
(.)(?<!\1)
should never match. In fact, PHP’s preg_replace refuses to compile it, and so does Ruby’s gsub. Python’s re module seems to have a different opinion, though:
import re
test = 'xAAAAAyBBBBz'
print(re.sub(r'(.)(?<!\1)', r'(\g<0>)', test))
Result:
(x)AAAA(A)(y)BBB(B)(z)
Can anyone provide a reasonable explanation for this behavior?
Update
This behavior appears to be a limitation in the re module. The alternative regex module seems to handle groups in assertions correctly:
import regex
test = 'xAAAAAyBBBBz'
print(regex.sub(r'(.)(?<!\1)', r'(\g<0>)', test))
## xAAAAAyBBBBz
print(regex.sub(r'(.)(.)(?<!\1)', r'(\g<0>)', test))
## (xA)AAA(Ay)BBB(Bz)
Note that, unlike PCRE, regex also allows variable-width lookbehinds:
print(regex.sub(r'(.)(?<![A-Z]+)', r'(\g<0>)', test))
## (x)AAAAA(y)BBBB(z)
regex has been proposed for inclusion in the standard library; PEP 411 mentions it as a candidate for provisional inclusion.
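If installing regex is not an option, the two-character pattern above can probably be expressed without any lookbehind. This is a minimal sketch, assuming the intent of (.)(.)(?<!\1) is "two characters, the second different from the first"; the variable name portable is made up for illustration. It uses only a negative lookahead, which the standard re module handles with backreferences.

```python
import re

test = 'xAAAAAyBBBBz'

# (.)(.)(?<!\1) checks, after consuming two characters, that the second
# one is not equal to group 1. The same condition can be tested before
# consuming the second character, with a negative lookahead:
portable = re.sub(r'(.)(?!\1)(.)', r'(\g<0>)', test)

print(portable)
## (xA)AAA(Ay)BBB(Bz)
```

The output agrees with what regex prints for the lookbehind version on this input. The rewrite relies on group 1 being exactly one character wide; for longer groups the lookahead would need to be placed differently.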
This does look like a limitation (a nice way of saying “bug”, as I learned from a support call with Microsoft) in the Python re module. I guess it has to do with the fact that Python does not support variable-length lookbehind assertions, but it’s not clever enough to figure out that \1 will always be fixed-length. Why it doesn’t complain about this when compiling the regex, I can’t say.
Funnily enough:
So it’s better not to use backreferences in lookbehind assertions in Python. Positive lookbehind isn’t much better (it also matches here as if it were a positive lookahead):
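For the negative case from the question, the "behaves like a lookahead" reading can be checked directly. The sketch below is only a consistency check, not a statement about how re is implemented: it replaces the lookbehind with an explicit negative lookahead and compares the result with the output quoted in the question. The names reported and as_lookahead are made up for illustration. As far as I know, the re documentation says fixed-length group references in lookbehinds are supported from Python 3.5 on, so the original lookbehind pattern may behave differently on newer interpreters; the lookahead used here does not depend on that.

```python
import re

test = 'xAAAAAyBBBBz'

# Output quoted in the question for r'(.)(?<!\1)'.
reported = '(x)AAAA(A)(y)BBB(B)(z)'

# Same pattern with the lookbehind swapped for a negative lookahead:
# match a character that is not followed by a copy of itself.
as_lookahead = re.sub(r'(.)(?!\1)', r'(\g<0>)', test)

print(as_lookahead)
## (x)AAAA(A)(y)BBB(B)(z)
print(as_lookahead == reported)
## True
```

So, on this input at least, the reported result is exactly what a negative lookahead would give.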
And I can’t even guess what’s going on here: