I want to replace every match of a regex with ‘*’, but only where the match is outside of <>. The whole point is to not interfere with the HTML tags.
I use this to replace:
re.sub(r'SOMEREGEX(?=[^>]*(<|$))', '*', line)
However, I ran into this problem: if my regex is:
f.*k
Then this:
fzzzzzzzzz<HTMLTAG>zzzzzzzk
Would become an ‘*’, which I don’t want. How do I overcome this problem?
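For reference, here is a minimal reproduction of the behaviour described above, with f.*k standing in for SOMEREGEX. The variable names are only illustrative.

```python
import re

# Stand-in for the user-supplied SOMEREGEX.
user_regex = r'f.*k'
line = 'fzzzzzzzzz<HTMLTAG>zzzzzzzk'

# The lookahead only checks what FOLLOWS the match, so a match that
# starts before the tag and ends after it is still accepted.
result = re.sub(user_regex + r'(?=[^>]*(<|$))', '*', line)
print(result)  # *
```

The greedy .* runs straight through <HTMLTAG>, and the lookahead is satisfied at the end of the line, so the whole line collapses into one ‘*’.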
Constraints:
-All brackets are matched
-No nested brackets
-SOMEREGEX is provided by the user. I prefer not changing that.
You could try replacing the . character (“any character at all”) with the character class [^<>], which matches any character except the angle brackets < and >. This would give the regex f[^<>]*k. This would match facebook but not face<b>book.
There are still things that can go wrong with this, though. Have you considered using a proper HTML parser instead of regular expressions? BeautifulSoup is easy, tasty and fun.
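A rough sketch of the suggestion, applied to the examples from this thread. The second part is a different technique that is not from the answer above: since the question prefers leaving SOMEREGEX untouched, it splits the line on tags and substitutes only in the text between them. The helper name sub_outside_tags is made up for illustration, and it leans on the stated constraints (matched, non-nested brackets).

```python
import re

# 1) The suggestion above: swap '.' for '[^<>]' in the pattern itself.
restricted = r'f[^<>]*k'
print(re.sub(restricted, '*', 'facebook'))                     # *
print(re.sub(restricted, '*', 'face<b>book'))                  # face<b>book
print(re.sub(restricted, '*', 'fzzzzzzzzz<HTMLTAG>zzzzzzzk'))  # unchanged


# 2) Alternative (an assumption, not the answer's method): keep the
#    user's pattern as-is and only apply it to the text between tags.
def sub_outside_tags(pattern, repl, line):
    # The capturing group makes re.split keep the tags in the result,
    # so even indices are plain text and odd indices are tags.
    parts = re.split(r'(<[^<>]*>)', line)
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(pattern, repl, parts[i])
    return ''.join(parts)


print(sub_outside_tags(r'f.*k', '*', 'fzzzzzzzzz<HTMLTAG>zzzzzzzk'))  # unchanged
print(sub_outside_tags(r'f.*k', '*', 'fork <b>fak</b>'))              # * <b>*</b>
```

One caveat with the second approach: each text segment is matched on its own, so a user pattern containing anchors such as ^ or $ would match at segment boundaries rather than only at the ends of the line. Neither version copes with a stray < or > in the text, which is where a real HTML parser earns its keep.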