I’m not very good with regex and I’m looking for the syntax to exclude something.
I’m parsing <, >, " and & in HTML code (to replace them with &lt;, etc.) and I need to exclude <br/> from parsing.
For example:
<html><br/>
<head><title></title></head><br/>
<body><br/>
</body><br/>
</html>
I tried something like r'<\b?![br]' and others, but they don’t work completely. I use re.sub() to do the replacement.
Ok, now the question is open again, I can do it as an answer, so…
Unless I’m missing something, and as long as it’s just <br/> (not any variants), then you can just replace <(?!br/>) with &lt; and (?<!<br/)> with &gt; and that’s it? In Python, that means one re.sub() call per pattern.
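A minimal sketch of those two substitutions, assuming the only tag to preserve is the exact string <br/>. The function name escape_except_br is made up for illustration and is not from the original post:

```python
import re

def escape_except_br(html):
    """Escape < and > everywhere except in the literal tag <br/> (hypothetical helper)."""
    # Replace every '<' that is NOT immediately followed by 'br/>'.
    html = re.sub(r'<(?!br/>)', '&lt;', html)
    # Replace every '>' that is NOT immediately preceded by '<br/'.
    html = re.sub(r'(?<!<br/)>', '&gt;', html)
    return html

escaped = escape_except_br('<html><br/>')
print(escaped)
```

If & and " also need escaping, the & replacement would probably have to run before these two calls, otherwise it would re-escape the & in the entities produced here.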
To explain what’s going on, (?!…) is a negative lookahead: it only successfully matches at a position if the following text does not match the sub-expression it contains. (Note that lookaheads do not consume the text matched by their sub-expression; they only verify whether it exists or not.)
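As a small illustration of the "does not consume" point, this sketch searches a made-up input string and inspects what was actually matched. Only the < itself ends up in the match, not the text the lookahead inspected:

```python
import re

# The first '<' starts '<br/>', so the lookahead rejects it;
# the search moves on to the '<' that starts '<p>'.
m = re.search(r'<(?!br/>)', '<br/><p>')

matched_text = m.group()   # only the '<' character
matched_span = m.span()    # position of that '<' in the input
print(matched_text, matched_span)
```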
Similarly, (?<!…) is a negative lookbehind, and does the same thing but using the preceding text. However, lookbehinds do have a slight difference from lookaheads (in some regex implementations), which is that the sub-expressions inside lookbehinds must represent fixed-width or limited-width matches.
Python is one of the ones that requires a fixed width, so whilst the above expression works (because it’s always four characters), if it was (?<!<br\s*/?)> then it would not be a valid regex for Python because it represents a variable-length match. (However, you can stack multiple lookbehinds, so you could potentially enumerate the assorted options manually, if that was necessary.)
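A rough sketch of both points, using the standard re module. The particular set of variants (<br>, <br/> and <br />) is an assumption chosen for illustration, not something from the question:

```python
import re

# The variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<!<br\s*/?)>')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Stacked fixed-width lookbehinds, one per variant to preserve (assumed set).
stacked = re.compile(r'(?<!<br)(?<!<br/)(?<!<br /)>')

result = stacked.sub('&gt;', '<p><br><br/><br />')
print(variable_width_ok, result)
```

Each lookbehind is checked at the same position, so the > is only replaced when none of the listed variants precedes it.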