I’m trying to delete some things from a block of text using regex. I have all of my patterns ready, but I can’t seem to remove two (or more) matches that overlap.
For example:
import re
r1 = r'I am'
r2 = r'am foo'
text = 'I am foo'
re.sub(r1, '', text) # Returns ' foo'
re.sub(r2, '', text) # Returns 'I '
How do I replace both of the occurrences simultaneously and end up with an empty string?
I ended up using a slightly modified version of Ned Batchelder’s answer:
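def clean(self, text):
    # `patterns` is the list of regexes defined elsewhere; one mask byte per character.
    mask = bytearray(len(text))
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            start, end = match.span()
            # Slice assignment needs a bytes value as long as the matched span.
            mask[start:end] = b'x' * (end - start)
    return ''.join(character for character, bit in zip(text, mask) if not bit)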
You can’t do it with consecutive re.sub calls as you have shown, because each call changes the text that the next pattern sees. You can use re.finditer to find them all. Each match will provide you with a match object, which has .start() and .end() methods indicating its position. You can collect all those together, and then remove the characters at the end.

Here I use a bytearray as a mutable string, used as a mask. It’s initialized to zero bytes, and I mark with an ‘x’ all the bytes that match any regex. Then I use the mask to select the characters to keep in the original string, and build a new string with only the unmatched characters.
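The masking approach described above could be sketched roughly as follows. This is a minimal illustration, not necessarily the code the answer originally contained; the function name `remove_all` and its argument order are assumptions made for the example.

```python
import re

def remove_all(patterns, text):
    """Delete every character covered by a match of any pattern (hypothetical helper)."""
    # One byte per character of text: 0 means keep, nonzero means delete.
    mask = bytearray(len(text))
    for pattern in patterns:
        # Always search the original text, so overlapping matches are still found.
        for match in re.finditer(pattern, text):
            start, end = match.span()
            mask[start:end] = b"x" * (end - start)
    # Keep only the characters whose mask byte is still zero.
    return "".join(ch for ch, marked in zip(text, mask) if not marked)

print(repr(remove_all([r"I am", r"am foo"], "I am foo")))  # ''
```

Because every pattern is matched against the unmodified text and deletion happens only once at the end, the order of the patterns does not matter.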