I have a scanned text, and there may be some garbage characters inside the words. Garbage characters are typically neither alphanumeric nor punctuation.
I have the following regex:
garbage_pat = re.compile(r"(\w*(?P<and>[^a-zA-Z0-9_ \t\n\r\f\v,.?!;:])+[\w(?P=and)]*)")
This regex correctly finds words that contain a single garbage character. If a word contains two or more garbage characters, however, the regex splits it.
For example, aut~mo¤il will be split into two words. How can I get my regex to return the whole word when it contains two or more garbage characters?
It seems that you are looking for an expression like this:
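Here is a minimal sketch of one such expression, assuming Python 3 and the same definition of garbage as in the question (anything that is not a word character, whitespace, or one of `,.?!;:`). The sample sentence is made up for illustration, and note that `\w` on Python 3 strings also matches non-ASCII letters, which is slightly broader than the `a-zA-Z0-9_` class in the question.

```python
import re

# A "garbage" character is anything that is not a word character,
# not whitespace, and not one of the allowed punctuation marks.
# The pattern matches: optional word characters, then one or more
# repetitions of (garbage run + optional word characters).
garbage_pat = re.compile(r"\w*(?:[^\w\s,.?!;:]+\w*)+")

# Hypothetical sample text, used only for illustration.
text = "The aut~mo¤il is fast, but the c#r is slow."

print(garbage_pat.findall(text))  # ['aut~mo¤il', 'c#r']
```

The original pattern most likely splits the word because `(?P=and)` inside square brackets is not a backreference: within a character class it is read as the literal characters `(`, `?`, `P`, `=`, `a`, `n`, `d`, `)`. The trailing class therefore only accepts word characters (plus those literals), and the match stops at the second garbage character. Repeating the whole "garbage, then word characters" group avoids the need for a backreference.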