How can I replace consecutive repeated symbols with just one, e.g. "this is a dog???"

Editorial Team

Asked: May 16, 2026

I want to replace runs of consecutive repeated symbols with a single one, for example:

this is a dog???

to

this is a dog?

I’m using

str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)

However, I noticed that this might also replace symbols inside URLs that appear in my text,

like http://example.com/this--is-a-page.html

Can someone give me some advice on how to alter my regex?


1 Answer

Editorial Team · Answer 1 · 2026-05-16T19:52:44+00:00

So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for “parse HTML with regex” to find out why that might not be such a good idea.

Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don’t want to replace them inside a URL. How can you tell what a URL is? They don’t always start with http – let’s say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.

Furthermore, you’ll find lots of duplicate symbols that you definitely don’t want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you’re working on (||, && etc. come to mind).

So you might come up with something like

(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+

which happens to work on the source code of this very page but will surely fail in other cases (for example, if URLs don’t start with ftp, http or mailto). Plus, it won’t work in Python, since it uses variable-length repetition (\S+) inside a lookbehind, and Python’s re module only accepts fixed-width lookbehind.
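To see the lookbehind limitation for yourself, here is a small check, assuming the standard-library re module on a current Python 3. The helper name compiles is made up for this illustration; the pattern is the one quoted above.

```python
import re

# The lookbehind pattern quoted above; \S+ inside (?<!...) has variable width.
LOOKBEHIND_PATTERN = r"""(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+"""


def compiles(pattern):
    """Return (True, None) if `pattern` compiles, else (False, error message)."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return False, str(exc)
    return True, None


ok, message = compiles(LOOKBEHIND_PATTERN)
print(ok, message)
```

The same pattern may be accepted by regex engines that allow variable-width lookbehind, so this only says something about Python's built-in re.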

All in all, you probably won’t get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
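A rough sketch of that parse, edit, write-back idea, assuming the standard library's html.parser is good enough as the "real parser" (no specific parser is named here, so that choice is mine). The class SymbolCollapser and the function collapse_in_html are hypothetical names. It only touches text outside <script> and <style>, and it does nothing about URLs inside the body text.

```python
import re
from html.parser import HTMLParser

# Same idea as the regex in the question: one symbol, then repeats of it.
REPEATED = re.compile(r"([^\s\w])(?:\s*\1)+")


class SymbolCollapser(HTMLParser):
    """Re-emit the document, collapsing repeated symbols in text nodes only."""

    SKIP = {"script", "style"}  # leave code and CSS untouched

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # raw tag, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            data = REPEATED.sub(r"\1", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def collapse_in_html(html):
    parser = SymbolCollapser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(collapse_in_html("<p>this is a dog???</p><script>if (a || b) {}</script>"))
```

This is not a faithful round trip for every document (end tags are rewritten in lowercase, for instance), so treat it as a starting point rather than a finished tool.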

EDIT:

OK, you’re already working on the parsed text, but it still might contain URLs.

Then try the following:

result = re.sub(
    r"""(?ix) # case-insensitive, verbose regex

    # Either match a URL 
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])

    # or
    |

    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""", 

    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)

Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)…

Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can’t reference a group in a replacement if it hasn’t participated in the match. I give up… 🙁
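One way around the unmatched-group complaint is to pass a function as the replacement instead of a template string, so that only the group that actually matched is used. Below is a sketch of that, assuming a current Python 3, where \s* in this position compiles without trouble. The names URL_OR_REPEAT, collapse_repeated_symbols and keep_one are invented for the example; the URL sub-pattern is the one from the attempt above.

```python
import re

URL_OR_REPEAT = re.compile(
    r"""(?ix) # case-insensitive, verbose regex

    # Either match a URL (kept unchanged) ...
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$])

    |

    # ... or a symbol followed by repeats of itself (optionally space-separated)
    (?P<rpt>[^\s\w])(?:\s*(?P=rpt))+
    """
)


def collapse_repeated_symbols(text):
    """Collapse runs of one repeated symbol, leaving matched URLs alone."""

    def keep_one(match):
        # Exactly one of the two groups took part in the match;
        # the other one is None, so `or` picks the right text.
        return match.group("URL") or match.group("rpt")

    return URL_OR_REPEAT.sub(keep_one, text)


print(collapse_repeated_symbols("this is a dog???"))
print(collapse_repeated_symbols("see http://example.com/this--is-a-page.html !!!"))
```

As far as I know, since Python 3.5 an unmatched group in a replacement template is replaced with an empty string, so the \g<URL>\g<rpt> template should also work there; the function form just avoids depending on that. URLs that don't start with a scheme, www. or ftp. will still slip through to the second branch.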
