Here is my regexp for finding URLs in some string (i need the group

Question

0

Asked: June 9, 20262026-06-09T20:59:30+00:00 2026-06-09T20:59:30+00:00

Here is my regexp for finding URLs in some string (i need the group

0

Here is my regexp for finding URLs in some string (i need the group for the domain because further actions are based on the domain) and i noticed for some strings ‘fffffffff’ in this example it’s very slow, there is something obvious i missing?

>>> URL_ALLOWED = r"[a-z0-9$-_.+!*'(),%]"
>>> URL_RE = re.compile(
...     r'(?:(?:https?|ftp):\/\/)?'  # protocol
...     r'(?:www.)?' # www
...     r'('  # host - start
...         r'(?:'
...             r'[a-z0-9]'  # first character of domain('-' not allowed)
...             r'(?:'
...                 r'[a-z0-0-]*'  #  characters in the middle of domain
...                 r'[a-z0-9]' #  last character of domain('-' not allowed)
...             r')*'
...             r'\.'  # dot before next part of domain name
...         r')+'
...         r'[a-z]{2,10}'  # TLD
...         r'|'  # OR
...         r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}'  # IP address
...     r')' # host - end
...     r'(?::[0-9]+)?'  # port
...     r'(?:\/%(allowed_chars)s+/?)*'  # path
...     r'(?:\?(?:%(allowed_chars)s+=%(allowed_chars)s+&)*'  # GET params
...     r'%(allowed_chars)s+=%(allowed_chars)s+)?'  # last GET param
...     r'(?:#[^\s]*)?' % {  # anchor
...         'allowed_chars': URL_ALLOWED
...     },
...     re.IGNORECASE
... )
>>> from time import time
>>> strings = [
...     'foo bar baz',
...     'blah blah blah blah blah blah',
...     'f' * 10,
...     'f' * 20,
...     'f' * 30,
...     'f' * 40,
... ]
>>> def t():
...     for string in strings:
...             t1 = time()
...             URL_RE.findall(string)
...             print string, time() - t1
... 
>>> t()
foo bar baz 3.91006469727e-05
blah blah blah blah blah blah 6.98566436768e-05
ffffffffff 0.000313997268677
ffffffffffffffffffff 0.183916091919
ffffffffffffffffffffffffffffff 178.445468903

Yeah i know there is another solution to use very simple regexp (word that contain dots for example) and use urlparse later to get domain, but urlparse doesn’t work as expected when we don’t have protocol in URL:

>>> urlparse('example.com')
ParseResult(scheme='', netloc='', path='example.com', params='', query='', fragment='')
>>> urlparse('http://example.com')
ParseResult(scheme='http', netloc='example.com', path='', params='', query='', fragment='')
>>> urlparse('example.com/test/test')
ParseResult(scheme='', netloc='', path='example.com/test/test', params='', query='', fragment='')
>>> urlparse('http://example.com/test/test')
ParseResult(scheme='http', netloc='example.com', path='/test/test', params='', query='', fragment='')
>>> urlparse('example.com:1234/test/test')
ParseResult(scheme='example.com', netloc='', path='1234/test/test', params='', query='', fragment='')
>>> urlparse('http://example.com:1234/test/test')
ParseResult(scheme='http', netloc='example.com:1234', path='/test/test', params='', query='', fragment='')

Yeah prepending http:// is also a solution(i’m still not 100% sure if there are no other urlparse issues) but i’m curious what’s wrong with this regexp anyway

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-09T20:59:32+00:00

I think it happens becuase of this part

...         r'(?:'
...             r'[a-z0-9]'  # first character of domain('-' not allowed)
...             r'(?:'
...                 r'[a-z0-0-]*'  #  characters in the middle of domain
...                 r'[a-z0-9]' #  last character of domain('-' not allowed)
...             r')*'
...             r'\.'  # dot before next part of domain name
...         r')+'

You should not use construction like this ([set_of_symbols#1]*[set_of_symbols#2])* if set_of_symbols#1 and set_of_symbols#2 have same symbols.

Please try to use the following code:

...         r'(?:'
...             r'[a-z0-9]'  # first character of domain('-' not allowed)
...             r'[a-z0-0-]*'  #  characters in the middle of domain
...             r'(?<=[a-z0-9])' #  last character of domain('-' not allowed)
...             r'\.'  # dot before next part of domain name
...         r')+'

It should work better.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Here is my regexp for finding URLs in some string (i need the group

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply