The Archive Base

Editorial Team
Asked: June 9, 2026 (2026-06-09T20:59:30+00:00)


Here is my regexp for finding URLs in a string (I need the capture group for the domain, because further actions depend on it). I noticed that for some strings, 'fffffffff' in this example, it is very slow. Is there something obvious I am missing?

>>> import re
>>> URL_ALLOWED = r"[a-z0-9$-_.+!*'(),%]"
>>> URL_RE = re.compile(
...     r'(?:(?:https?|ftp):\/\/)?'  # protocol
...     r'(?:www.)?' # www
...     r'('  # host - start
...         r'(?:'
...             r'[a-z0-9]'  # first character of domain('-' not allowed)
...             r'(?:'
...                 r'[a-z0-0-]*'  #  characters in the middle of domain
...                 r'[a-z0-9]' #  last character of domain('-' not allowed)
...             r')*'
...             r'\.'  # dot before next part of domain name
...         r')+'
...         r'[a-z]{2,10}'  # TLD
...         r'|'  # OR
...         r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}'  # IP address
...     r')' # host - end
...     r'(?::[0-9]+)?'  # port
...     r'(?:\/%(allowed_chars)s+/?)*'  # path
...     r'(?:\?(?:%(allowed_chars)s+=%(allowed_chars)s+&)*'  # GET params
...     r'%(allowed_chars)s+=%(allowed_chars)s+)?'  # last GET param
...     r'(?:#[^\s]*)?' % {  # anchor
...         'allowed_chars': URL_ALLOWED
...     },
...     re.IGNORECASE
... )
>>> from time import time
>>> strings = [
...     'foo bar baz',
...     'blah blah blah blah blah blah',
...     'f' * 10,
...     'f' * 20,
...     'f' * 30,
...     'f' * 40,
... ]
>>> def t():
...     for string in strings:
...             t1 = time()
...             URL_RE.findall(string)
...             print string, time() - t1
... 
>>> t()
foo bar baz 3.91006469727e-05
blah blah blah blah blah blah 6.98566436768e-05
ffffffffff 0.000313997268677
ffffffffffffffffffff 0.183916091919
ffffffffffffffffffffffffffffff 178.445468903

Yes, I know another solution is to use a very simple regexp (for example, any word that contains dots) and call urlparse afterwards to get the domain, but urlparse doesn't work as expected when the URL has no protocol:

>>> from urlparse import urlparse  # Python 2; in Python 3 use urllib.parse
>>> urlparse('example.com')
ParseResult(scheme='', netloc='', path='example.com', params='', query='', fragment='')
>>> urlparse('http://example.com')
ParseResult(scheme='http', netloc='example.com', path='', params='', query='', fragment='')
>>> urlparse('example.com/test/test')
ParseResult(scheme='', netloc='', path='example.com/test/test', params='', query='', fragment='')
>>> urlparse('http://example.com/test/test')
ParseResult(scheme='http', netloc='example.com', path='/test/test', params='', query='', fragment='')
>>> urlparse('example.com:1234/test/test')
ParseResult(scheme='example.com', netloc='', path='1234/test/test', params='', query='', fragment='')
>>> urlparse('http://example.com:1234/test/test')
ParseResult(scheme='http', netloc='example.com:1234', path='/test/test', params='', query='', fragment='')

Prepending http:// is also a solution (though I'm still not 100% sure there are no other urlparse issues), but I'm curious what's wrong with this regexp anyway.
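The prepend-a-scheme workaround mentioned above could look roughly like this. It is only a sketch and assumes Python 3's `urllib.parse` rather than the Python 2 `urlparse` module used in the session; `extract_host` and `default_scheme` are hypothetical names, not part of any library.

```python
import re
from urllib.parse import urlsplit

# A candidate counts as "having a scheme" only if it starts with scheme://
_SCHEME_RE = re.compile(r'^[a-z][a-z0-9+.-]*://', re.IGNORECASE)


def extract_host(candidate, default_scheme='http'):
    """Return the host of a URL-like string, adding a scheme if it is missing.

    Hypothetical helper: without a scheme, urlsplit puts everything
    into the path, so one is prepended before parsing.
    """
    if not _SCHEME_RE.match(candidate):
        candidate = default_scheme + '://' + candidate
    # .hostname strips the port and lowercases the host
    return urlsplit(candidate).hostname


for url in ('example.com', 'example.com:1234/test/test',
            'http://example.com/test/test'):
    print(url, '->', extract_host(url))
```

This only covers the missing-scheme case shown in the transcript; it does not validate that the result is a real domain.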

1 Answer

  1. Editorial Team
    Added an answer on June 9, 2026 at 8:59 pm (2026-06-09T20:59:32+00:00)

    I think it happens because of this part:

    ...         r'(?:'
    ...             r'[a-z0-9]'  # first character of domain('-' not allowed)
    ...             r'(?:'
    ...                 r'[a-z0-0-]*'  #  characters in the middle of domain
    ...                 r'[a-z0-9]' #  last character of domain('-' not allowed)
    ...             r')*'
    ...             r'\.'  # dot before next part of domain name
    ...         r')+'
    

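    You should not use a construction like ([set_of_symbols#1]*[set_of_symbols#2])* when set_of_symbols#1 and set_of_symbols#2 share symbols. The same run of characters can then be divided between the inner and the outer repetition in exponentially many ways, and when the following \. fails to match, the engine tries all of them before giving up.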
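To get a feel for the numbers, here is a small sketch that counts how many ways a run of n identical characters can be cut into consecutive non-empty pieces. Each piece corresponds to one iteration of the inner group, so this is roughly the number of alternatives a backtracking engine may have to explore. `count_splits` is a hypothetical helper written for illustration, and the count is an estimate of the search space, not an exact step count for the `re` module.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Ways to cut n characters into consecutive non-empty pieces."""
    if n == 0:
        return 1  # nothing left to cut: exactly one (empty) way
    # Choose the length k of the first piece, then cut the rest.
    return sum(count_splits(n - k) for k in range(1, n + 1))


for n in (10, 20, 30):
    print(n, count_splits(n))
```

The count doubles with every extra character (2**(n-1)), which fits the timings in the question: going from 20 to 30 characters made the search about a thousand times slower.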

    Please try to use the following code:

    ...         r'(?:'
    ...             r'[a-z0-9]'  # first character of domain('-' not allowed)
    ...             r'[a-z0-9-]*'  #  characters in the middle of domain ('0-0' above looks like a typo for '0-9')
    ...             r'(?<=[a-z0-9])' #  lookbehind: last character of domain must not be '-'
    ...             r'\.'  # dot before next part of domain name
    ...         r')+'
    

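    It should work better: each domain label is now matched by a single repetition, so a failed match backtracks through it only once instead of trying every possible split.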
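A quick way to check the suggestion is to test the host part on its own. The sketch below assumes Python 3 and that `0-0` in the original character class was meant to be `0-9`; `HOST_RE` and `find_hosts` are hypothetical names, and the rest of the URL pattern (protocol, port, path, query) is left out.

```python
import re
import time

# Host part only, with the nested quantifier flattened as suggested above.
# Assumption: "0-0" in the original character class was a typo for "0-9".
HOST_RE = re.compile(
    r'('
    r'(?:'
    r'[a-z0-9]'        # first character of a label ('-' not allowed)
    r'[a-z0-9-]*'      # middle characters of the label
    r'(?<=[a-z0-9])'   # lookbehind: label must not end with '-'
    r'\.'              # dot before the next label
    r')+'
    r'[a-z]{2,10}'     # TLD
    r')',
    re.IGNORECASE,
)


def find_hosts(text):
    """Return every host-like substring found in text."""
    return HOST_RE.findall(text)


start = time.perf_counter()
no_match = find_hosts('f' * 40)   # the input that hung the original pattern
elapsed = time.perf_counter() - start
print(no_match, elapsed)
print(find_hosts('see example.com and sub.my-site.org'))
```

On the 40-character run of 'f' this returns an empty list almost immediately, because each starting position costs at most a linear amount of backtracking.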
