This question has been asked a few times on SO but I couldn’t get

Asked: May 19, 2026 (2026-05-19T03:38:24+00:00)

This question has been asked a few times on SO, but I couldn’t get any of the answers to work correctly. I need to extract all the URLs in a page, both in href links and in the plain text. I don’t need the individual groups of the regex; I need a list of strings, i.e. the URLs in the page. Could someone point me to a good working example?

I’d like to do this using regexes and not BeautifulSoup, etc.

Thank you.

1 Answer

Editorial Team · Answer 1 · 2026-05-19T03:38:24+00:00

HTML is not a regular language, and thus cannot be parsed by regular expressions.

It’s possible to make reasonable guesses using regular expressions, and/or to recognize a restricted subset of URIs, but that way lies madness (lengthy debugging processes, inaccurate results).
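As an illustration of the "restricted subset" approach, here is a minimal sketch that accepts only absolute http(s) URLs and returns them as a list of strings, whether they appear in href attributes or in plain text. The pattern, the trailing-punctuation rule, and the helper name `find_http_urls` are assumptions made for this example, not part of any standard; expect false positives and misses on real pages.

```python
import re

# Assumption: only absolute http(s) URLs are wanted, and a URL ends at
# whitespace, a quote character, or an angle bracket.
HTTP_URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def find_http_urls(text):
    """Return every http(s) URL found in text, in order of appearance."""
    # Strip trailing sentence punctuation, which is rarely part of the URL.
    return [m.rstrip('.,;:!?)') for m in HTTP_URL_RE.findall(text)]

html = '<a href="http://example.com/a">link</a> See https://example.org/b.'
print(find_http_urls(html))
# ['http://example.com/a', 'https://example.org/b']
```

Because the pattern has no capturing groups, `findall` returns whole matches rather than tuples of groups, which gives the plain list of strings the question asks for.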

That said, if you’re willing to go that path, see John Gruber’s regex for the purpose:

import re
import string

def extract_urls(your_text):
  # Python's re module has no POSIX [:punct:] class, so string.punctuation stands in for it.
  url_re = re.compile(r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^\s%s]|/)))' % re.escape(string.punctuation))
  for match in url_re.finditer(your_text):
    yield match.group(0)

This can be used as follows:

>>> for uri in extract_urls('http://foo.bar/baz irc://freenode.org/bash'):
...   print(uri)
http://foo.bar/baz
irc://freenode.org/bash
