This question has been asked a few times on SO, but I couldn’t get any of the answers to work correctly. I need to extract all the URLs in a page, both in href attributes and in the plain text. I don’t need the individual groups of the regex; I need a list of strings, i.e. the URLs in the page. Could someone point me to a good working example?
I’d like to do this using regexes and not BeautifulSoup, etc.
Thank you.
HTML is not a regular language, and thus cannot be parsed by regular expressions.
It’s possible to make reasonable guesses using regular expressions, and/or to recognize a restricted subset of URIs, but that way lies madness (lengthy debugging processes, inaccurate results).
That said, if you’re willing to go down that path, see John Gruber’s regex for the purpose:
This can be used as follows:
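A minimal Python sketch of the general approach, under stated assumptions: the pattern below is a deliberately simplified stand-in written for illustration, not John Gruber’s regex, and the names `URL_PATTERN` and `extract_urls` are hypothetical. It only recognizes `http://` and `https://` URLs. Because the pattern contains no capturing groups, `re.findall` returns a flat list of strings rather than tuples of groups.

```python
import re

# Simplified illustrative pattern (an assumption, NOT Gruber's regex):
# an http(s) scheme followed by anything up to whitespace, a quote,
# or an angle bracket. No capturing groups, so findall yields strings.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""", re.IGNORECASE)


def extract_urls(text):
    """Return a list of URL strings found in href attributes or plain text."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # Trim trailing punctuation that most likely belongs to the
        # surrounding sentence rather than to the URL itself.
        urls.append(match.rstrip(".,;:!?)"))
    return urls


sample = '<a href="http://example.com/a?b=1">link</a> See https://example.org/page.'
print(extract_urls(sample))
# ['http://example.com/a?b=1', 'https://example.org/page']
```

This trades accuracy for simplicity: it misses scheme-less and relative links, and stripping a trailing `)` will truncate URLs that legitimately end in a parenthesis. If a more elaborate pattern with groups is substituted, making the groups non-capturing with `(?:...)`, or using `re.finditer` with `m.group(0)`, keeps the result a plain list of strings.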