I hope this question is not a RTFM one. I am trying to write

Asked: May 11, 2026 (2026-05-11T02:27:44+00:00)

I hope this question is not a RTFM one. I am trying to write a Python script that extracts links from a standard HTML webpage (the <link href... tags). I have searched the web for matching regexen and found many different patterns. Is there any agreed, standard regex to match links?

Adam

UPDATE: I am actually looking for two different answers:

1. What’s the library solution for parsing HTML links? Beautiful Soup seems to be a good solution (thanks, Igal Serban and cletus!).
2. Can a link be defined using a regex?

1 Answer

score 0 · Answer 1 · 2026-05-11T02:27:45+00:00

As others have suggested, if real-time-like performance isn’t necessary, BeautifulSoup is a good solution:

    # Python 2 with BeautifulSoup 3
    import urllib2
    from BeautifulSoup import BeautifulSoup

    html = urllib2.urlopen('http://www.google.com').read()
    soup = BeautifulSoup(html)
    all_links = soup.findAll('a')  # list of every <a> tag in the page

As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.
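For readers on Python 3, where urllib2 no longer exists, here is a minimal sketch of the same "parse, don't pattern-match" idea using only the standard library's html.parser module. Note that this is a swapped-in technique, not BeautifulSoup itself; LinkCollector and extract_links are hypothetical names chosen for illustration, and the sample markup is made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every <a> tag found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Deliberately sloppy markup: uppercase tag, unquoted attribute, unclosed elements.
sample = '<p>See <A HREF=http://example.com/a>one</A> and <a class="x" href="/b">two'
print(extract_links(sample))
```

Unlike BeautifulSoup, html.parser builds no tree and only reports tags as it encounters them, so treat this as a starting point rather than a drop-in replacement.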

If you are certain to be working on standard XHTML, you can use (much) faster XML parsers like expat.
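As a rough sketch of the expat route, assuming the input really is well-formed XHTML: the standard library exposes expat as xml.parsers.expat. The function name extract_links_xhtml and the sample document are hypothetical, made up for this example.

```python
import xml.parsers.expat


def extract_links_xhtml(xhtml):
    """Return href values of <a> elements in a well-formed XHTML string."""
    links = []

    def start_element(name, attrs):
        # expat reports names exactly as written; XHTML element names are lowercase.
        if name == "a" and "href" in attrs:
            links.append(attrs["href"])

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xhtml, True)  # True marks this as the final chunk of input
    return links


sample_xhtml = (
    '<html><body><p><a href="/a">one</a> '
    '<a href="/b">two</a></p></body></html>'
)
print(extract_links_xhtml(sample_xhtml))
```

The trade-off is strictness: on markup that is not well-formed (a mismatched or unclosed tag, for instance), Parse raises xml.parsers.expat.ExpatError instead of guessing.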

Regex, for the reasons above (the parser must maintain state, and regex can’t do that), will never be a general solution.
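To make that concrete, here is a small sketch of one plausible-looking pattern going wrong. NAIVE_LINK_RE is an illustrative pattern invented for this example, not a recommended or standard one, and the sample markup is made up as well.

```python
import re

# Illustrative only: matches <a ... href="..."> with a double-quoted value.
NAIVE_LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]*)"', re.IGNORECASE)


def naive_links(html):
    """Return whatever the naive pattern captures as link targets."""
    return NAIVE_LINK_RE.findall(html)


sample_html = (
    '<!-- <a href="/old">commented out</a> -->\n'
    "<a href='/single'>single quotes</a>\n"
    '<a title="x > y" href="/gt">earlier attribute contains ></a>\n'
    '<a href="/ok">fine</a>\n'
)
print(naive_links(sample_html))
```

On this input the pattern returns '/old' and '/ok': it reports the commented-out link, and it misses both the single-quoted link and the one whose earlier attribute value contains a `>`. Each case can be patched individually, but the comment case in particular requires knowing what context you are in, which is exactly the state a parser tracks.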
