I’ve seen other questions which will parse either all plain links, or all anchor


Asked: May 26, 2026 (2026-05-26T12:41:33+00:00)


I’ve seen other questions which will parse either all plain links, or all anchor tags from a string, but nothing that does both.

Ideally, the regular expression will be able to parse a string like this (I’m using Python):

>>> import re
>>> content = '''
...     <a href="http://www.google.com">http://www.google.com</a> Some other text.
...     And even more text! http://stackoverflow.com
...     '''
>>> links = re.findall('some-regular-expression', content)
>>> print links
[u'http://www.google.com', u'http://stackoverflow.com']

Is it possible to produce a regular expression which would not result in duplicate links being returned? Is there a better way to do this?


1 Answer

Editorial Team · Answer 1 · 2026-05-26T12:41:34+00:00

No matter what you do, it’s going to be messy. Nevertheless, a 90% solution might resemble:

r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'

Since that pattern has two groups, it will return a list of 2-tuples; to join them, you could use a list comprehension or even a map:

map(''.join, re.findall(pattern, content))
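As a rough illustration of the join step, here is a minimal sketch run against the sample string from the question. It assumes Python 3, where `map` returns a lazy iterator, so a list comprehension is used instead. It also drops `\xc2` from the character classes, on the assumption that `\xc2\xa0` was meant as the UTF-8 bytes of a non-breaking space, which is the single character `\xa0` in a Python 3 `str`. The names `PATTERN` and `extract_links` are made up for this example.

```python
import re

# First alternative: capture the text of an <a> element.
# Second alternative: capture a bare scheme://... link in plain text.
PATTERN = re.compile(
    r'<a\s[^>]*>([^<]*)</a>'
    r'|\b(\w+://[^<>\'"\t\r\n\xa0]*[^<>\'"\t\r\n\xa0 .,()])'
)

def extract_links(text):
    # findall returns one (anchor_text, bare_link) tuple per match; at most
    # one element of each tuple is non-empty, so joining them yields the link.
    return [''.join(groups) for groups in PATTERN.findall(text)]

content = '''
    <a href="http://www.google.com">http://www.google.com</a> Some other text.
    And even more text! http://stackoverflow.com
    '''

links = extract_links(content)
print(links)
```

Because the first alternative consumes the whole anchor element, the URL inside it is reported once rather than once for the attribute and once for the link text.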

If you want the href attribute of the anchor instead of the link text, the pattern gets even messier:

r'<a\s[^>]*href=[\'"]([^"\']*)[\'"][^>]*>[^<]*</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'

Alternatively, you can just let the second half of the pattern pick up the href attribute, which also removes the need for the string join:

r'\b\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()]'
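One caveat with the single-alternative pattern: when an anchor's link text is itself a URL, as in the question's sample, the same link is matched twice, once in the attribute value and once in the text. A possible way to handle that, sketched here under the assumption of Python 3.7+ (where `dict` preserves insertion order) and with `\xc2` dropped from the character classes for `str` input, is to deduplicate after matching. `LINK` and `unique_links` are illustrative names, not part of the original answer.

```python
import re

# The plain-link pattern on its own; no capture groups, so findall
# returns a flat list of matched strings.
LINK = re.compile(r'\b\w+://[^<>\'"\t\r\n\xa0]*[^<>\'"\t\r\n\xa0 .,()]')

def unique_links(text):
    # dict.fromkeys keeps only the first occurrence of each link
    # while preserving the order in which links were found.
    return list(dict.fromkeys(LINK.findall(text)))

content = '''
    <a href="http://www.google.com">http://www.google.com</a> Some other text.
    And even more text! http://stackoverflow.com
    '''

raw = LINK.findall(content)
links = unique_links(content)
print(raw)
print(links)
```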

Once you have this much in place, you can replace any found links with something that doesn’t look like a link, search for '://', and update the pattern to collect what it missed. You may also have to clean up false positives, particularly garbage at the end. (This pattern had to find links that included spaces, in plain text, so it’s particularly prone to excess greediness.)
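The review loop described above could be sketched roughly as follows: mask every link the pattern already finds, then list the lines that still contain `'://'` so they can be inspected by hand. This assumes Python 3 and the plain-link pattern with `\xc2` dropped for `str` input; the helper name `missed_links` and the `[LINK]` placeholder are invented for this example.

```python
import re

LINK = re.compile(r'\b\w+://[^<>\'"\t\r\n\xa0]*[^<>\'"\t\r\n\xa0 .,()]')

def missed_links(text, placeholder='[LINK]'):
    # Replace every link the pattern already finds with a token that
    # does not look like a link.
    masked = LINK.sub(placeholder, text)
    # Any line that still contains '://' holds something the pattern missed.
    return [line.strip() for line in masked.splitlines() if '://' in line]

sample = 'ok http://a.example/x\nodd: ://no-scheme.example\n'
leftovers = missed_links(sample)
print(leftovers)
```

Each leftover line is a candidate for widening the pattern; the masked text can also be skimmed for matches that swallowed trailing words.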

Warning: Do not rely on this for future user input, particularly when security is on the line. It is best used only for manually collecting links from existing data.
