I’ve seen other questions that parse either all plain links or all anchor tags from a string, but nothing that does both.
Ideally, the regular expression will be able to parse a string like this (I’m using Python):
>>> import re
>>> content = '''
<a href="http://www.google.com">http://www.google.com</a> Some other text.
And even more text! http://stackoverflow.com
'''
>>> links = re.findall('some-regular-expression', content)
>>> print links
[u'http://www.google.com', u'http://stackoverflow.com']
Is it possible to produce a regular expression which would not result in duplicate links being returned? Is there a better way to do this?
No matter what you do, it’s going to be messy. Nevertheless, a 90% solution might resemble:
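A minimal sketch of such a pattern, assuming only http/https links matter and anchors are never nested. The name LINK_PATTERN and the character classes are my own illustrative choices, so treat it as a starting point rather than a definitive pattern:

```python
import re

# Hypothetical pattern: the anchor alternative is tried first, so a URL that
# appears inside an <a> tag is consumed with the tag and not matched again.
LINK_PATTERN = re.compile(r'''
    <a\b[^>]*>(.*?)</a>          # group 1: text between <a ...> and </a>
    |
    (https?://[^\s<>"']+)        # group 2: a bare http(s) URL
''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

content = '''
<a href="http://www.google.com">http://www.google.com</a> Some other text.
And even more text! http://stackoverflow.com
'''

matches = LINK_PATTERN.findall(content)
print(matches)
# [('http://www.google.com', ''), ('', 'http://stackoverflow.com')]
```

Each match fills exactly one of the two groups and leaves the other as an empty string.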
Since that pattern has two groups, it will return a list of 2-tuples; to join them, you could use a list comprehension or even a map:
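Both ways of joining can be sketched as follows. The pairs list is hard-coded sample data standing in for what re.findall returns with a two-group pattern:

```python
# Sample 2-tuples of the shape re.findall returns for a two-group pattern:
# one group holds the link, the other is an empty string.
pairs = [('http://www.google.com', ''), ('', 'http://stackoverflow.com')]

# List comprehension: concatenate the two groups of each match.
links = [''.join(pair) for pair in pairs]

# Equivalent using map (wrapped in list() so it also works on Python 3).
links_via_map = list(map(''.join, pairs))

print(links)
# ['http://www.google.com', 'http://stackoverflow.com']
```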
If you want the href attribute of the anchor instead of the link text, the pattern gets even messier.
Alternatively, you can just let the second half of the pattern pick up the href attribute, which also removes the need for the string join.
Once you have this much in place, you can replace any found links with something that doesn’t look like a link, search for '://', and update the pattern to collect what it missed. You may also have to clean up false positives, particularly garbage at the end. (This pattern had to find links that included spaces, in plain text, so it’s particularly prone to excess greediness.)
Warning: Do not rely on this for future user input, particularly when security is on the line. It is best used only for manually collecting links from existing data.
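As a rough illustration of the single-group variant and the follow-up check for missed links, here is a sketch under my own assumptions. URL_PATTERN, extract_links, find_missed and the '[link]' placeholder are hypothetical names, and the pattern only knows about http/https:

```python
import re

# Hypothetical single-group pattern: the URL matcher picks up the address
# wherever it appears, and an optional tail swallows the rest of an anchor
# so the link text is not reported a second time.
URL_PATTERN = re.compile(r'''
    (https?://[^\s<>"']+)        # group 1: the URL, bare or inside href="..."
    (?:["'][^>]*>.*?</a>)?       # optionally consume the rest of the anchor
''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

def extract_links(text):
    # With a single group, findall returns plain strings, so no join is needed.
    return URL_PATTERN.findall(text)

def find_missed(text):
    # Replace everything the pattern found, then report lines that still
    # contain '://' so the pattern can be extended to cover them.
    stripped = URL_PATTERN.sub('[link]', text)
    return [line for line in stripped.splitlines() if '://' in line]

content = '''
<a href="http://www.google.com">http://www.google.com</a> Some other text.
And even more text! http://stackoverflow.com
'''

print(extract_links(content))
# ['http://www.google.com', 'http://stackoverflow.com']
print(find_missed('See ftp://example.org/file'))
# ['See ftp://example.org/file']
```

The optional tail can over-consume when a quoted bare URL in plain text is later followed by an unrelated closing anchor tag, so the same caution about false positives applies.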