I’ve been experimenting with making a simple Python web crawler, and I’m using regular expressions to find the relevant links. The site I am experimenting with is a wiki, and I want to find only the links whose URLs start with /wiki/. I may expand this to some other parts of the site as well, and so I require my code to be as dynamic as possible.
The regex I’m currently using is
<a\s+href=[\'"]\/wiki\/(.*?)[\'"].*?>
However, the matches it finds do NOT include /wiki/ in them. I was unaware of this property of regular expressions. Ideally, since I may expand this to other parts of the site (e.g. /bio/), I would like the regex to return “/wiki/[rest_of_url]” rather than simply “[rest_of_url]”. The regex
<a\s+href=[\'"]\/(.*?)[\'"].*?>
works fine (it finds URLs that start with /) because it returns “/wiki/[rest_of_url]”, but it does not ensure that /wiki appears in the text.
How can I do this?
Thanks,
Daniel Moniz
Expand the parentheses so that they include the /wiki/ portion of your regex.
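A minimal sketch of that change, using Python's re module. The sample HTML, the variable names, and the second pattern that also accepts /bio/ are made up for illustration and are not from the original post:

```python
import re

# Hypothetical sample HTML, made up for illustration.
html = (
    '<a href="/wiki/Python">Python</a> '
    "<a href='/wiki/Regex' class='x'>Regex</a> "
    '<a href="/bio/Someone">Someone</a>'
)

# The capturing group now wraps /wiki/ as well as the rest of the URL.
# Forward slashes need no escaping in Python, so "\/" is written as "/".
wiki_pattern = re.compile(r'<a\s+href=[\'"](/wiki/.*?)[\'"].*?>')

# With exactly one group, findall returns a list of that group's text.
wiki_links = wiki_pattern.findall(html)
print(wiki_links)

# An assumed extension for other site sections: a non-capturing
# alternation (?:...) keeps the prefix inside the single capturing group.
section_pattern = re.compile(r'<a\s+href=[\'"](/(?:wiki|bio)/.*?)[\'"].*?>')
section_links = section_pattern.findall(html)
print(section_links)
```

Here wiki_links should hold only the two /wiki/ URLs, each with its prefix intact, while section_links should also pick up the /bio/ URL.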
In re, parentheses allow you to break up your search results into sections (capturing groups). You’re telling the re parser to match the entire expression, but to return only the portion in parentheses. You can also use multiple sets of parentheses:
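A minimal sketch of such a pattern, assuming the first group holds the /wiki/ prefix and the second holds the remainder of the URL. The sample tag and the variable names are made up for illustration:

```python
import re

# Hypothetical anchor tag, made up for illustration.
tag = '<a href="/wiki/Main_Page" title="Main">'

# Two capturing groups: the /wiki/ prefix, then the rest of the URL.
pattern = re.compile(r'<a\s+href=[\'"](/wiki/)(.*?)[\'"].*?>')

match = pattern.search(tag)
whole = match.group()    # the entire matched text, i.e. the whole tag
parts = match.groups()   # a tuple with one entry per capturing group
url = "".join(parts)     # prefix and remainder joined back together

print(whole)
print(parts)
print(url)
```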
In this case, MatchObject.group() will return the entire matched string. If you call MatchObject.groups(), however, it will return a tuple containing /wiki/ and whatever matches the contents of the second set of parentheses. Check out the python.org documentation on regex syntax.