I’ve been experimenting with making a simple Python web crawler, and I’m using regular

Asked: May 29, 2026 (2026-05-29T19:25:49+00:00)

I’ve been experimenting with making a simple Python web crawler, and I’m using regular expressions to find the relevant links. The site I am experimenting with is a wiki, and I want to find only the links whose URLs start with /wiki/. I may expand this to some other parts of the site as well, and so I require my code to be as dynamic as possible.

The regex I’m currently using is

<a\s+href=[\'"]\/wiki\/(.*?)[\'"].*?>

However, the matches it finds do NOT include /wiki/ in them. I was unaware of this property of regular expressions. Ideally, since I may expand this to other parts of the site (e.g. /bio/), I would like the regex to return “/wiki/[rest_of_url]” rather than simply “/[rest_of_url]”. The regex

<a\s+href=[\'|"]\/(.*?)[\'"].*?>

works fine (it finds URLs that start with /) because it returns “/wiki/[rest_of_url]”, but it does not ensure that /wiki appears in the text.

How can I do this?

Thanks,

Daniel Moniz

1 Answer

Editorial Team · Answer 1 · 2026-05-29T19:25:50+00:00

Expand the parentheses so that they include the /wiki/ portion of your regex:

    <a\s+href=[\'"](\/wiki\/.*?)[\'"].*?>
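A minimal sketch of the difference, assuming the page source is already held in a string. The `html` sample and the variable names are made up for illustration, and the `(?:wiki|bio)` alternation is only one possible way to cover other sections such as /bio/:

```python
import re

# Hypothetical sample markup, not taken from the real wiki.
html = (
    '<a href="/wiki/Python">Python</a> '
    "<a href='/bio/Guido'>Guido</a> "
    '<a href="/wiki/Regex" title="Regex">Regex</a>'
)

# Original pattern: the group starts after /wiki/, so the prefix is dropped.
narrow = r'<a\s+href=[\'"]\/wiki\/(.*?)[\'"].*?>'

# Expanded group: /wiki/ is now inside the parentheses and is returned too.
wide = r'<a\s+href=[\'"](\/wiki\/.*?)[\'"].*?>'

# Assumption: a non-capturing alternation to accept several site sections.
multi = r'<a\s+href=[\'"](\/(?:wiki|bio)\/.*?)[\'"].*?>'

narrow_links = re.findall(narrow, html)
wide_links = re.findall(wide, html)
multi_links = re.findall(multi, html)

print(narrow_links)  # ['Python', 'Regex']
print(wide_links)    # ['/wiki/Python', '/wiki/Regex']
print(multi_links)   # ['/wiki/Python', '/bio/Guido', '/wiki/Regex']
```

Note that these patterns only match tags where `href` is the first attribute after `<a`, as in the question's regex.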

Edit

In re, parentheses create capturing groups, which let you break up a match into sections. The parser still has to match the entire expression, but functions such as re.findall return only the portion in parentheses when the pattern contains a group. You can also use multiple sets of parentheses:

    <a\s+href=[\'"](\/wiki\/)(.*?)[\'"].*?>

In this case, MatchObject.group() will return the entire matched string. If you call MatchObject.groups(), however, it will return a tuple containing /wiki/ and whatever the second set of parentheses matched. Check out the python.org documentation on regex syntax.
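To see group() and groups() side by side, here is a small sketch using the two-group pattern on a single tag. The `tag` string is a hypothetical sample, and joining the two groups is just one way to rebuild the full path:

```python
import re

pattern = re.compile(r'<a\s+href=[\'"](\/wiki\/)(.*?)[\'"].*?>')

# Hypothetical sample tag for illustration.
tag = '<a href="/wiki/Python" title="Python">'

match = pattern.search(tag)

whole = match.group()    # the entire text matched by the pattern
parts = match.groups()   # one tuple entry per set of parentheses
url = "".join(parts)     # rebuild the path from prefix and remainder

print(whole)  # <a href="/wiki/Python" title="Python">
print(parts)  # ('/wiki/', 'Python')
print(url)    # /wiki/Python
```

In real crawler code, `search` returns None when nothing matches, so check the result before calling group() or groups().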
