I’m writing a web crawler with the ultimate goal of creating a map of


Asked: May 23, 2026


I’m writing a web crawler with the ultimate goal of creating a map of the path the crawler has taken. I have no idea at what rate other (and most definitely better) crawlers pull down pages, but mine clocks about 2,000 pages per minute.

The crawler works on a recursive backtracking algorithm which I have limited to a depth of 15.
Furthermore, to prevent my crawler from endlessly revisiting pages, it stores the URL of each page it has visited in a list and checks that list before following the next candidate URL.

for href in tempUrl:
    ...
    if href not in urls:
        collect(href, parent, depth+1)

This method seems to become a problem by the time it has pulled down around 300,000 pages. At this point the crawler on average has been clocking 500 pages per minute.

So my question is: what other method would achieve the same functionality more efficiently?

I thought that decreasing the size of each entry might help, so instead of appending the entire URL, I append the first two and the last two characters of each URL as a string. This, however, hasn’t helped.

Is there a way I could do this with sets or something?

Thanks for the help

edit: As a side note, my program is not yet multithreaded. I figured I should resolve this bottleneck before I get into learning about threading.


1 Answer

Editorial Team · Answer 1 · 2026-05-23T10:05:45+00:00


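Perhaps you could use a set instead of a list for the URLs that you have seen so far. Checking `href not in urls` against a list compares the candidate with every stored entry, so each check gets slower as the list grows; a set does a hash lookup, which takes roughly constant time on average no matter how many URLs it holds.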
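
As a rough illustration, here is a minimal sketch of the recursive crawl with the visited URLs kept in a set. The shape of `collect(href, parent, depth)` and the depth limit of 15 are taken from the question; `crawl`, `get_links`, `visited`, and `path` are hypothetical names, and `get_links` stands in for whatever code downloads a page and extracts its links.

```python
def crawl(start_url, get_links, max_depth=15):
    """Depth-limited recursive crawl that tracks visited URLs in a set.

    get_links is a hypothetical callable: it takes a URL and returns
    an iterable of the URLs linked from that page.
    """
    visited = set()   # full URLs already seen; membership test is a hash lookup
    path = []         # (parent, url, depth) in the order pages were visited

    def collect(url, parent, depth):
        visited.add(url)
        path.append((parent, url, depth))
        if depth >= max_depth:
            return
        for href in get_links(url):
            # Same check as in the question, but against a set, not a list.
            if href not in visited:
                collect(href, url, depth + 1)

    collect(start_url, None, 0)
    return visited, path
```

Storing the full URL matters here: keeping only the first and last two characters would make many different pages look identical, so the crawler would skip pages it has never visited.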
