I’m writing a Python script that checks whether a Wikipedia link chain is valid. For instance, the chain
List of jōyō kanji > Elementary schools in Japan > Education > Knowledge
is valid because each page can be reached from the previous one just by clicking a link.
The problem is that these pages are very long. Downloading each page in full, checking whether the link is in it, and repeating that for every step takes a long time, and chains can be longer than this one.
So what I want to know is whether I can use urllib2 (or any other library) to start downloading each page and tell it to stop as soon as the link has been found, or whether this would just put more load on the CPU and make things worse.
I couldn’t find a way of doing this with urllib2, but there’s one obvious solution using raw sockets:
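Here is a minimal sketch of that idea in Python 3, not a definitive implementation. The function names, the User-Agent string, and the `href="/wiki/..."` pattern are my own assumptions for illustration. It also assumes the site is served over HTTPS on port 443, so the plain socket is wrapped with `ssl`, and that `path` is already percent-encoded. The request asks for HTTP/1.0 with `Accept-Encoding: identity`, with the aim of getting the body as plain bytes rather than chunked or compressed.

```python
import socket
import ssl


def iter_response_chunks(host, path, chunk_size=4096):
    """Yield the raw response (headers included) piece by piece as it arrives."""
    request = (
        "GET {} HTTP/1.0\r\n"
        "Host: {}\r\n"
        "User-Agent: link-chain-checker-example\r\n"  # placeholder value
        "Accept-Encoding: identity\r\n"
        "Connection: close\r\n\r\n"
    ).format(path, host).encode("ascii")  # path must already be percent-encoded
    context = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=10) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=host) as sock:
            sock.sendall(request)
            while True:
                chunk = sock.recv(chunk_size)
                if not chunk:  # server closed the connection
                    break
                yield chunk


def stream_contains(chunks, needle):
    """Return True as soon as the bytes `needle` show up in the stream."""
    tail = b""
    for chunk in chunks:
        window = tail + chunk
        if needle in window:
            return True  # stop here; later chunks are never requested
        # Keep the last len(needle) - 1 bytes so a match that straddles
        # two chunks is still found on the next iteration.
        tail = window[-(len(needle) - 1):] if len(needle) > 1 else b""
    return False


def page_links_to(host, path, target_path):
    """Check whether the page at `path` contains a link to `target_path`."""
    needle = 'href="{}"'.format(target_path).encode("utf-8")  # assumed link markup
    chunks = iter_response_chunks(host, path)
    try:
        return stream_contains(chunks, needle)
    finally:
        chunks.close()  # closes the socket even if we stopped early
```

For example, `page_links_to("en.wikipedia.org", "/wiki/Education", "/wiki/Knowledge")` would read the Education page only up to the first matching link.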
This way you stop reading as soon as you find the relevant data and avoid downloading most of the remaining content of large web pages.
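Applied to a whole chain, the same check is simply repeated for each consecutive pair of pages. A small sketch, assuming a hypothetical `links_to(source, target)` callable that performs the per-page check in whatever way you choose:

```python
def chain_is_valid(titles, links_to):
    """Return True if every page in `titles` links to the one after it."""
    # zip pairs each title with its successor; all() stops at the first
    # pair that fails, so no further pages are fetched after a broken link.
    return all(links_to(src, dst) for src, dst in zip(titles, titles[1:]))
```

Passing the checker in as an argument keeps the chain logic separate from the networking, which also makes it easy to try out with a fake checker before hitting Wikipedia.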