Because I hate clicking forth and back reading through Wikipedia articles I am trying

Asked: June 12, 2026

Because I hate clicking forth and back reading through Wikipedia articles I am trying to build a tool to create “expanded Wikipedia articles” according to the following algorithm:

Create two variables: Depth and Length.
Set a Wikipedia article as a seed page.
Parse through this article: whenever there is a link to another article, fetch the first Length sentences and include them in the original article (e.g. in brackets or otherwise highlighted).
Do this recursively up to a certain Depth, e.g. not deeper than two levels.

The result would be an article that could be read in one go without always clicking to and fro…

How would you build such a mechanism in Python? Which libraries should be used (are there any for such tasks)? Are there any helpful tutorials?

1 Answer

Editorial Team · Answer 1 · 2026-06-12T00:35:11+00:00

You can use urllib2 (urllib.request in Python 3) to request the URL. For parsing the HTML page there is a wonderful library called BeautifulSoup. One thing to consider is that when scanning Wikipedia with your crawler you need to add a User-Agent header to your request, or else Wikipedia may simply refuse to be crawled.

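 import urllib.request  # urllib2 in Python 2
 from bs4 import BeautifulSoup

 request = urllib.request.Request(page)  # page is the article URL

Add the header:

 request.add_header('User-agent', 'Mozilla/5.0 (Linux i686)')

Then load the page and give it to BeautifulSoup:

 response = urllib.request.urlopen(request)
 soup = BeautifulSoup(response, 'html.parser')
 text = soup.get_text()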
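
The question asks for only the first Length sentences of each linked article, so the extracted text has to be trimmed. Here is a minimal sketch of one way to do that, assuming a naive rule that a sentence ends at '.', '!' or '?' followed by whitespace. The helper name first_sentences is made up for illustration, and the rule will split too early on abbreviations such as "e.g.".

```python
import re

def first_sentences(text, length):
    """Return the first `length` sentences of `text` (naive splitter)."""
    # Split wherever '.', '!' or '?' is followed by whitespace.
    # Heuristic only: abbreviations and initials will be split as well.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return ' '.join(sentences[:length])

print(first_sentences("Python is a language. It is popular. It is old.", 2))
```

For anything beyond a quick experiment, a proper sentence tokenizer would probably give cleaner results than this regular expression.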

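This will give you the links to other articles in the page (Wikipedia's internal links are relative and start with /wiki/; use "^http" instead to match external links):

 import re

 for url in soup.find_all('a', attrs={'href': re.compile("^/wiki/")}):
     link = url['href']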
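
Links between Wikipedia articles appear in the page HTML as relative hrefs of the form /wiki/Title, so each href needs to be filtered and made absolute before it can be fetched. The following is a rough sketch of that step. The function name article_url and the BASE constant are assumptions for illustration, and skipping every title that contains a colon is only a heuristic for dropping namespace pages such as File: or Category: (a few real article titles contain colons too).

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org"  # assumed: English Wikipedia

def article_url(href):
    """Return an absolute article URL, or None if href is not an article link."""
    if not href.startswith("/wiki/"):
        return None  # external link, in-page anchor, or edit link
    # Drop any #section fragment so the same article is not fetched twice.
    title = href[len("/wiki/"):].split("#", 1)[0]
    if not title or ":" in title:
        return None  # heuristic: skip File:, Help:, Category:, ...
    return urljoin(BASE, "/wiki/" + title)

print(article_url("/wiki/Graph#History"))
print(article_url("/wiki/File:Example.png"))
```

Calling article_url(link) inside the loop and ignoring None results would leave only the article links worth following.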

As for the algorithm for crawling Wikipedia, what you want is called depth-limited search. Pseudocode is provided on the same page and is easy to follow.

The other functionality of these libraries is easy to look up and follow. Good luck.
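
To make the recursion concrete, here is a small sketch of depth-limited expansion as described in the question. It is not a real crawler: the PAGES dictionary and the [[Title]] link syntax are made-up stand-ins for fetched Wikipedia pages, and the names expand, get_page and first_sentences are assumptions for illustration. Depth counts how many levels of links are inlined, and pages already on the current path are not expanded again, which avoids infinite loops between articles that link to each other.

```python
import re

# Hypothetical stand-in for Wikipedia: title -> text, links written as [[Title]].
PAGES = {
    "A": "A links to [[B]]. More about A.",
    "B": "B links to [[C]]. More about B.",
    "C": "C is a leaf. More about C.",
}
LINK = re.compile(r"\[\[([^\]]+)\]\]")

def get_page(title):
    """Return the text of a page, or '' if it is unknown (replace with a real fetch)."""
    return PAGES.get(title, "")

def first_sentences(text, length):
    """Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:length])

def inline_links(text, depth, length, seen):
    """Replace each link in `text` by its title plus a bracketed summary."""
    def replace(match):
        target = match.group(1)
        page = get_page(target)
        # Stop at the depth limit, on cycles, and on missing pages.
        if depth == 0 or target in seen or not page:
            return target
        summary = first_sentences(page, length)
        # Recurse one level deeper into the summary itself.
        summary = inline_links(summary, depth - 1, length, seen | {target})
        return "%s [%s]" % (target, summary)
    return LINK.sub(replace, text)

def expand(title, depth, length):
    """Return the seed page with linked summaries inlined up to `depth` levels."""
    return inline_links(get_page(title), depth, length, {title})

print(expand("A", 2, 1))
```

With depth 0 the links are simply replaced by their titles, and each extra level inlines one more layer of bracketed summaries. In a real tool, get_page would fetch and parse the article instead of reading from a dictionary.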
