Because I hate clicking back and forth while reading through Wikipedia articles, I am trying to build a tool to create “expanded Wikipedia articles” according to the following algorithm:
- Create two variables: `Depth` and `Length`.
- Set a Wikipedia article as a seed page.
- Parse through this article: whenever there is a link to another article, fetch the first `Length` sentences and include them in the original article (e.g. in brackets or otherwise highlighted).
- Do this recursively up to a certain `Depth`, i.e. not deeper than two levels.
The result would be an article that could be read in one go without always clicking to and fro…
How would you build such a mechanism in Python? Which libraries should be used (are there any for such tasks)? Are there any helpful tutorials?
You can use urllib2 (urllib.request in Python 3) to request the URL. For parsing the HTML page there is a wonderful library called BeautifulSoup. One thing to consider: when scanning Wikipedia with your crawler, you need to add a header along with your request, or else Wikipedia will simply refuse to be crawled.
Add the header to the request first, then load the page and give it to BeautifulSoup, which will give you the links in the page.
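The header and link-extraction steps might look roughly like the sketch below. It uses `urllib.request`, the Python 3 counterpart of urllib2, together with BeautifulSoup. The `User-Agent` value, the function names `fetch_html` and `article_links`, and the `/wiki/` filter are assumptions made for illustration, not something Wikipedia or the libraries prescribe.

```python
import urllib.request

from bs4 import BeautifulSoup

# Assumed User-Agent string; replace it with one that identifies your own tool.
HEADERS = {"User-Agent": "ExpandedWikiReader/0.1 (you@example.com)"}


def fetch_html(url):
    """Download a page, sending the custom header along with the request."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def article_links(html):
    """Return the hrefs of links that appear to point to other articles."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Assumption: article links look like /wiki/Title, and a colon marks
        # namespaces such as File: or Help:, which are skipped here.
        if href.startswith("/wiki/") and ":" not in href:
            links.append(href)
    return links
```

Calling `article_links(fetch_html(url))` on a seed article would then give the list of candidate pages to expand.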
As for the algorithm for crawling Wikipedia, what you want is called Depth Limited Search. The page describing it includes pseudocode that is easy to follow.
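A minimal sketch of a depth-limited expansion, under some assumptions: `get_text` and `get_links` are hypothetical callables you would supply (for example, built on urllib and BeautifulSoup), the sentence splitter is deliberately naive, and the function names are made up for this example.

```python
import re


def first_sentences(text, length):
    """Return roughly the first `length` sentences of `text`."""
    # Naive split after ., ! or ? followed by whitespace; abbreviations
    # such as "e.g." will be split incorrectly.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:length])


def expand(title, depth, length, get_text, get_links, seen=None):
    """Depth-limited search that collects (title, summary) for linked pages.

    `get_text(title)` and `get_links(title)` are hypothetical helpers
    returning a page's plain text and the titles it links to.
    """
    seen = {title} if seen is None else seen
    if depth == 0:
        return []  # depth limit reached, do not follow links any further
    results = []
    for link in get_links(title):
        if link in seen:
            continue  # include each article only once, which also avoids cycles
        seen.add(link)
        results.append((link, first_sentences(get_text(link), length)))
        # Recurse one level deeper with the remaining depth budget.
        results.extend(expand(link, depth - 1, length, get_text, get_links, seen))
    return results
```

With `depth=2`, the seed page's links are summarized, and so are the links found on those pages, but nothing deeper. Because `seen` is shared across the whole search, a page first reached at the depth limit is not revisited if it shows up again closer to the seed.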
The other functionality of these libraries is easy to look up and follow. Good luck.