I have searched around but not gotten much help. Here’s my problem. I want to start from a portal page on Wikipedia, say Computer_science, and go to its category pages. There are some pages in that category, and there are links to subcategories. I will visit some of these pages and get only the page abstracts. Then I’ll go to the next level using the links from this category page, and so on.
I know C++/PHP/JS/Python. Which fits best here? I’d like to do this in a day. I know there’s an API, but it doesn’t seem helpful for getting content.
- I need to get pages
- Parse them to find the categories div (or the equivalent element in the raw wiki data), both to get the abstracts and to follow links to other pages.
I need suggestions for programming languages, libraries and public code if available.
I also heard Wikipedia doesn’t like bot crawlers. I am planning to get maybe 500 docs at most. Is that a problem?
Thanks a lot
There isn’t necessarily a category corresponding to a portal, although you could try looking for a category with the same name as the portal, the categories the portal page is in (using the API, you can query this with prop=categories), or the category pages linked from the portal page (prop=links&plnamespace=14).

Any of those languages would work. You could also pick Perl, Java, C#, Objective-C, or just about any other language. A list of frameworks of varying quality can be found here or here.
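
Since Python is one of the languages you listed, here is a rough sketch of the two portal queries mentioned above (prop=categories and prop=links&plnamespace=14), combined into one request using only the standard library. The function names, the User-Agent string, and the English Wikipedia endpoint are my own choices for illustration, and the sketch ignores result continuation, so treat it as a starting point rather than finished code.

```python
import json
import urllib.parse
import urllib.request

# Assumption: English Wikipedia; other wikis use the same path on their own host.
API_URL = "https://en.wikipedia.org/w/api.php"
# Placeholder: replace with something that identifies you and your script.
USER_AGENT = "PortalCategorySketch/0.1 (contact: you@example.org)"


def portal_category_params(portal_title):
    """Build one query asking for the categories a page is in (prop=categories)
    and the category pages it links to (prop=links with plnamespace=14)."""
    return {
        "action": "query",
        "format": "json",
        "titles": portal_title,
        "prop": "categories|links",
        "cllimit": "max",
        "plnamespace": "14",  # namespace 14 is the Category namespace
        "pllimit": "max",
    }


def extract_category_titles(response):
    """Collect category titles from both result lists, de-duplicated and sorted."""
    titles = set()
    for page in response.get("query", {}).get("pages", {}).values():
        for entry in page.get("categories", []) + page.get("links", []):
            titles.add(entry["title"])
    return sorted(titles)


def api_get(params):
    """Send a GET request to the API and decode the JSON reply."""
    request = urllib.request.Request(
        API_URL + "?" + urllib.parse.urlencode(params),
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

You would call it along the lines of extract_category_titles(api_get(portal_category_params("Portal:Computer science"))) and then pick the candidate categories you actually want to descend into.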
The API can certainly give you content, using prop=revisions. You can even query just the “lead” section with rvsection=0. The API can also give you the list of pages in a category with list=categorymembers and the list of categories for a page using prop=categories.

500 pages shouldn’t be an issue. If you wanted a significant proportion of the articles, you’d want to look into using a database dump instead.
See the API documentation for details.
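
To make the list=categorymembers and prop=revisions with rvsection=0 calls above a bit more concrete, here is a minimal breadth-first crawl sketch in Python. Everything named here (crawl, split_members, lead_wikitext, the User-Agent string, the one-second delay) is an assumption of mine, not something prescribed by the API. It also assumes the default JSON format, where page content sits under the "*" key of the main slot, and it does not follow result continuation, so large categories would be truncated at the per-request limit.

```python
import json
import time
import urllib.parse
import urllib.request
from collections import deque

API_URL = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia
USER_AGENT = "CategoryCrawlSketch/0.1 (contact: you@example.org)"  # placeholder


def api_get(params):
    """Run an action=query request and decode the JSON reply."""
    query = dict(params, action="query", format="json")
    request = urllib.request.Request(
        API_URL + "?" + urllib.parse.urlencode(query),
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def split_members(response):
    """Separate a categorymembers result into page titles and subcategory titles."""
    pages, subcats = [], []
    for member in response.get("query", {}).get("categorymembers", []):
        # Namespace 14 is the Category namespace, so those members are subcategories.
        (subcats if member["ns"] == 14 else pages).append(member["title"])
    return pages, subcats


def lead_wikitext(response):
    """Return the wikitext of the first revision found, or None if there is none."""
    for page in response.get("query", {}).get("pages", {}).values():
        for revision in page.get("revisions", []):
            return revision["slots"]["main"]["*"]
    return None


def crawl(root_category, max_pages=500, delay=1.0):
    """Walk the category tree breadth-first, collecting lead sections by title."""
    leads = {}
    queue = deque([root_category])
    seen = {root_category}
    while queue and len(leads) < max_pages:
        category = queue.popleft()
        members = api_get({
            "list": "categorymembers",
            "cmtitle": category,
            "cmtype": "page|subcat",  # skip files
            "cmlimit": "max",
        })
        pages, subcats = split_members(members)
        for subcat in subcats:
            if subcat not in seen:  # categories can be reached by several paths
                seen.add(subcat)
                queue.append(subcat)
        for title in pages:
            if len(leads) >= max_pages:
                break
            if title in leads:
                continue
            reply = api_get({
                "prop": "revisions",
                "titles": title,
                "rvprop": "content",
                "rvslots": "main",
                "rvsection": "0",  # section 0 is the lead
            })
            leads[title] = lead_wikitext(reply)
            time.sleep(delay)  # be polite; the delay value is arbitrary
    return leads
```

Something like crawl("Category:Computer science") would be the entry point. Note that what comes back is raw wikitext for the lead section, so you would still need to strip templates and markup if you want plain-text abstracts.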