I have searched around but not gotten much help. Here’s my problem. I want

Question

Asked: May 20, 2026

I have searched around but not gotten much help. Here’s my problem. I want to start from a portal page on Wikipedia, say Computer_science, and go to its category pages. Each category page lists some pages in that category and links to subcategories. I will visit some of these pages and get only the page abstracts. Then I will go to the next level by following the subcategory links from the category page, and so on.

I know C++, PHP, JavaScript, and Python. Which fits best here? I’d like to do this in a day. I know there’s an API, but it doesn’t seem helpful for getting content.

I need to:
1. Get pages.
2. Parse them to find the categories div (or the equivalent element in the raw wiki data), both to get the abstracts and to move on to other pages.

I need suggestions for programming languages, libraries, and public code if available.
I also heard Wikipedia doesn’t like bot crawlers. I am planning to get maybe 500 docs at most. Is that a problem?

Thanks a lot


1 Answer

Editorial Team · Answer 1 · 2026-05-20T14:19:31+00:00

There isn’t necessarily a category corresponding to a portal, although you could try looking for a category with the same name as the portal, the categories the portal page is in (using the API, you can query this with prop=categories), or the category pages linked from the portal page (prop=links&plnamespace=14).
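The three lookups above can be sketched as plain query parameters. This is only a rough illustration: it assumes the English Wikipedia endpoint, and the helper names (`categories_of_page`, `category_links_on_page`, `same_name_category`, `build_url`) are made up for this example. It builds the request URLs without sending anything.

```python
from urllib.parse import urlencode

# Assumed target: the English Wikipedia API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def categories_of_page(title):
    """Parameters asking which categories a page (e.g. a portal) is in."""
    return {
        "action": "query",
        "format": "json",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",
    }


def category_links_on_page(title):
    """Parameters asking for links from a page into namespace 14 (Category:)."""
    return {
        "action": "query",
        "format": "json",
        "prop": "links",
        "plnamespace": 14,
        "titles": title,
        "pllimit": "max",
    }


def same_name_category(portal_title):
    """Guess the category sharing the portal's name; it may not exist."""
    name = portal_title.split(":", 1)[-1]
    return "Category:" + name


def build_url(params):
    """Turn a parameter dict into a full request URL (hypothetical helper)."""
    return API_URL + "?" + urlencode(params)
```

For example, `build_url(categories_of_page("Portal:Computer science"))` gives a URL you can paste into a browser to inspect the JSON before writing any parsing code.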

Any of those languages would work. You could also pick Perl, Java, C#, Objective-C, or just about any other language. A list of frameworks of varying quality can be found here or here.

The API can certainly give you content, using prop=revisions. You can even query just the “lead” section with rvsection=0. The API can also give you the list of pages in a category with list=categorymembers and the list of categories for a page using prop=categories.
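Putting those queries together, a breadth-first walk over a category tree might look roughly like the sketch below. It is not a definitive implementation. Assumptions: the English Wikipedia endpoint, the `formatversion=2` JSON layout, a placeholder User-Agent string that you should replace with your own contact details, and hypothetical names (`api_get`, `lead_wikitext`, `category_members`, `crawl`). Note that `rvsection=0` returns raw wikitext, so templates and markup still need stripping if you want clean abstracts.

```python
import json
import time
from collections import deque
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint
USER_AGENT = "abstract-crawler-sketch/0.1 (contact: you@example.org)"  # placeholder


def api_get(params):
    """Send one action=query request and return the decoded JSON."""
    query = dict(params, action="query", format="json", formatversion=2)
    req = Request(API_URL + "?" + urlencode(query),
                  headers={"User-Agent": USER_AGENT})
    with urlopen(req) as resp:
        return json.load(resp)


def lead_wikitext(title, fetch=api_get):
    """Return the wikitext of section 0 (the lead), or None if unavailable."""
    data = fetch({"prop": "revisions", "rvprop": "content", "rvslots": "main",
                  "rvsection": 0, "titles": title})
    revisions = data["query"]["pages"][0].get("revisions")
    if not revisions:
        return None
    return revisions[0]["slots"]["main"]["content"]


def category_members(category, fetch=api_get):
    """Yield (namespace, title) for each member, following continuation."""
    params = {"list": "categorymembers", "cmtitle": category,
              "cmtype": "page|subcat", "cmlimit": "max"}
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["ns"], member["title"]
        if "continue" not in data:
            break
        params = dict(params, **data["continue"])


def crawl(root_category, max_pages=500, max_depth=2, delay=0.1, fetch=api_get):
    """Breadth-first walk from a category; returns {title: lead wikitext}."""
    abstracts = {}
    seen = {root_category}
    queue = deque([(root_category, 0)])
    while queue and len(abstracts) < max_pages:
        category, depth = queue.popleft()
        for ns, title in category_members(category, fetch):
            if ns == 14:  # subcategory: queue it for the next level
                if depth < max_depth and title not in seen:
                    seen.add(title)
                    queue.append((title, depth + 1))
            elif ns == 0 and title not in abstracts:  # article
                text = lead_wikitext(title, fetch)
                if text is not None:
                    abstracts[title] = text
                time.sleep(delay)  # small pause between article requests
                if len(abstracts) >= max_pages:
                    break
    return abstracts
```

Calling `crawl("Category:Computer science")` would start from that category and stop after 500 articles or two levels of subcategories, whichever comes first. The `fetch` parameter exists so the traversal can be tried against canned responses without touching the network.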

500 pages shouldn’t be an issue. If you wanted a significant proportion of the articles, you’d want to look into using a database dump instead.

See the API documentation for details.
