I’m using mwlib in Python to iterate over a Wikipedia dump. I want to ignore redirects and just look at page contents with the actual full title. I’ve already run mw-buildcdb, and I’m loading that:
wiki_env = wiki.makewiki(wiki_conf_file)
When I loop over wiki_env.wiki.articles(), the titles it returns appear to include redirect titles (I’ve checked a couple of samples against Wikipedia). I don’t see an accessor that skips these, and wiki_env.wiki.redirects is an empty dictionary, so I can’t use it to check which article titles are actually just redirects.
I’ve tried looking through the mwlib code, but if I use
page = wiki_env.wiki.get_page(page_title)
wiki_env.wiki.nshandler.redirect_matcher(page.rawtext)
the page.rawtext appears to have already followed the redirect (it contains the full page content, with no indication of a title mismatch). Similarly, the Article node returned by getParsedArticle() does not appear to contain the “true” title to check against.
Does anyone know how to do this? Do I need to run mw-buildcdb in a way that does not store redirects? As far as I can tell, that command just takes an input dump file and an output CDB, with no other options.
When in doubt, patch it yourself. :o)
mw-buildcdb now takes an --ignore-redirects command-line option: https://github.com/pediapress/mwlib/commit/f9198fa8288faf4893b25a6b1644e4997a8ff9b2
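
If rebuilding the CDB with a patched mw-buildcdb isn’t convenient, another route is to filter redirects out of the dump itself before indexing it. Below is a rough sketch of that idea using only the standard library, not mwlib. The names iter_non_redirect_pages, local_name and REDIRECT_RE are made up for illustration. It assumes the dump follows the usual MediaWiki XML export layout (page, title, an optional redirect element, and revision/text), and it only recognises the English #REDIRECT keyword, so localized wikis would need their own keywords added.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumption: English-language redirect keyword only.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_non_redirect_pages(dump_file):
    """Yield (title, text) for each <page> in the dump that is not a redirect."""
    for _event, elem in ET.iterparse(dump_file, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text, has_redirect_tag = None, "", False
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "redirect":
                has_redirect_tag = True
            elif name == "text":
                text = node.text or ""
        # Skip pages flagged by a <redirect> element or by their wikitext.
        if not has_redirect_tag and not REDIRECT_RE.match(text):
            yield title, text
        elem.clear()  # release the finished page to keep memory use flat


# Tiny hand-written sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Python (programming language)</title>
    <revision><text>Python is a language.</text></revision></page>
  <page><title>Python lang</title><redirect title="Python (programming language)" />
    <revision><text>#REDIRECT [[Python (programming language)]]</text></revision></page>
  <page><title>Py</title>
    <revision><text>#redirect [[Python (programming language)]]</text></revision></page>
</mediawiki>"""

titles = [title for title, _text in iter_non_redirect_pages(io.BytesIO(SAMPLE))]
print(titles)
```

For a real dump you would pass an open file (for example one opened with bz2.open in binary mode) instead of the BytesIO sample. Checking both the redirect element and the wikitext is deliberate, since older dumps may not carry the element at all.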