I would like to use Python to extract content formatted in MediaWiki markup following a particular string. For example, the 2012 U.S. presidential election article contains fields called “nominee1” and “nominee2”. Toy example:
In [1]: markup = get_wikipedia_markup('United States presidential election, 2012')
In [2]: markup
Out[2]:
u"{{
| nominee1 = '''[[Barack Obama]]'''\n
| party1 = Democratic Party (United States)\n
| home_state1 = [[Illinois]]\n
| running_mate1 = '''[[Joe Biden]]'''\n
| nominee2 = [[Mitt Romney]]\n
| party2 = Republican Party (United States)\n
| home_state2 = [[Massachusetts]]\n
| running_mate2 = [[Paul Ryan]]\n
}}"
Using the election article above as an example, I would like to extract the information that immediately follows the “nomineeN” field but comes before the invocation of the next field (demarcated by a pipe, “|”). Thus, given the example above, I would ideally like to extract “Barack Obama” and “Mitt Romney”, or at least the syntax in which they’re embedded ('''[[Barack Obama]]''' and [[Mitt Romney]]). Other regexes have extracted links from the wiki markup, but my (failed) attempts at using a positive lookbehind assertion have been something of the flavor of:
nominees = re.findall(r'(?<=\|nominee\d\=)\S+',markup)
My thinking is that it should find strings like “|nominee1=” and “|nominee2=”, with some whitespace possible between “|”, “nominee”, and “=”, and then return the content following them, like “Barack Obama” and “Mitt Romney”.
Lookbehinds aren’t necessary here. It’s much easier to use capturing groups to specify exactly what should be extracted from the string. (In fact, lookbehinds can’t work here with Python’s built-in re module, which only accepts fixed-width lookbehinds: the optional spaces make the expression variable-width.)
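To make the lookbehind limitation concrete, here is a small sketch. The `sample` string is a shortened stand-in for the markup in the question, and the second pattern is a hypothetical variant of the original attempt with optional whitespace added; neither comes from the original post.

```python
import re

# Shortened stand-in for the markup shown in the question (assumption).
sample = "| nominee1 = '''[[Barack Obama]]'''\n| nominee2 = [[Mitt Romney]]\n"

# The original fixed-width lookbehind compiles, but it only matches when
# there is no whitespace around the field name, so it finds nothing here.
strict_matches = re.findall(r"(?<=\|nominee\d=)\S+", sample)

# Allowing optional whitespace makes the lookbehind variable-width,
# which the standard-library `re` module refuses to compile.
try:
    re.compile(r"(?<=\|\s*nominee\d\s*=\s*)\S+")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

print(strict_matches)    # []
print(lookbehind_error)  # the compile error message
```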
Try this regex:
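The pattern from the original answer is not preserved here, so the following is only a sketch of the capturing-group approach it describes. It assumes each field sits on its own line, as in the sample; the names `markup`, `nominee_value`, and `raw_nominees` are illustrative.

```python
import re

# Shortened version of the sample markup from the question (assumption).
markup = (
    "{{\n"
    "| nominee1 = '''[[Barack Obama]]'''\n"
    "| party1 = Democratic Party (United States)\n"
    "| nominee2 = [[Mitt Romney]]\n"
    "| party2 = Republican Party (United States)\n"
    "}}"
)

# \|\s*nominee\d+\s*=\s*  matches the field name, tolerating whitespace
# (.+)                    captures the rest of the line as the field value
nominee_value = re.compile(r"\|\s*nominee\d+\s*=\s*(.+)")

# findall returns only the captured group, i.e. the raw wiki syntax.
raw_nominees = [value.strip() for value in nominee_value.findall(markup)]

for value in raw_nominees:
    print(value)
# '''[[Barack Obama]]'''
# [[Mitt Romney]]
```

Because `.` does not match a newline by default, the capture stops at the end of the line, which also avoids cutting a value short at a pipe inside a link such as [[Target|label]].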
Results:
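The original output is not preserved either. As an illustration only, this sketch shows what a capturing-group pattern of this kind returns on a shortened sample when the group is placed inside the link brackets, so that the plain names come back directly. The pattern and variable names are assumptions, not the answer's original code.

```python
import re

# Shortened stand-in for the markup shown in the question (assumption).
markup = "| nominee1 = '''[[Barack Obama]]'''\n| nominee2 = [[Mitt Romney]]\n"

# '*           skips optional bold markers (''')
# \[\[         matches the opening of a wiki link
# ([^\]|]+)    captures the link target up to "]]" or a "|" label separator
nominee_link = re.compile(r"\|\s*nominee\d+\s*=\s*'*\[\[([^\]|]+)")

nominee_names = nominee_link.findall(markup)
print(nominee_names)  # ['Barack Obama', 'Mitt Romney']
```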