I would like to use Python to extract content formatted in MediaWiki markup following

Asked: June 16, 2026 (2026-06-16T02:08:42+00:00)

I would like to use Python to extract content formatted in MediaWiki markup following a particular string. For example, the 2012 U.S. presidential election article contains fields called “nominee1” and “nominee2”. Toy example:

In [1]: markup = get_wikipedia_markup('United States presidential election, 2012')
In [2]: markup
Out[2]:
u"{{
| nominee1 = '''[[Barack Obama]]'''\n
| party1 = Democratic Party (United States)\n
| home_state1 = [[Illinois]]\n
| running_mate1 = '''[[Joe Biden]]'''\n
| nominee2 = [[Mitt Romney]]\n
| party2 = Republican Party (United States)\n
| home_state2 = [[Massachusetts]]\n
| running_mate2 = [[Paul Ryan]]\n
}}"

Using the election article above as an example, I would like to extract the information that immediately follows the “nomineeN” field but comes before the next field (demarcated by a pipe, “|”). Thus, given the example above, I would ideally like to extract “Barack Obama” and “Mitt Romney”, or at least the syntax in which they’re embedded ('''[[Barack Obama]]''' and [[Mitt Romney]]). Other regexes have extracted links from the wiki markup, but my (failed) attempts at using a positive lookbehind assertion have been something of the flavor of:

nominees = re.findall(r'(?<=\|nominee\d\=)\S+',markup)

My thinking is that it should find strings like “|nominee1=” and “|nominee2=” with some whitespace possible between “|”, “nominee”, “=” and then return the content following it like “Barack Obama” and “Mitt Romney”.

1 Answer

Editorial Team · Answer 1 · 2026-06-16T02:08:44+00:00

Lookbehinds aren’t necessary here; it’s much easier to use capturing groups to specify exactly what should be extracted from the string. (In fact, a lookbehind can’t work here with Python’s regular expression engine: the optional spaces make the expression variable-width, and the re module only accepts fixed-width lookbehinds.)
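To illustrate the point about lookbehinds, here is a minimal sketch. The `markup` string is a trimmed stand-in for the question's example (an assumption, not the real article text), and the second pattern is only one guess at how the question's attempt might be loosened to allow whitespace.

```python
import re

# Trimmed stand-in for the markup shown in the question.
markup = "| nominee1 = '''[[Barack Obama]]'''\n| nominee2 = [[Mitt Romney]]\n"

# The question's pattern is fixed-width, so it compiles, but it allows no
# whitespace around "nominee" or "=", so nothing in the markup matches.
strict = re.findall(r'(?<=\|nominee\d\=)\S+', markup)
print(strict)

# Allowing optional whitespace makes the lookbehind variable-width,
# which the standard re module refuses to compile.
try:
    re.compile(r'(?<=\|\s*nominee\d\s*=\s*)\S+')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)
print(lookbehind_error)
```

The first call returns an empty list and the second raises `re.error` at compile time, which is why a capturing group is the more practical route.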

Try this regex:

\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?

Results:

re.findall(r"\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?", markup)
# => ['Barack Obama', 'Mitt Romney']
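For a self-contained run, the sketch below inlines a trimmed copy of the question's markup instead of calling `get_wikipedia_markup`, which is the question's own helper and is not defined here. As a small variation on the regex above (my addition, not part of the original answer), the field number is captured as well so each nominee can be keyed by its index; the name `NOMINEE_RE` is hypothetical.

```python
import re

# Trimmed copy of the infobox fields from the question; stands in for the
# output of the question's get_wikipedia_markup() helper.
markup = (
    "{{\n"
    "| nominee1 = '''[[Barack Obama]]'''\n"
    "| party1 = Democratic Party (United States)\n"
    "| nominee2 = [[Mitt Romney]]\n"
    "| party2 = Republican Party (United States)\n"
    "}}"
)

# Same pattern as the answer, with (\d+) added to capture the field number.
# Group 1: the N in nomineeN. Group 2: the link target inside [[...]].
NOMINEE_RE = re.compile(
    r"\|\s*nominee(\d+)\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?"
)

# findall returns (number, name) tuples when the pattern has two groups.
nominees = {int(num): name for num, name in NOMINEE_RE.findall(markup)}
print(nominees)  # {1: 'Barack Obama', 2: 'Mitt Romney'}
```

Note that this only matches values written as a wiki link. A piped link such as `[[Target|label]]` would be captured whole, pipe included, and a plain-text value with no brackets would be skipped, so real infoboxes may need a looser pattern.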
