Possible Duplicate:
Matching Nested Structures With Regular Expressions in Python
I am trying to match a single group of data from a wiki page. The bit of Python code I'm using is listed below. The issue is that the match runs past the end of its own group, all the way to the last }} in the page.
def findPersonInfo(self):
    if (self.isPerson == True):
        regex = re.compile(r"{{persondata(.*)}}",re.IGNORECASE|re.UNICODE|re.DOTALL)
        result = regex.search(self._rawPage)
        if result:
            print 'Match found: ', result.group()
A sample of the wiki page content:
*[http://www.jsc.nasa.gov/Bios/htmlbios/acaba-jm.html NASA biography]
{{NASA Astronaut Group 19}}
{{Persondata
|NAME= Acaba, Joseph Michael "Joe"
|ALTERNATIVE NAMES=
|SHORT DESCRIPTION=[[Hydrogeologist]]
|DATE OF BIRTH={{Birth date and age|1967|5|17}}
|PLACE OF BIRTH=[[Inglewood, California]]
|DATE OF DEATH=
|PLACE OF DEATH=
}}
{{DEFAULTSORT:Acaba, Joseph M.}}
[[Category:1967 births]]
My current regex is returning the following string:
{{Persondata
|NAME= Acaba, Joseph Michael "Joe"
|ALTERNATIVE NAMES=
|SHORT DESCRIPTION=[[Hydrogeologist]]
|DATE OF BIRTH={{Birth date and age|1967|5|17}}
|PLACE OF BIRTH=[[Inglewood, California]]
|DATE OF DEATH=
|PLACE OF DEATH=
}}
{{DEFAULTSORT:Acaba, Joseph M.}}
I would like it to return:
{{Persondata
|NAME= Acaba, Joseph Michael "Joe"
|ALTERNATIVE NAMES=
|SHORT DESCRIPTION=[[Hydrogeologist]]
|DATE OF BIRTH={{Birth date and age|1967|5|17}}
|PLACE OF BIRTH=[[Inglewood, California]]
|DATE OF DEATH=
|PLACE OF DEATH=
}}
The tricky bit is that the pattern needs to count the other {{ opens and }} closes to know which }} to stop at, but I'm not sure how to get a regex to do that.
{{persondata(.*)}} will match greedily, i.e. it will try to return the longest match possible. You should use {{persondata(.*?)}} if you want the shortest possible match (this is usually called non-greedy or lazy matching).
However, in this case you have another }} inside your string, so the lazy pattern stops too early, at the end of the nested Birth date template. You can do something clever like {{persondata((?:.*?)}}(?:.*?))}}, which skips exactly one inner }}, but in general, as soon as you reach recursive structures (structures that nest themselves) you should abandon regular expressions and turn to proper parsing solutions.
You might want to look at pyparsing.
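If a full parser feels like overkill, the brace counting described in the question can be done by hand. The following is only a minimal sketch, not a wikitext parser: the function name extract_template and the shortened sample string are made up for illustration, and it assumes that {{ and }} are always balanced and never appear inside nowiki or comment sections.

```python
import re


def extract_template(text, name):
    """Return the first {{name ...}} template with balanced braces, or None.

    Hypothetical helper: it only counts "{{" and "}}" pairs and knows
    nothing else about wiki markup.
    """
    start = re.search(r"\{\{\s*" + re.escape(name), text, re.IGNORECASE)
    if start is None:
        return None
    begin = start.start()
    depth = 0
    # Walk over every "{{" and "}}" from the template's opening onward.
    for token in re.finditer(r"\{\{|\}\}", text[begin:]):
        depth += 1 if token.group() == "{{" else -1
        if depth == 0:
            # token.end() is relative to the slice, so add the offset back.
            return text[begin:begin + token.end()]
    return None  # ran out of text before the braces balanced


sample = """{{NASA Astronaut Group 19}}
{{Persondata
|NAME= Acaba, Joseph Michael "Joe"
|DATE OF BIRTH={{Birth date and age|1967|5|17}}
|PLACE OF DEATH=
}}
{{DEFAULTSORT:Acaba, Joseph M.}}
"""

print(extract_template(sample, "persondata"))
```

The regex is used here only to find the delimiters; the nesting itself is tracked by the depth counter, which is the part a plain regular expression cannot do in Python's re module.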