I have a file that has many xml-like elements such as this one:
<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>
I need to parse the docid and the text. What’s a suitable regular expression for that?
I’ve tried this but it doesn’t work:
collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.)*</document>'
docTuples = re.findall(docsPattern, collectionText)
EDIT: I’ve modified the pattern like this:
<document docid=(\d+)>(.*)</document>
Unfortunately, this matches the whole document rather than the individual document elements.
EDIT2: The correct implementation, based on Ahmad’s and Acorn’s answers, is:
import re

with open('documents.txt') as f:
    collectionText = f.read()
docsPattern = r'<document docid=(\d+)>(.*?)</document>'
docTuples = re.findall(docsPattern, collectionText, re.DOTALL)
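To see the difference between the two patterns, here is a minimal sketch that runs both on an inline sample instead of documents.txt. The second document element and the names sample, greedy and lazy are made up for illustration; only the first element comes from the file described above.

```python
import re

# Hypothetical two-element sample; the second record is invented for illustration.
sample = """<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>
<document docid=2>
Second Title
Some Author
CACM January, 1959
</document>"""

# Greedy: .* runs to the last </document>, so both elements collapse into one match.
greedy = re.findall(r'<document docid=(\d+)>(.*)</document>', sample, re.DOTALL)

# Non-greedy: .*? stops at the first </document>, giving one match per element.
lazy = re.findall(r'<document docid=(\d+)>(.*?)</document>', sample, re.DOTALL)

print(len(greedy))  # 1
print(len(lazy))    # 2
for docid, text in lazy:
    # Each tuple is (docid, text); the text keeps its surrounding newlines.
    print(docid, text.strip().splitlines()[0])
```

re.DOTALL is needed in both cases because, by default, . does not match a newline and the text of each element spans several lines. Note that docid comes back as a string, so convert it with int() if a number is needed.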
Your pattern is greedy, so if you have multiple <document> elements it will end up matching all of them.
You can make it non-greedy by using .*?, which means “match zero or more characters, as few as possible.” The updated pattern is: