I have a file that has many xml-like elements such as this one: <document

Question

Asked: May 26, 2026

I have a file that has many xml-like elements such as this one:

<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>

I need to parse the docid and the text. What’s a suitable regular expression for that?

I’ve tried this but it doesn’t work:

collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.)*</document>'
docTuples = re.findall(docsPattern, collectionText)

EDIT: I’ve modified the pattern like this:

<document docid=(\d+)>(.*)</document>

Unfortunately, this matches the whole file in one go rather than the individual document elements.

EDIT2: The correct implementation, based on Ahmad’s and Acorn’s answers, is:

import re

with open('documents.txt') as f:
    collectionText = f.read()
docsPattern = r'<document docid=(\d+)>(.*?)</document>'
docTuples = re.findall(docsPattern, collectionText, re.DOTALL)


1 Answer

Editorial Team · Answer 1 · 2026-05-26T19:24:35+00:00


Your pattern is greedy, so if you have multiple <document> elements it will end up matching all of them.

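You can make it non-greedy by using .*?, which means “match zero or more characters, as few as possible.” The updated pattern is:

<document docid=(\d+)>(.*?)</document>

Because each element spans several lines, also pass re.DOTALL to re.findall so that . matches newline characters. Without it, .*? cannot cross a line break and the pattern will not match multi-line elements.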
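To see the difference between the greedy and non-greedy forms, here is a minimal sketch. The sample string is an assumption made up for illustration (the second record in particular is hypothetical); it only mirrors the layout shown in the question.

```python
import re

# Hypothetical two-element sample in the same layout as the question's file.
# The second record is invented purely for illustration.
sample = """<document docid=1>
Preliminary Report-International Algebraic Language
</document>
<document docid=2>
Second record
</document>"""

# Greedy: .* runs to the LAST </document>, so both elements collapse into one match.
greedy = re.findall(r'<document docid=(\d+)>(.*)</document>', sample, re.DOTALL)

# Non-greedy: .*? stops at the FIRST </document> after each opening tag.
lazy = re.findall(r'<document docid=(\d+)>(.*?)</document>', sample, re.DOTALL)

print(len(greedy))  # 1
print(len(lazy))    # 2

# Optional: map each docid to its text with surrounding newlines stripped.
docs = {int(docid): text.strip() for docid, text in lazy}
print(docs[2])  # Second record
```

In both calls re.DOTALL is what lets . cross line breaks; the only change between them is the ? after .*.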
