I’m trying to parse an XML document. The document has HTML-like formatting embedded,

Question

Asked: June 17, 2026 (2026-06-17T21:21:47+00:00)

I’m trying to parse an XML document. The document has HTML-like formatting embedded, for example:

<p>This is a paragraph
 <em>with some <b>extra</b> formatting</em>
 scattered throughout.
</p>

So far I’ve used

import xml.etree.cElementTree as xmlTree

to handle the XML document, but I’m not sure it provides the functionality I’m looking for. How would I go about handling the text nodes here?

Also, is there a way to find the closing tags in a document?

Thanks!

1 Answer

Editorial Team · Answer 1 · 2026-06-17T21:21:49+00:00

If your XML document fits in memory, consider Beautiful Soup, which gives you much cleaner access to the document. You can select a node and work directly with its children; every node has a .next attribute that steps to the next item in document order, so from a tag it gives you the text up to the next tag.

So:

>>> import BeautifulSoup
>>> b = BeautifulSoup.BeautifulStoneSoup("<p>This is a paragraph <em>with some <b>extra</b> formatting</em> scattered throughout.</p>")

>>> b.find('p')
<p>This is a paragraph <em>with some <b>extra</b> formatting</em> scattered throughout.</p>

>>> b.find('p').next
u'This is a paragraph '

>>> b.find('p').next.next
<em>with some <b>extra</b> formatting</em>

That, or something like it, should solve your problem.
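If you would rather stay with the standard library module the question already uses, here is a minimal sketch of how ElementTree exposes mixed content. It is an alternative to the Beautiful Soup session above, not part of it. The text before an element's first child is in .text, and the text that follows an element's closing tag is in that element's .tail. The helper name text_pieces and the SAMPLE string are made up for illustration. Note that xml.etree.cElementTree was removed in Python 3.9; xml.etree.ElementTree is the module to import now.

```python
import xml.etree.ElementTree as ET

# The paragraph from the question, on one line for simplicity.
SAMPLE = (
    "<p>This is a paragraph <em>with some <b>extra</b> formatting</em>"
    " scattered throughout.</p>"
)


def text_pieces(elem):
    """Yield (tag, slot, text) for every text node under elem, in document order.

    slot is "text" for text directly after the opening tag of `tag`,
    and "tail" for text directly after the closing tag of `tag`.
    """
    if elem.text:
        yield (elem.tag, "text", elem.text)
    for child in elem:
        yield from text_pieces(child)
        if child.tail:
            yield (child.tag, "tail", child.tail)


root = ET.fromstring(SAMPLE)
pieces = list(text_pieces(root))

# itertext() gives the same text nodes without the bookkeeping.
full_text = "".join(root.itertext())

for piece in pieces:
    print(piece)
```

The "tail" entries are the part that is easy to miss: " formatting" belongs to the b element's tail, not to em's text.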

If it doesn’t fit in memory, you’ll need an event-driven (SAX-style) parser, which is a bit more work. To do that, use from xml.parsers import expat and assign handler functions for the opening and closing of tags. The end-element handler is also how you find the closing tags.
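A rough sketch of the expat approach, assuming Python 3 and the sample paragraph from the question. The function name collect_events and the (kind, value) tuples are invented for this example; the handler attributes (StartElementHandler, EndElementHandler, CharacterDataHandler) and buffer_text are part of the standard xml.parsers.expat module.

```python
from xml.parsers import expat

SAMPLE = (
    "<p>This is a paragraph <em>with some <b>extra</b> formatting</em>"
    " scattered throughout.</p>"
)


def collect_events(xml_text):
    """Parse xml_text with expat and return a list of (kind, value) events."""
    events = []

    def on_start(name, attrs):
        # Called at every opening tag.
        events.append(("start", name))

    def on_end(name):
        # Called at every closing tag.
        events.append(("end", name))

    def on_text(data):
        # Called for text nodes; whitespace-only runs are skipped here.
        if data.strip():
            events.append(("text", data.strip()))

    parser = expat.ParserCreate()
    # Ask expat to merge adjacent text chunks before calling on_text.
    parser.buffer_text = True
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return events


events = collect_events(SAMPLE)
for event in events:
    print(event)
```

For a document that really does not fit in memory, you would call parser.ParseFile() on an open binary file, or feed parser.Parse() in chunks, and do the work inside the handlers instead of collecting everything in a list.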
