I’m trying to parse an XML document. The document has HTML-like formatting embedded, for example:
<p>This is a paragraph
<em>with some <b>extra</b> formatting</em>
scattered throughout.
</p>
So far I’ve used
import xml.etree.cElementTree as xmlTree
to handle the XML document, but I am not sure whether it provides the functionality I’m looking for. How would I go about handling the text nodes here?
Also, is there a way to find the closing tags in a document?
Thanks!
If your XML document fits in memory, you should use Beautiful Soup, which will give you much cleaner access to the document. You’ll be able to select a node and interact with its children directly; every node has a
.next
attribute (.next_element in Beautiful Soup 4), which moves to whatever comes next in the document, so you can step through the text up to the next tag. So:
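A minimal sketch of that approach, assuming Beautiful Soup 4 (the bs4 package) and its built-in "html.parser" backend. The walk helper and the sample string are made up for illustration and are not part of the library:

```python
from bs4 import BeautifulSoup, NavigableString

# Sample document from the question (illustrative input).
doc = """<p>This is a paragraph
<em>with some <b>extra</b> formatting</em>
scattered throughout.
</p>"""

# "html.parser" ships with Python; a dedicated XML backend would need lxml.
soup = BeautifulSoup(doc, "html.parser")


def walk(node, depth=0):
    """Return (depth, kind, value) for every tag and non-blank text node,
    in document order. Hypothetical helper, not a Beautiful Soup API."""
    events = []
    for child in node.children:
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:  # skip whitespace-only text nodes
                events.append((depth, "text", text))
        else:
            events.append((depth, "tag", child.name))
            events.extend(walk(child, depth + 1))
    return events


events = walk(soup)
for depth, kind, value in events:
    print("  " * depth + f"{kind}: {value}")

# .next_element steps to whatever was parsed next: here, the text inside <b>.
print(repr(soup.b.next_element))
```

Text that sits between tags shows up as its own node, so the mixed content inside the paragraph is visited in the order it appears.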
That, or something like it, should solve your problem.
If it doesn’t fit in memory, you’ll need an event-driven, SAX-style parser, which is a bit more work. To do that, you use
from xml.parsers import expat
and write handlers for the opening and closing of tags.
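A rough sketch of the handler approach using the standard library's expat bindings. The handler names and the events list are made up for illustration; the end-element handler is also where closing tags get reported:

```python
from xml.parsers import expat

# Sample document from the question (illustrative input).
doc = """<p>This is a paragraph
<em>with some <b>extra</b> formatting</em>
scattered throughout.
</p>"""

events = []


def start_element(name, attrs):
    # Called for every opening tag; attrs is a dict of its attributes.
    events.append(("start", name))


def end_element(name):
    # Called for every closing tag.
    events.append(("end", name))


def char_data(data):
    # Called for text between tags; ignore whitespace-only chunks.
    text = data.strip()
    if text:
        events.append(("text", text))


parser = expat.ParserCreate()
parser.buffer_text = True  # deliver adjacent text as one chunk where possible
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data
parser.Parse(doc, True)  # True marks this as the final chunk of input

for kind, value in events:
    print(kind, value)
```

For a file too large to load as a string, parser.ParseFile() accepts a file object opened in binary mode and reads it in chunks, so the handlers fire as the document streams through.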