I am trying to get the whole content between an opening XML tag and its closing counterpart.
Getting the content in simple cases like title below is easy, but how can I get the whole content between the tags if mixed content is used and I want to preserve the inner tags?
<?xml version="1.0" encoding="UTF-8"?>
<review>
<title>Some testing stuff</title>
<text sometimes="attribute">Some text with <extradata>data</extradata> in it.
It spans <sometag>multiple lines: <tag>one</tag>, <tag>two</tag>
or more</sometag>.</text>
</review>
What I want is the content between the two text tags, including any tags: Some text with <extradata>data</extradata> in it. It spans <sometag>multiple lines: <tag>one</tag>, <tag>two</tag> or more</sometag>.
For now I use regular expressions, but it gets kind of messy and I don’t like this approach. I lean towards an XML-parser-based solution. I looked over minidom, etree, lxml and BeautifulSoup but couldn’t find a solution for this case (whole content, including inner tags).
The trick here is that t is iterable, and when iterated, yields all child nodes. Because etree avoids text nodes, you also need to recover the text before the first child tag, with t.text. Or:
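A minimal sketch of the idea, assuming the standard-library xml.etree.ElementTree rather than lxml, and assuming t is the text element from the question's sample. The helper name inner_xml is hypothetical, and the XML declaration is left out of the sample string for brevity. It relies on tostring() including each child's tail text, so only the leading t.text has to be added by hand:

```python
import xml.etree.ElementTree as ET

# Sample document from the question (XML declaration omitted).
SAMPLE = """<review>
<title>Some testing stuff</title>
<text sometimes="attribute">Some text with <extradata>data</extradata> in it.
It spans <sometag>multiple lines: <tag>one</tag>, <tag>two</tag>
or more</sometag>.</text>
</review>"""


def inner_xml(element):
    """Return everything between an element's opening and closing tags.

    Hypothetical helper: joins the leading text with each serialized child.
    """
    # Text before the first child tag; None when the element starts with a tag.
    parts = [element.text or ""]
    # Iterating an element yields its direct children. tostring() serializes
    # each child together with its tail (the text that follows it).
    for child in element:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


root = ET.fromstring(SAMPLE)
t = root.find("text")
print(inner_xml(t))
```

Line breaks inside the element are preserved as they appear in the source, so collapse whitespace afterwards if a single-line result is wanted.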