I have an xml file like this:
<data>
<entry>
<word>ABC</word> (this)
</entry>
<entry>
<word>ABC</word> [not this]
</entry>
</data>
I want to select the nodes whose descendants include "(", and move the (.*) part to the text of <entry>. That is:
<data>
<entry>
(this)
<word>ABC</word>
</entry>
<entry>
<word>ABC</word> [not this]
</entry>
</data>
I'm using lxml, and I tried:
import lxml.etree as ET
data = ET.parse('sample.xml')
for entry in data.iter('entry'):
    A = entry.xpath('.//*[text() = ".*(.*?)"]')
But it doesn't work. The "(" can appear either in the tail of a node or in the text of a node.
There are a few problems here:
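First, you're trying to use XPath to do regex matching, but you're using =, which only tests for string equality. Your regex is also improperly formatted: the parentheses aren't escaped, so they act as a capture group instead of matching literal "(" and ")". To actually do regex matching in XPath with lxml, you'd need the EXSLT re:test() function, with the re prefix bound to the http://exslt.org/regular-expressions namespace.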
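Something along these lines should do it. This is only a sketch: the inline sample document and the names NS and hits are mine, made up so that one entry has the parentheses in the text of <word> and the other has them in its tail.

```python
import lxml.etree as ET

# Namespace of the EXSLT regular-expression functions that lxml's XPath supports.
NS = {'re': 'http://exslt.org/regular-expressions'}

# Made-up sample: first entry has "(this)" in the text, second has it in the tail.
root = ET.fromstring(
    '<data>'
    '<entry><word>ABC (this)</word></entry>'
    '<entry><word>ABC</word> (this)</entry>'
    '</data>'
)

# re:test() is true when the regex matches the node's own text.
# The parentheses are escaped so they match literally.
hits = [
    entry.xpath(r'.//*[re:test(text(), "\(.*\)")]', namespaces=NS)
    for entry in root.iter('entry')
]

for found in hits:
    print([el.tag for el in found])
```

Only the first entry gives a match here; the second one, where "(this)" sits in the tail of <word>, comes back empty.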
Unfortunately, this won't actually work for you because you want text in the tail, which isn't included in text(). The lxml docs make it look like this should be included in string(), but I tried that and it doesn't work either. I can't find any way to do this using XPath and lxml.
So, the way to do it is with more Python and less XPath: walk the descendants of each entry, check both the text and the tail of every node with Python's re module, and move whatever matches into the text of the entry.
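Roughly like this. Treat it as a sketch under a few assumptions: the names PAREN and move_parens_to_entry are mine, I match non-greedily so that several parenthesized groups in one entry are picked up separately, and the whitespace left behind in the tail is not cleaned up.

```python
import re
import lxml.etree as ET

# Literal "(", anything (non-greedy), literal ")".
PAREN = re.compile(r'\(.*?\)')

def move_parens_to_entry(root):
    """Move every "(...)" found in an entry's descendants to the entry's text."""
    for entry in root.iter('entry'):
        found = []
        for node in entry.iterdescendants():
            if not isinstance(node.tag, str):
                continue  # skip comments and processing instructions
            # The parentheses may be in the node's text or in its tail.
            for attr in ('text', 'tail'):
                value = getattr(node, attr)
                if value and PAREN.search(value):
                    found.extend(PAREN.findall(value))
                    setattr(node, attr, PAREN.sub('', value))
        if found:
            # Prepend-style placement: entry.text comes before the first child.
            entry.text = (entry.text or '') + ' '.join(found)

root = ET.fromstring(
    '<data>'
    '<entry><word>ABC</word> (this)</entry>'
    '<entry><word>ABC</word> [not this]</entry>'
    '</data>'
)
move_parens_to_entry(root)
print(ET.tostring(root, encoding='unicode'))
```

With a real file you'd call it as move_parens_to_entry(data.getroot()) after ET.parse('sample.xml'), then write the tree back out with data.write(...).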
In either case, a happy side effect is that this regex is having a totally rockin’ good time in the snow:
.*\(.*\).*.