I have some XML that is generated by a script that may or may not have empty elements. I was told that now we cannot have empty elements in the XML. Here is an example:
<customer>
  <govId>
    <id>@</id>
    <idType>SSN</idType>
    <issueDate/>
    <expireDate/>
    <dob/>
    <state/>
    <county/>
    <country/>
  </govId>
  <govId>
    <id/>
    <idType/>
    <issueDate/>
    <expireDate/>
    <dob/>
    <state/>
    <county/>
    <country/>
  </govId>
</customer>
The output should look like this:
<customer>
  <govId>
    <id>@</id>
    <idType>SSN</idType>
  </govId>
</customer>
I need to remove all the empty elements. You’ll note that my code took out the empty elements in the first “govId” sub-element, but didn’t take out anything in the second. I am using lxml.objectify at the moment.
Here is basically what I am doing:
root = objectify.fromstring(xml)
for customer in root.customers.iterchildren():
    for e in customer.govId.iterchildren():
        if not e.text:
            customer.govId.remove(e)
Does anyone know of a way to do this with lxml objectify or is there an easier way period? I would also like to remove the second “govId” element in its entirety if all its elements are empty.
First of all, the problem with your code is that you are iterating over customers, but not over govIds. On the third line you take the first govId for every customer and iterate over its children. So you’d need another for loop for the code to work as you intended.
This small sentence at the end of your question then makes the problem quite a bit more complex: I would also like to remove the second “govId” element in its entirety if all its elements are empty.
This means that, unless you want to hard-code a check for just one level of nesting, you need to recursively check whether an element and its children are empty. Like this, for example:
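A minimal sketch of such a recursive check, assuming the function name recursively_empty and the e.text / e.getchildren() calls that this answer refers to further down. The small document at the end is made up purely to exercise the function:

```python
from lxml import etree

def recursively_empty(e):
    # An element with any text content is not empty.
    if e.text:
        return False
    # Otherwise it is empty only if every child is empty too.
    # all() of an empty list is True, so childless elements count as empty.
    return all([recursively_empty(c) for c in e.getchildren()])

# Illustrative input without whitespace between the tags.
doc = etree.fromstring(
    "<govId><id/><idType>SSN</idType><state><code/></state></govId>"
)
for child in doc:
    print(child.tag, recursively_empty(child))
```

Note that <state> is reported as empty even though it has a child, because that child is itself empty.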
Note: Python 2.5+ because of the use of the all() builtin.
You can then change your code to something like this to remove all the elements in the document that are empty all the way down.
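One way such a removal pass could look, as a sketch rather than the exact code: remove_empty() is a hypothetical helper name, the XML is a shortened version of the example from the question, and the parser is assumed to drop whitespace-only text (otherwise the indentation between tags counts as text and nothing nested is considered empty).

```python
from lxml import etree

def recursively_empty(e):
    if e.text:
        return False
    return all([recursively_empty(c) for c in e.getchildren()])

def remove_empty(root):
    # Hypothetical helper. Snapshot the elements first so that removing
    # nodes cannot disturb the iteration.
    for elem in list(root.iter()):
        parent = elem.getparent()
        # The root element has no parent and is never removed.
        if parent is not None and recursively_empty(elem):
            parent.remove(elem)

xml = """
<customer>
  <govId>
    <id>@</id>
    <idType>SSN</idType>
    <issueDate/>
    <state/>
  </govId>
  <govId>
    <id/>
    <idType/>
    <issueDate/>
    <state/>
  </govId>
</customer>
"""

# Assumption: whitespace-only text nodes are discarded while parsing.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml, parser)
remove_empty(root)
print(etree.tostring(root, pretty_print=True).decode())
```

Because the check is recursive, the second govId is removed as a whole instead of being left behind as an empty shell.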
Sample output:
One thing you might want to do is refine the condition if e.text: in the recursive function. Currently this will consider None and the empty string as empty, but not whitespace such as spaces and newlines. Use str.strip() if that’s part of your definition of “empty”.
Edit: As pointed out by @Dave, the recursive function could be improved by using a generator expression:
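A sketch of the generator-expression form of the recursive check. The only change from a list-based version is dropping the square brackets inside all(); the two-element document is illustrative only:

```python
from lxml import etree

def recursively_empty(e):
    if e.text:
        return False
    # Generator expression: children are checked one at a time, and
    # all() stops at the first child that turns out not to be empty.
    return all(recursively_empty(c) for c in e.getchildren())

doc = etree.fromstring("<a><b>text</b><c/></a>")
print(recursively_empty(doc), recursively_empty(doc[1]))
```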
This will not evaluate recursively_empty(c) for all the children at once, but evaluate it for each one lazily. Since all() will stop iteration upon the first False element, this could mean a significant performance improvement.
Edit 2: The expression can be further optimized by using e.iterchildren() instead of e.getchildren(). This works with the lxml etree API and the objectify API.
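Putting the refinements together, a sketch that uses e.iterchildren() and also applies the str.strip() idea mentioned earlier, so that whitespace-only text counts as empty. Whether whitespace should count as empty is an assumption about your data, and the sample strings are made up:

```python
from lxml import etree

def recursively_empty(e):
    # Treat None, "" and whitespace-only text as empty.
    if e.text and e.text.strip():
        return False
    # iterchildren() yields children lazily instead of building a list.
    return all(recursively_empty(c) for c in e.iterchildren())

# Parsed with the default parser, so the indentation is kept as text.
blank = etree.fromstring("<govId>\n  <id/>\n  <state> </state>\n</govId>")
filled = etree.fromstring("<govId>\n  <id>@</id>\n</govId>")
print(recursively_empty(blank), recursively_empty(filled))
```

With this variant the parser no longer has to strip blank text for nested elements to be recognised as empty.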