I’d like to parse and compare 2 XML files with the Python Etree parser as follows:
I have 2 XML files with loads of data. One is in English (the source file), the other one the corresponding French translation (the target file).
E.g.:
source file:
<AB>
<CD/>
<EF>
<GH>
<id>123</id>
<IJ>xyz</IJ>
<KL>DOG</KL>
<MN>dogs/dog</MN>
some more tags and info on same level
<metadata>
<entry>
<cl>Translation</cl>
<cl>English:dog/dogs</cl>
</entry>
<entry>
<string>blabla</string>
<string>blabla</string>
</entry>
some more strings and entries
</metadata>
</GH>
</EF>
<stuff/>
<morestuff/>
<otherstuff/>
<stuffstuff/>
<blubb/>
<bla/>
<blubbbla>8</blubbla>
</AB>
The target file looks exactly the same, but has no text at some places:
<MN>chiens/chien</MN>
some more tags and info on same level
<metadata>
<entry>
<cl>Translation</cl>
<cl></cl>
</entry>
The French target file has an empty cross-language reference where I’d like to put in the information from the English source file whenever the 2 macros have the same ID.
I already wrote some code in which I replaced the string tag name with a unique tag name in order to identify the cross-language reference. Now I want to compare the 2 files and if 2 macros have the same ID, exchange the empty reference in the French file with the info from the English file. I was trying out the minidom parser before but got stuck and would like to try Etree now. I have hardly any knowledge about programming and find this very hard.
Here is the code I have so far:
macros = ElementTree.parse(english)
for tag in macros.getchildren('macro'):
id_ = tag.find('id')
data = tag.find('cl')
id_dict[id_.text] = data.text
macros = ElementTree.parse(french)
for tag in macros.getchildren('macro'):
id_ = tag.find('id')
target = tag.find('cl')
if target.text.strip() == '':
target.text = id_dict[id_.text]
print (ElementTree.tostring(macros))
I am more than clueless and reading other posts on this confuses me even more. I’d appreciate it very much if someone could enlighten me 🙂
There is probably more details to be clarified. Here is the sample with some debug prints that shows the idea. It assumes that both files have exactly the same structure, and that you want to go only one level below the root:
For more complex structures of the file, the loop through the direct children of the node can be replaced by iteration through all the elements of the tree. If the structures of the files are exactly the same, the following code will work:
However, it works only in cases when both files have exactly the same structure. Let’s follow the algorithm that would be used when the task is to be done manually. Firstly, we need to find the French translation that is empty. Then it should be replaced by the English translation from the GH element with the same identification. A subset of XPath expressions is used in the case when searching for the elements: