I am trying to parse huge XML files (20 MB to 3 GB). The files are samples coming from different instruments. What I am doing is finding the necessary element information in each file and inserting it into a database (Django).
Here is a small part of a sample file. The namespace exists in all files. An interesting feature of the files is that they have more node attributes than text.
<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" accession="plgs_example" version="1.1.0" id="urn:lsid:proteios.org:mzml.plgs_example">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="4">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <analyzer order="2">
        <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
      </analyzer>
      <analyzer order="3">
        <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
      </analyzer>
      <detector order="4">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
A small but complete file is here
What I have done so far is use findall for every element of interest.
import xml.etree.ElementTree as ET

tree = ET.parse('plgs_example.mzML')
root = tree.getroot()
NS = "{http://psi.hupo.org/ms/mzml}"
s = tree.findall('.//' + NS + 'instrumentConfiguration')
for ins in s:
    # Print the id attribute of each instrumentConfiguration element
    print(ins.attrib["id"])
How can I access all children/grandchildren of each instrumentConfiguration element in s?
s=tree.findall('.//{http://psi.hupo.org/ms/mzml}instrumentConfiguration')
Example of what I want:
InstrumentConfiguration
-----------------------
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector
Is there an efficient way of parsing element/subelement/subelement when a namespace exists? Or do I have to use find/findall every time to access a particular element in the tree with a namespace? This is just a small example; I have to parse a more complex element hierarchy.
Any suggestions?
Edit
I didn't get a correct answer, so I had to edit once more.
Here's a script that parses one million <instrumentConfiguration/> elements (967 MB file) in 40 seconds (on my machine) without consuming a large amount of memory. The throughput is 24 MB/s. The cElementTree page (2005) reports 47 MB/s.
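The script itself is not preserved in this text, so what follows is only a minimal sketch of the kind of streaming parse being described. It assumes the standard-library `xml.etree.ElementTree.iterparse` (in Python 3 it uses the C accelerator automatically, so a separate `cElementTree` import is not needed). The helper names `local` and `iter_instrument_configs` and the inline `SAMPLE` document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"

# Trimmed-down copy of the sample from the question (illustration only).
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfiguration id="QTOF">
    <cvParam cvRef="MS" accession="MS:1000189" name="Q-Tof ultima"/>
    <componentList count="4">
      <source order="1">
        <cvParam cvRef="MS" accession="MS:1000398" name="nanoelectrospray"/>
      </source>
      <analyzer order="2">
        <cvParam cvRef="MS" accession="MS:1000081" name="quadrupole"/>
      </analyzer>
      <analyzer order="3">
        <cvParam cvRef="MS" accession="MS:1000084" name="time-of-flight"/>
      </analyzer>
      <detector order="4">
        <cvParam cvRef="MS" accession="MS:1000114" name="microchannel plate detector"/>
      </detector>
    </componentList>
  </instrumentConfiguration>
</mzML>"""


def local(tag):
    """Strip the '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_instrument_configs(source):
    """Yield (id, cvParam name, [(component tag, name), ...]) per element."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == NS + "instrumentConfiguration":
            # Fragile assumption: the first two children are <cvParam/> and
            # <componentList/>, and each component's first child is a <cvParam/>.
            cv_param, component_list = elem[0], elem[1]
            components = [(local(c.tag), c[0].get("name")) for c in component_list]
            yield elem.get("id"), cv_param.get("name"), components
            root.clear()  # detach processed elements so memory use stays flat


for ident, param, components in iter_instrument_configs(io.BytesIO(SAMPLE)):
    print("InstrumentConfiguration")
    print("-----------------------")
    print("Id:", ident)
    print("Parameter1:", param)
    for kind, name in components:
        print(kind + ":", name)
```

Passing a file name instead of the `io.BytesIO` object should work the same way, since `iterparse` accepts either. Only plain strings are yielded, so nothing keeps a reference to an element after `root.clear()` runs.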
Note: The code is fragile: it assumes that the first two children of <instrumentConfiguration/> are <cvParam/> and <componentList/> and that all values are available as tag names or attributes.
On performance
ElementTree 1.3 is ~6 times slower than cElementTree 1.0.6 in this case.
If you replace root.clear() by elem.clear() then the code is ~10% faster but uses ~10 times more memory. lxml.etree works with the elem.clear() variant; the performance is the same as for cElementTree, but it consumes 20 (root.clear()) / 2 (elem.clear()) times as much memory (500 MB).
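One likely reason for the memory difference with the standard-library parser is that elem.clear() empties an element but leaves it attached to the root, so the root keeps accumulating empty children, whereas root.clear() detaches them. The sketch below, with a made-up helper `children_left_on_root` and a synthetic document, counts what the root still holds under each strategy. It does not measure memory or speed.

```python
import io
import xml.etree.ElementTree as ET


def children_left_on_root(xml_bytes, tag, clear_root):
    """Stream-parse xml_bytes and return how many children the root still holds."""
    context = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    _, root = next(context)  # first event: start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            if clear_root:
                root.clear()  # detaches every child seen so far from the root
            else:
                elem.clear()  # empties the child; the root still references it
    return len(root)


# Synthetic document: 1000 <item> children, each with one sub-element.
doc = b"<root>" + b"<item><sub/></item>" * 1000 + b"</root>"

print(children_left_on_root(doc, "item", clear_root=True))   # 0
print(children_left_on_root(doc, "item", clear_root=False))  # 1000
```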