I have a large XML file (about 84 MB) which is in this form:
<books>
<book>...</book>
....
<book>...</book>
</books>
My goal is to extract every single book and get its properties. I tried to parse it (as I did with other XML files) as follows:
from xml.dom.minidom import parse, parseString
fd = "myfile.xml"
parser = parse(fd)
## other python code here
but the code seems to fail in the parse instruction. Why is this happening and how can I solve this?
I should point out that the file may contain Greek, Spanish and Arabic characters.
This is the output I got in IPython:
In [2]: fd = "myfile.xml"
In [3]: parser = parse(fd)
Killed
I would like to point out that the computer freezes during the execution, so this may be related to memory consumption as stated below.
I would strongly recommend using a SAX parser here. I wouldn’t recommend using minidom on any XML document larger than a few megabytes; I’ve seen it use about 400MB of RAM reading in an XML document that was about 10MB in size. I suspect the problems you are having are being caused by minidom requesting too much memory.
Python comes with an XML SAX parser. To use it, do something like the following.
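The snippet below is a minimal sketch of that usage, assuming Python 3. The class name `BookCounter` and the helper `count_books` are hypothetical and exist only for illustration; `xml.sax.parse` and `ContentHandler` are the standard-library pieces.

```python
from xml.sax import parse
from xml.sax.handler import ContentHandler


class BookCounter(ContentHandler):
    """Hypothetical handler that counts <book> elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for every opening tag the parser encounters.
        if name == "book":
            self.count += 1


def count_books(source):
    """Parse source (a file name or file-like object) and return the <book> count."""
    handler = BookCounter()
    parse(source, handler)
    return handler.count
```

With the file from the question this would be called as `count_books("myfile.xml")`. The parser reads the file incrementally and keeps only the counter in memory.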
Your ContentHandler subclass will override various methods in ContentHandler (such as startElement, startElementNS, endElement, endElementNS or characters). These handle events generated by the SAX parser as it reads your XML document in.
SAX is a more ‘low-level’ way to handle XML than DOM; in addition to pulling out the relevant data from the document, your ContentHandler will need to do work keeping track of what elements it is currently inside. On the upside, however, as SAX parsers don’t keep the whole document in memory, they can handle XML documents of potentially any size, including those larger than yours.
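To illustrate the bookkeeping involved, here is a rough sketch of a handler that tracks which element it is inside and builds one dict per book. It assumes Python 3 and that each book's properties are direct child elements containing text; the question does not show what is inside a book, so that layout, the class name `BookHandler` and the `on_book` callback are all assumptions.

```python
from xml.sax import parse
from xml.sax.handler import ContentHandler


class BookHandler(ContentHandler):
    """Hypothetical handler: builds one dict per <book> and passes it to a callback."""

    def __init__(self, on_book):
        super().__init__()
        self.on_book = on_book  # called with each finished book dict
        self.book = None        # dict for the <book> being read, None outside one
        self.field = None       # name of the direct child element currently open
        self.chunks = []        # text pieces collected for that child

    def startElement(self, name, attrs):
        if name == "book":
            self.book = {}
        elif self.book is not None and self.field is None:
            # A direct child of <book>: start collecting its text.
            self.field = name
            self.chunks = []

    def characters(self, content):
        # Text can arrive in several pieces, so accumulate instead of assigning.
        if self.field is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "book":
            self.on_book(self.book)
            self.book = None
        elif name == self.field:
            self.book[self.field] = "".join(self.chunks).strip()
            self.field = None
```

Something like `parse("myfile.xml", BookHandler(print))` would then print each book as soon as its closing tag is read. Passing a callback rather than collecting every book in a list keeps memory use roughly constant regardless of file size.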
I haven’t tried using other DOM parsers such as lxml on XML documents of this size, but I suspect that lxml will still take a considerable time and use a considerable amount of memory to parse your XML document. That could slow down your development if every time you run your code you have to wait for it to read in an 84MB XML document.
Finally, I don’t believe the Greek, Spanish and Arabic characters you mention will cause a problem.