I’m dealing with quite a big XML file that I need to parse, and because of memory usage problems I was thinking about reading only parts of the file at a time. Is there a way to do this? Thanks.
Depending on the format of your data, ElementTree or lxml (which supports the ElementTree API) will probably do what you need. ElementTree is a bit of a hybrid between event-oriented and DOM-oriented parsers: its iterparse() method lets you iterate over high-level subtrees, interrogating the internals of each subtree in turn.
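To make the iterparse() idea concrete, here is a minimal sketch using the built-in xml.etree.ElementTree. The sample document, the "product" and "name" tags, and the collect_names helper are made up for illustration; your own file will have different element names.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for a large file on disk.
SAMPLE = (
    b"<catalog>"
    b"<product id='1'><name>Widget</name></product>"
    b"<product id='2'><name>Gadget</name></product>"
    b"</catalog>"
)

def collect_names(source):
    """Return the <name> text of every <product> in the source."""
    names = []
    # iterparse yields (event, element) pairs while the file is being read.
    # An "end" event means the element and all of its children are complete.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "product":
            # Within a finished subtree the usual ElementTree API applies.
            names.append(elem.findtext("name"))
    return names

names = collect_names(io.BytesIO(SAMPLE))
```

With a real file you would pass a filename or an open binary file object instead of the BytesIO wrapper.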
This method is slower than SAX (in my use I’ve noticed it take 2-4 times as long), but the resulting code ends up being easier to understand, maintain, and reuse. Compared to a straight-up DOM parser, memory use is much more manageable, provided you discard (clear) visited elements as you iterate. My experience is only with the built-in xml.etree.ElementTree library; lxml or other libraries which support the API (or perform similar functions differently) will have different characteristics.
ElementTree works well iteratively if you can easily break the document into chunks: for example, a document that contains thousands of product descriptions, where the root element contains essentially a list of products that can easily be iterated over. If, on the other hand, your documents simply contain a lot of unstructured/unparsed data, you may still have some work ahead of you to make memory usage manageable.
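For the product-list case, one common pattern is sketched below: grab the root element from the first event, then clear it after each product has been handled so that finished subtrees do not pile up in memory. The element names, the iter_products helper, and the generated data are assumptions for the example, and it assumes the products are direct children of the root.

```python
import io
import xml.etree.ElementTree as ET

def iter_products(source, tag="product"):
    """Yield one dict per <product>, releasing each subtree after use."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the "start" of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Copy out what is needed before the element is discarded.
            yield {"id": elem.get("id"), "name": elem.findtext("name")}
            # Drop the children processed so far from the root.
            root.clear()

# Hypothetical data: a root element holding a flat list of products.
DATA = b"<catalog>" + b"".join(
    b"<product id='%d'><name>Item %d</name></product>" % (i, i)
    for i in range(1, 4)
) + b"</catalog>"

products = list(iter_products(io.BytesIO(DATA)))
```

Because the function is a generator, the caller can process products one at a time rather than building the full list as this small demonstration does.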
Hope that helps.