As the title says, I have a huge XML file (several GB) that looks like this:
<root>
  <keep>
    <stuff> ... </stuff>
    <morestuff> ... </morestuff>
  </keep>
  <discard>
    <stuff> ... </stuff>
    <morestuff> ... </morestuff>
  </discard>
</root>
and I’d like to transform it into a much smaller one which retains only a few of the elements.
My parser should do the following:
1. Parse through the file until a relevant element starts.
2. Copy the whole relevant element (with its children) to the output file, then go back to step 1.
Step 1 is easy with SAX and effectively impossible with a DOM parser, which has to build the whole tree in memory.
Step 2 is annoying with SAX, but easy with a DOM parser or XSLT.
So, is there a neat way to combine SAX and a DOM parser to do the task?
Yes. Write a SAX content handler, and when it encounters an element you want to keep, build a DOM tree for just that element. I’ve done this with very large files, and it works very well.
It’s actually very easy: as soon as you encounter the start of the element you want, you set a flag in your content handler, and from then on you forward every event to the DOM builder. When you encounter the end of the element, you clear the flag and write out the result.
(For more complex cases, where the element you want can contain nested elements of the same name, you’ll need a stack or a depth counter instead of a flag, but that’s still quite easy to do.)
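The question doesn’t name a language, so here is a minimal sketch of the approach in Python, using the standard library’s xml.sax and xml.dom.minidom. The class name SubtreeExtractor and its parameters wanted and out are made up for illustration. It uses a depth counter rather than a plain flag, so nested elements with the same name are handled too.

```python
import io
import xml.sax
from xml.dom.minidom import Document


class SubtreeExtractor(xml.sax.ContentHandler):
    """SAX handler that builds a small DOM tree for each wanted element.

    Illustrative sketch: the class and parameter names are hypothetical.
    """

    def __init__(self, wanted, out):
        super().__init__()
        self.wanted = wanted  # name of the element to keep
        self.out = out        # text stream that receives the kept elements
        self.depth = 0        # 0 means "currently outside a wanted element"
        self.doc = None       # DOM document that owns the current subtree
        self.node = None      # DOM node currently being filled

    def startElement(self, name, attrs):
        if self.depth == 0:
            if name != self.wanted:
                return  # step 1: skip everything until a wanted element starts
            self.doc = Document()
            self.node = self.doc
        # Inside a wanted element: mirror the SAX event into the DOM tree.
        self.depth += 1
        elem = self.doc.createElement(name)
        for key, value in attrs.items():
            elem.setAttribute(key, value)
        self.node.appendChild(elem)
        self.node = elem

    def characters(self, content):
        if self.depth > 0:
            self.node.appendChild(self.doc.createTextNode(content))

    def endElement(self, name):
        if self.depth == 0:
            return
        self.depth -= 1
        self.node = self.node.parentNode
        if self.depth == 0:
            # Step 2: the subtree is complete. Write it out, then drop it so
            # memory use stays bounded by the largest kept element.
            self.out.write(self.doc.documentElement.toxml())
            self.out.write("\n")
            self.doc.unlink()
            self.doc = self.node = None


if __name__ == "__main__":
    sample = (b"<root><keep><stuff>a</stuff><keep>b</keep></keep>"
              b"<discard><stuff>c</stuff></discard></root>")
    out = io.StringIO()
    xml.sax.parseString(sample, SubtreeExtractor("keep", out))
    print(out.getvalue(), end="")
    # prints: <keep><stuff>a</stuff><keep>b</keep></keep>
```

For a real multi-gigabyte file you would call xml.sax.parse(path, handler) with an output file opened for writing instead of parsing an in-memory string. The sketch writes the kept elements one after another, so you would still need to write an enclosing root element around them to get a well-formed document. Between writing the subtree and dropping it, you have a complete DOM tree that you can inspect or modify.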