I need to split an XML file (~400 MB) in two, so that a legacy app can process it. At the moment it’s throwing an exception when the file is over around 300 MB.
As I can’t change the app which is doing the processing, I thought I could write a console app to split the file in two first. What’s the best way of doing this? It needs to be automated so I can’t use a text editor, and I’m using C#.
I suppose the considerations are:
- writing a header to the new files after the split
- finding a good place to split (not in the middle of an ‘object’)
- closing off tags and the file correctly in the first file, opening tags correctly in the second file
Any suggestions?
The “best” way is likely to be based on XmlReader and XmlWriter. Using these “streaming” APIs avoids needing to load the whole XML object model in memory (and with DOM, i.e. XmlDocument, that can need considerably more memory than the text data).
Using these APIs is harder than just loading the document: your implementation needs to track the context (e.g. current node and ancestor list), but in this case that wouldn’t be complex (just enough to open the elements to the current state when opening each output document).
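As a rough illustration of the streaming approach, here is a minimal sketch. It assumes the file is a single root element containing many sibling ‘object’ elements, so the only context to track is the root element itself (its name and attributes, which are reopened in every output file). The names XmlSplitter, Split, outputPrefix and itemsPerFile are made up for this example, and the split is by item count rather than by file size; text or comments sitting directly under the root are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XmlSplitter
{
    // Hypothetical helper: writes <outputPrefix>1.xml, <outputPrefix>2.xml, ...
    // and returns the number of files written.
    public static int Split(string inputPath, string outputPrefix, int itemsPerFile)
    {
        if (itemsPerFile < 1) throw new ArgumentOutOfRangeException(nameof(itemsPerFile));

        int fileCount = 0, itemsInFile = 0;
        XmlWriter writer = null;
        try
        {
            using (XmlReader reader = XmlReader.Create(inputPath))
            {
                // Skip the XML declaration, comments and whitespace up to the root element.
                reader.MoveToContent();
                if (reader.NodeType != XmlNodeType.Element)
                    throw new XmlException("No root element found.");

                // Remember the root element so it can be reopened in each output file.
                string prefix = reader.Prefix, name = reader.LocalName, ns = reader.NamespaceURI;
                var attributes = new List<string[]>();
                while (reader.MoveToNextAttribute())
                    attributes.Add(new[] { reader.Prefix, reader.LocalName, reader.NamespaceURI, reader.Value });
                reader.MoveToElement();

                reader.Read(); // step inside the root element
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Depth == 1)
                    {
                        if (writer == null)
                        {
                            // Start a new output file: header plus reopened root element.
                            fileCount++;
                            writer = XmlWriter.Create(outputPrefix + fileCount + ".xml");
                            writer.WriteStartDocument();
                            writer.WriteStartElement(prefix, name, ns);
                            foreach (string[] a in attributes)
                                writer.WriteAttributeString(a[0], a[1], a[2], a[3]);
                        }

                        // Copies the whole child element and leaves the reader
                        // on the node that follows it, so no Read() is needed here.
                        writer.WriteNode(reader, false);

                        if (++itemsInFile >= itemsPerFile)
                        {
                            Finish(writer);
                            writer = null;
                            itemsInFile = 0;
                        }
                    }
                    else
                    {
                        reader.Read(); // whitespace, comments, the root end tag, ...
                    }
                }
            }

            if (writer != null) { Finish(writer); writer = null; }
            return fileCount;
        }
        finally
        {
            if (writer != null) writer.Dispose(); // only reached with an open writer on error
        }
    }

    // Closes the root element and the document, then releases the file.
    private static void Finish(XmlWriter writer)
    {
        writer.WriteEndElement();
        writer.WriteEndDocument();
        writer.Dispose();
    }
}
```

A call such as XmlSplitter.Split("big.xml", "part", 500000) would then produce part1.xml, part2.xml and so on (the file names and the count are placeholders). If the legacy app’s limit is really about bytes, the item counter could be swapped for a check on the output stream’s size; and if the ‘objects’ are nested deeper than directly under the root, the list of open ancestors would need to be tracked and reopened in the same way as the root is here.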