I need to iterate through a large XML file (~2GB) and selectively copy certain nodes to one or more separate XML files.
My first thought is to use XPath to iterate through matching nodes and for each node test which other file(s) the node should be copied to, like this:
    var doc = new XPathDocument(@"C:\Some\Path.xml");
    var nav = doc.CreateNavigator();
    var nodeIter = nav.Select("//NodesOfInterest");
    while (nodeIter.MoveNext())
    {
        foreach (Thing thing in ThingsThatMightGetNodes)
        {
            if (thing.AllowedToHaveNode(nodeIter.Current))
            {
                thing.WorkingXmlDoc.AppendChild(... nodeIter.Current ...);
            }
        }
    }
In this implementation, Thing defines a public System.Xml.XmlDocument WorkingXmlDoc to hold the nodes for which AllowedToHaveNode() returns true. What I don’t understand is how to create a new XmlNode that is a copy of nodeIter.Current.
If there’s a better approach I would be glad to hear it as well.
Evaluation of an XPath expression requires that the whole XML document (XML Infoset) be in RAM.
For an XML file whose textual representation exceeds 2GB, the in-memory representation is usually several times larger, so typically more than 10GB of RAM should be available just to hold the XML document.
Therefore, while not impossible, it may be preferable to use another technique, especially on a server that must keep resources readily available for many requests.
XmlReader (and the classes derived from it) is an excellent tool for this scenario. It is fast and forward-only, and it does not need to keep the nodes it has already read in memory. Also, your logic will remain almost the same.
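As a rough sketch of how the XmlReader approach could look, the code below streams through the input, turns each matching element into an XmlNode with XmlDocument.ReadNode, and imports a copy into every target document that accepts it. The names NodeDistributor, Distribute and the allowed delegate are hypothetical stand-ins for the question's Thing.AllowedToHaveNode logic. The sketch also assumes each target document already has a root element, and that matching elements nested inside another match do not need to be copied separately (ReadNode consumes the whole subtree).

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class NodeDistributor
{
    // Streams through `reader` and, for each element named `elementName`,
    // appends a deep copy to the root of every target for which `allowed`
    // returns true. Returns the number of matching elements found.
    // (Hypothetical helper; `allowed` stands in for Thing.AllowedToHaveNode.)
    public static int Distribute(
        XmlReader reader,
        string elementName,
        IEnumerable<XmlDocument> targets,
        Func<XmlNode, XmlDocument, bool> allowed)
    {
        // Scratch document that owns each node only while it is being tested.
        var scratch = new XmlDocument();
        int matches = 0;

        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
            {
                // ReadNode builds the element (with its subtree) and leaves the
                // reader on the node that follows it, so Read() must not be
                // called again here or the next sibling would be skipped.
                XmlNode node = scratch.ReadNode(reader);
                matches++;

                foreach (XmlDocument target in targets)
                {
                    if (allowed(node, target))
                    {
                        // A node belongs to exactly one document, so each
                        // target gets its own deep copy via ImportNode.
                        target.DocumentElement.AppendChild(target.ImportNode(node, true));
                    }
                }
            }
            else
            {
                reader.Read();
            }
        }
        return matches;
    }
}
```

For the file in the question, the reader would be created with XmlReader.Create(@"C:\Some\Path.xml") inside a using block, and each Thing's WorkingXmlDoc would be passed as a target. Only the current matching element is held in memory while it is tested, although the target documents themselves still grow with every node they receive.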