I am using .NET to parse an XML file with about 20 million lines (1.56 GB), building LINQ objects out of the data, then inserting it into a SQL database. It is taking a really long time.
To improve performance I am considering asking for a pipe-delimited file instead. I was also wondering whether Perl might be any faster. Does anyone have suggestions for speeding up this process?
Here’s a radical thought, and I honestly don’t know if it’ll improve your performance, but it has a good chance of doing so. I’ll bet you’re instantiating your context object once and then using it to insert all the records, right? Doing so means the context keeps tracking every object you insert until it is disposed, and that would explain performance degrading over time.
Now, you could clear the context cache, but I have a nuttier idea. Context objects are designed to have minimal instantiation overhead (the documentation says so, anyway; I haven’t tested that assertion), so it might help to instantiate a fresh context on each iteration, i.e. re-create the context at the same time you create the object. Or, and this is a better idea, maintain an internal list of your data objects and hand the list off to a separate method every n iterations to commit to the database. That method should both instantiate and dispose the context, so nothing accumulates between batches. Make sense?
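To make that concrete, here’s a minimal sketch of the batching idea. It assumes LINQ to SQL with a hypothetical `DataContext` subclass called `MyDataContext` exposing a `Records` table, a `Record` entity type, and some existing `ParseRecords` method that streams records out of your XML file — all of those names are illustrative, not from your code:

```csharp
// Sketch only: MyDataContext, Record, and ParseRecords are hypothetical
// stand-ins for your own types and parsing loop.
const int BatchSize = 1000; // tune this empirically

var batch = new List<Record>(BatchSize);

foreach (var record in ParseRecords(xmlPath))
{
    batch.Add(record);
    if (batch.Count >= BatchSize)
    {
        CommitBatch(batch);
        batch.Clear(); // drop our references so the objects can be collected
    }
}
if (batch.Count > 0)
    CommitBatch(batch); // flush the remainder

static void CommitBatch(List<Record> batch)
{
    // A fresh context per batch means the change tracker never holds
    // more than BatchSize objects at a time.
    using (var db = new MyDataContext())
    {
        db.Records.InsertAllOnSubmit(batch);
        db.SubmitChanges();
    } // Dispose() here releases everything the context was tracking
}
```

The key point is the `using` block: each batch gets its own short-lived context that is disposed immediately after `SubmitChanges`, rather than one long-lived context tracking 20 million objects.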