My C# application loops over 5000 files and writes the values of XPaths to cells in an Excel sheet. It is quite slow, processing 40 files a second.
After profiling I discovered that this line accounts for over 50% of all time used:
XmlDocument.Load(filename);
To write to Excel, I loop over each XPath of each file and do:
worksheet.Cells[row, col] = value;
Would it be faster to load all the XML files into memory at once (they are less than 20 KB each), store them in a collection, and then write them all out to Excel?
I understand that multi-threading might reduce performance rather than improve it, as the process is IO-bound.
It might not be IO bound. Most of the time is spent constructing the XML DOM. However, multi-threading would introduce a possible issue, depending on where you’re writing the results to Excel. I don’t know for sure, but I wouldn’t be surprised if you could only access the Office objects from a single thread.
You would have to add an additional step of collecting the results before writing to the Excel object. This would have to be some sort of synchronized collection, either with another thread dedicated to writing to Excel, or with the writing done after all of the files are processed.
Now, going back to the first point: most of the time is spent loading the DOM. Based on the results from http://www.nearinfinity.com/blogs/joe_ferner/performance_linq_to_sql_vs.html, if you still need DOM-related methods, I would look at using XDocument instead. The interface isn’t that far off XmlDocument, so it should be an easy adaptation.
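As a rough illustration of the XDocument route, here is a minimal sketch that loads one file and pulls the text of the first element matching each XPath. The class and method names (XDocumentExtractor, ReadValues) are made up for this example, and it assumes every XPath selects an element; attribute or scalar XPaths would need XPathEvaluate instead.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.XPath; // brings in the XPathSelectElement extension method

public static class XDocumentExtractor
{
    // Hypothetical helper: loads one file with XDocument and returns the text
    // of the first element matching each XPath (null when nothing matches).
    public static List<string> ReadValues(string filename, IEnumerable<string> xpaths)
    {
        XDocument doc = XDocument.Load(filename);
        var values = new List<string>();
        foreach (string xpath in xpaths)
        {
            // Assumption: each XPath selects an element, not an attribute.
            XElement match = doc.XPathSelectElement(xpath);
            values.Add(match?.Value);
        }
        return values;
    }
}
```

Each returned list lines up with the order of the XPaths, so it can be written out as one row.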
For the fastest XML processing, look into XmlReader. However, this does not get you any DOM functions, and can be harder to deal with than the two DOM-based methods.
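To give an idea of what the XmlReader style looks like, here is a small sketch that streams through a file once and records the text of the first element found for each wanted name. The names XmlReaderExtractor and ReadElements are invented for the example. It matches on element name rather than on a full XPath, and it assumes the wanted elements contain only text (ReadElementContentAsString throws if an element has child elements).

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XmlReaderExtractor
{
    // Hypothetical helper: forward-only pass over the file, no DOM is built.
    // Assumption: wanted elements hold plain text, not nested elements.
    public static Dictionary<string, string> ReadElements(string filename, ISet<string> wantedNames)
    {
        var found = new Dictionary<string, string>();
        using (XmlReader reader = XmlReader.Create(filename))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element
                    && wantedNames.Contains(reader.LocalName)
                    && !found.ContainsKey(reader.LocalName))
                {
                    string name = reader.LocalName;
                    // This call consumes the element and leaves the reader on
                    // the node after its end tag, so no extra Read() here.
                    found[name] = reader.ReadElementContentAsString();
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return found;
    }
}
```

This shows the trade-off mentioned above: you track position yourself instead of asking a DOM for a path.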
So, in short, first try converting to the XDocument methods; that might roughly double your speed. I would then look at converting the processing to multithreaded (perhaps using PLINQ over the list of files). Finally, if performance is still not enough, try using the XmlReader interface.
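The PLINQ step could look something like the sketch below. ParallelExtractor and ReadAll are hypothetical names, and the same element-only XPath assumption applies. Only the XML parsing runs in parallel; the results come back as an ordinary list, so the Excel writing can stay on the calling thread afterwards.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class ParallelExtractor
{
    // Hypothetical helper: parses files in parallel and returns one row of
    // values per file. Nothing here touches Excel.
    public static List<string[]> ReadAll(IEnumerable<string> filenames, string[] xpaths)
    {
        return filenames
            .AsParallel()
            .AsOrdered() // keep rows in the same order as the input files
            .Select(file =>
            {
                XDocument doc = XDocument.Load(file);
                // Assumption: each XPath selects an element; null if no match.
                return xpaths
                    .Select(xp => doc.XPathSelectElement(xp)?.Value)
                    .ToArray();
            })
            .ToList();
    }
}
```

Whether this actually helps depends on how much of the time is parsing versus disk access, so it is worth profiling before and after.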
EDIT in response to collection types to use:
I see two basic options for this, depending on how long it takes to process the XML files. If it is a small percentage of the overall process (most time is spent dealing with Excel), just have a List<T>, where T is some representation of the data you need to write to Excel (it could even be a string if that’s all you need), with the calls to .Add surrounded by lock statements. Then, once XML processing is complete, the Excel writer iterates over this collection.
Another option, if XML processing takes a while and you’re on .NET 4, is to look at the ConcurrentQueue class. This will provide thread safety on its own (and really, now that I look, one of the Concurrent collections could be used in the first case too, either ConcurrentQueue or BlockingCollection). You would then have threads processing XML, and a consumer thread that writes out to Excel.
A few other things. Expanding on a comment on the question: if you’re doing nothing that needs Excel-specific functions, you could just write out to CSV. The library here http://www.codeproject.com/Articles/86973/C-CSV-Reader-and-Writer is rather straightforward to use, and handles embedded commas. The downside of this is the Big Scary Dialogs Excel throws up if you try to save a CSV. These might be overcome with user training, however.
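Going back to the consumer-thread option for a moment, here is a minimal sketch of that arrangement using BlockingCollection. Everything named here is an assumption for illustration: XmlToSheetPipeline and Run are made-up names, parseFile stands in for whatever XML extraction you use, and writeRow stands in for the code that puts one row into the worksheet (or a CSV file).

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class XmlToSheetPipeline
{
    // Hypothetical pipeline: many producers parse files, one consumer writes.
    public static void Run(
        IEnumerable<string> filenames,
        Func<string, string[]> parseFile, // stand-in for the XML extraction
        Action<string[]> writeRow)        // stand-in for the Excel/CSV write
    {
        using (var rows = new BlockingCollection<string[]>())
        {
            // Single consumer: the only thread that ever calls writeRow.
            Task consumer = Task.Factory.StartNew(() =>
            {
                foreach (string[] row in rows.GetConsumingEnumerable())
                {
                    writeRow(row);
                }
            }, TaskCreationOptions.LongRunning);

            try
            {
                // Producers: parse files in parallel and queue the results.
                Parallel.ForEach(filenames, file => rows.Add(parseFile(file)));
            }
            finally
            {
                // Tell the consumer no more rows are coming, then wait for it
                // to drain the queue before the collection is disposed.
                rows.CompleteAdding();
                consumer.Wait();
            }
        }
    }
}
```

Rows arrive at the consumer in whatever order the files finish parsing, so if row order matters, carry the file name or an index along with each row. Whether Excel interop is happy being driven from that consumer thread is something I have not verified.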
Another option, provided you aren’t doing so already, would be to use the OpenXML library to generate Excel files if you’re targeting at least Excel 2007 (although Excel 2003 can read xlsx files with an add-in). I imagine that, since this library manipulates XML, it would be faster than dealing with Excel interop, and also safer (no dialogs from Excel, no zombie processes, etc.).