I am writing an application that works with the file system. When the app first starts, it runs a quick routine to load the requested files and folders into memory for later (time intensive) processing. (See code below). At this point it gives me a count of how many files are going to be processed, which is important in order to display the progress bar.
Once I have the count and the file data, I need to store the data for later processing (e.g. as a global variable or property or class). The problem is that it is being stored as “var” by necessity since it is using LINQ. When I break and examine the variable, it is being stored as a rather complicated mix of SelectQueryOperator and AnonymousType.
My first thought was to go ahead and loop through the data and convert it to simple data that I can store as a List<> (e.g. store filename and path), but doing that literally takes minutes (up to 10 minutes or more) to process. I am going to have to loop through all that data later anyway in order to do the processing, and there is no way my users are going to sit and wait for a list to be built up first.
How can I store this data so that I can access it later without having to convert it into something else first?
    var fileNames =
        from dir in Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
        select dir;

    var fileContents = from file in fileNames.AsParallel()
                       // Use AsOrdered to preserve source ordering
                       let extension = Path.GetExtension(file)
                       let Text = File.ReadAllText(file)
                       select new { Text, FileName = file };
Let’s simplify this a bit, and also make `var` explicit where we can. The first query, `fileNames`, is exactly the same as:
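A sketch of that method-syntax form, assuming `path` is the root folder from the question. The `MethodSyntax` wrapper class is hypothetical and only there so the snippet compiles on its own:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class MethodSyntax // hypothetical wrapper, not part of the question's code
{
    // `from dir in source select dir` is translated by the compiler
    // into source.Select(dir => dir).
    public static IEnumerable<string> FileNames(string path) =>
        Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
                 .Select(dir => dir);
}
```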
Which is exactly the same as:
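The identity `Select` only passes each path through, so the same values come out without it. A sketch with the type written out instead of `var`, again assuming `path` from the question and a hypothetical wrapper class:

```csharp
using System.Collections.Generic;
using System.IO;

static class ExplicitType // hypothetical wrapper, not part of the question's code
{
    public static IEnumerable<string> FileNames(string path)
    {
        // EnumerateFiles already returns IEnumerable<string>,
        // so the type can be named directly.
        IEnumerable<string> fileNames =
            Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories);
        return fileNames;
    }
}
```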
Now for the second query, `fileContents`.
Going for a one-line-wonder doesn’t normally help readability, but it will help put our object creation all in one place for the sake of discussion:
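A sketch of what that single expression might look like, assuming the unused `extension` variable from the question can be dropped. The `OneLiner` wrapper and its `QueryType` method are hypothetical and exist only so the snippet compiles on its own:

```csharp
using System;
using System.IO;
using System.Linq;

static class OneLiner // hypothetical wrapper, not part of the question's code
{
    // Returns the runtime type of the query. No file contents are read,
    // because the query is built here but never enumerated.
    public static Type QueryType(string path)
    {
        // `var` is unavoidable: the element type is anonymous and has no name to write.
        // Assumption: the question's unused `extension` is omitted.
        var fileContents = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories).AsParallel().Select(file => new { Text = File.ReadAllText(file), FileName = file });
        return fileContents.GetType();
    }
}
```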
Which is a `ParallelQuery<T>` for an anonymous `T`. To make this something we can store we need to stop using anonymous classes and use a named one, here called `NameAndContents`, with the same two properties.
There’s now nothing stopping you from storing that in a field of type `ParallelQuery<NameAndContents>`.
You might want to check on the logic here though in two ways:
1. The workings of `Directory.EnumerateFiles` are such that it needs to know the value of a given iteration in order to calculate the next. (It’s based on the `FindNextFile` Windows API function.) This makes it poor at being parallelised. Just how much the inherent waiting involved in `ReadAllText` balances that out is hard to predict. I’d not only test it against the non-parallel version, but I’d re-test after any changes made, because any changes are going to throw off that balance in a new way.
2. The biggest hit here is that `ReadAllText`. If it’s at all possible to replace that with something that makes use of the text in a more on-demand way, then it could be a big win.
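To tie this back to the storage question, here is a minimal sketch of the named type kept in a member of type `ParallelQuery<NameAndContents>`. The shape of `NameAndContents` (constructor, read-only properties) and the `FileBatch` holder are assumptions for illustration. `NameAndLines` is a hypothetical variant for the second point that uses `File.ReadLines` as one possible on-demand option; the answer does not name a specific replacement for `ReadAllText`.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Named replacement for the anonymous type, so the query's element type can be written down.
public sealed class NameAndContents
{
    public NameAndContents(string fileName, string text)
    {
        FileName = fileName;
        Text = text;
    }

    public string FileName { get; }
    public string Text { get; }
}

// Hypothetical variant: keep only the name and read lines from disk when asked.
public sealed class NameAndLines
{
    public NameAndLines(string fileName) { FileName = fileName; }

    public string FileName { get; }

    // File.ReadLines yields lines as they are enumerated instead of loading the whole file.
    public IEnumerable<string> Lines => File.ReadLines(FileName);
}

// Hypothetical holder showing the query stored for later processing.
public sealed class FileBatch
{
    public FileBatch(string path)
    {
        // Only the query is built here; ReadAllText runs when FileContents is enumerated.
        FileContents = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
            .AsParallel()
            .Select(file => new NameAndContents(file, File.ReadAllText(file)));
    }

    public ParallelQuery<NameAndContents> FileContents { get; }
}
```

Constructing a `FileBatch` reads no file contents. The reads happen each time `FileContents` is enumerated, so the later processing step should enumerate it once rather than repeatedly.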