I am new to LINQ and am currently using it to process large datasets in CSV format (half a million records). I open the files with a StreamReader and implement the IEnumerable<Person> interface to yield the results. Below is the main part of the reading code:
    IEnumerator<Person> IEnumerable<Person>.GetEnumerator()
    {
        using (StreamReader streamReader = new StreamReader(filename))
        {
            // Skip the header row.
            streamReader.ReadLine();
            while (!streamReader.EndOfStream)
            {
                // One record per line, fields separated by commas.
                string[] values = streamReader.ReadLine().Split(',');
                Person p = new Person();
                p.Name = values[0];
                p.Age = Convert.ToInt16(values[1]);
                p.Score = Convert.ToDouble(values[2]);
                p.PlotArea = Convert.ToInt16(values[3]);
                p.ForecastConsumption = Convert.ToDouble(values[4]);
                p.Postcode = values[5];
                p.PropertyType = values[6];
                p.Bedrooms = Convert.ToInt16(values[7]);
                p.Occupancy = Convert.ToInt16(values[8]);
                // Hand back one record at a time; the file stays open until enumeration ends.
                yield return p;
            }
        }
    }
and here is a typical query:
    var query = from person in reader
                where person.Score > 36.55 && person.Bedrooms < 3
                select person;
My question is this: every time I want to run a query, the StreamReader has to open the file. Is there any way that I can open the file once and run multiple queries?
FYI, I am very impressed with LINQ; it takes 1.2 seconds to run the query above. It’s just that I will be running a lot of rules against the datasets.
Well, the simplest way would be to load the whole file into a list once (e.g. by calling ToList() on the reader) and then run every query against that list.
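A minimal sketch of that idea, assuming a trimmed-down Person with only two of the question's fields and a hypothetical ReadPeople() method standing in for the CSV reader (any IEnumerable<Person> would do):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down stand-in for the question's Person class (assumption: only two fields shown).
class Person
{
    public double Score { get; set; }
    public int Bedrooms { get; set; }
}

static class Program
{
    // Hypothetical stand-in for the question's CSV reader.
    public static IEnumerable<Person> ReadPeople()
    {
        yield return new Person { Score = 40.0, Bedrooms = 2 };
        yield return new Person { Score = 30.0, Bedrooms = 2 };
        yield return new Person { Score = 50.0, Bedrooms = 4 };
    }

    static void Main()
    {
        // Enumerate the source exactly once and keep the parsed objects in memory.
        List<Person> people = ReadPeople().ToList();

        // Every query below runs against the list, not the file.
        var highScoreSmallHomes = people.Where(p => p.Score > 36.55 && p.Bedrooms < 3);
        var largeHomes = people.Where(p => p.Bedrooms >= 4);

        Console.WriteLine(highScoreSmallHomes.Count());
        Console.WriteLine(largeHomes.Count());
    }
}
```

With the real reader, the only change should be replacing ReadPeople() with the reader instance; the file is then opened once, when ToList() runs.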
Obviously that will take quite a lot of memory, but it’s going to be the simplest way to go. If you want to join multiple queries together, you’ll have to work out exactly what you want to do – the composition model in LINQ is mostly in terms of chaining query operations together rather than building multiple queries from the same source.
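One way to picture "multiple queries in one pass" is to step outside query syntax and test each record against a set of named predicates while enumerating the source once. This is only a sketch of one possible approach, not something LINQ provides out of the box: RunRules, the rule names, the trimmed-down Person, and ReadPeople() are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Trimmed-down stand-in for the question's Person class (assumption: only two fields shown).
class Person
{
    public double Score { get; set; }
    public int Bedrooms { get; set; }
}

static class Program
{
    // Hypothetical stand-in for the question's CSV reader.
    public static IEnumerable<Person> ReadPeople()
    {
        yield return new Person { Score = 40.0, Bedrooms = 2 };
        yield return new Person { Score = 30.0, Bedrooms = 2 };
        yield return new Person { Score = 50.0, Bedrooms = 4 };
    }

    // Evaluates every rule against every record while enumerating the source only once.
    public static Dictionary<string, List<Person>> RunRules(
        IEnumerable<Person> source,
        IDictionary<string, Func<Person, bool>> rules)
    {
        var results = new Dictionary<string, List<Person>>();
        foreach (var name in rules.Keys)
        {
            results[name] = new List<Person>();
        }

        foreach (var person in source)
        {
            foreach (var rule in rules)
            {
                if (rule.Value(person))
                {
                    results[rule.Key].Add(person);
                }
            }
        }
        return results;
    }

    static void Main()
    {
        var rules = new Dictionary<string, Func<Person, bool>>
        {
            ["HighScoreSmallHome"] = p => p.Score > 36.55 && p.Bedrooms < 3,
            ["LargeHome"] = p => p.Bedrooms >= 4,
        };

        var results = RunRules(ReadPeople(), rules);
        foreach (var entry in results)
        {
            Console.WriteLine($"{entry.Key}: {entry.Value.Count}");
        }
    }
}
```

Only the matching records are kept in memory here, at the cost of giving up the usual chained-query style.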
Failing that, if neither the complexity of running multiple queries in one pass nor the memory cost of loading the whole file works for you, you’re likely to be stuck with reading the file multiple times.
One in-between option which may be more memory efficient would be to read all the lines into memory (so you only perform the disk activity once) but then parse those lines multiple times. That will be much more efficient in terms of IO, but worse in terms of CPU.
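A rough sketch of this in-between option, assuming the column layout from the question (Name at index 0, Score at 2, Bedrooms at 7) and a trimmed-down Person. The Parse helper and the temporary sample file are hypothetical, and parsing with the invariant culture is my assumption (the question's Convert calls use the current culture):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// Trimmed-down stand-in for the question's Person class (assumption: only three fields shown).
class Person
{
    public string Name { get; set; }
    public double Score { get; set; }
    public int Bedrooms { get; set; }
}

static class Program
{
    // Parses one CSV line using the question's column positions.
    public static Person Parse(string line)
    {
        string[] values = line.Split(',');
        return new Person
        {
            Name = values[0],
            Score = double.Parse(values[2], CultureInfo.InvariantCulture),
            Bedrooms = int.Parse(values[7], CultureInfo.InvariantCulture),
        };
    }

    static void Main()
    {
        // Write a tiny sample file so the sketch runs on its own.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[]
        {
            "Name,Age,Score,PlotArea,ForecastConsumption,Postcode,PropertyType,Bedrooms,Occupancy",
            "Alice,30,40.5,120,1000.0,AB1 2CD,Flat,2,1",
            "Bob,45,20.0,300,2500.0,EF3 4GH,House,4,3",
        });

        // The only disk read: all raw lines are held in memory as strings.
        string[] lines = File.ReadAllLines(path);

        // Lazy projection: each query that enumerates 'people' re-parses the lines.
        IEnumerable<Person> people = lines.Skip(1).Select(line => Parse(line));

        Console.WriteLine(people.Count(p => p.Score > 36.55 && p.Bedrooms < 3));
        Console.WriteLine(people.Count(p => p.Bedrooms >= 4));

        File.Delete(path);
    }
}
```

Because the Select is deferred, no Person objects are retained between queries; each query pays the parsing cost again, which is the CPU trade-off described above.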