I have a few very large files, each 500 MB or larger, containing integer values (in fact it's a bit more complex). I'm reading those files in a loop and calculating the max value across all files. For some reason the memory grows constantly during processing; it looks like the GC never releases the memory acquired by the previous instances of lines.
I cannot stream the data and have to use GetFileLines for each file. Given that the actual amount of memory required to store the lines for one file is 500 MB, why do I get 5 GB of RAM used after 10 files have been processed? Eventually it crashes with an out-of-memory exception after 15 files.
Calculation:
int max = int.MinValue;
for (int i = 0; i < 10; i++)
{
    IEnumerable<string> lines = Db.GetFileLines(i);
    max = Math.Max(max, lines.Max(t => int.Parse(t)));
}
GetFileLines code:
public static List<string> GetFileLines(int i)
{
    string path = GetPath(i);
    //
    List<string> lines = new List<string>();
    string line;
    using (StreamReader reader = File.OpenText(path))
    {
        while ((line = reader.ReadLine()) != null)
        {
            lines.Add(line);
        }
        reader.Close();
        reader.Dispose(); // should I bother?
    }
    return lines;
}
For very large files, the File.ReadLines method would be the best fit because it uses deferred execution: it does not load all lines into memory, and it is simple to use.
More information:
http://msdn.microsoft.com/en-us/library/dd383503.aspx
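As a rough sketch of the ReadLines approach, the calculation loop from the question might look like the following. It assumes one integer per line, as in the question. GetPath here is only a hypothetical stand-in (the question does not show the real one), MaxOfFile is a name made up for this example, and the small temp files exist only so the sketch can run on its own.

```csharp
using System;
using System.IO;
using System.Linq;

public static class Program
{
    // Hypothetical stand-in for the question's GetPath(i); the real one is not shown.
    public static string GetPath(int i) =>
        Path.Combine(Path.GetTempPath(), $"values_{i}.txt");

    // Streams the file line by line; the lines are never collected into a list,
    // so each string can be garbage-collected once it has been parsed.
    public static int MaxOfFile(string path) =>
        File.ReadLines(path).Max(line => int.Parse(line));

    public static void Main()
    {
        // Create a few small sample files so the sketch is runnable.
        for (int i = 0; i < 3; i++)
            File.WriteAllLines(GetPath(i), new[] { "1", (i * 10).ToString(), "-5" });

        int max = int.MinValue;
        for (int i = 0; i < 3; i++)
            max = Math.Max(max, MaxOfFile(GetPath(i)));

        Console.WriteLine(max); // prints 20
    }
}
```

Note that Max throws an InvalidOperationException on an empty file, so real code may need to handle that case.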
Edit:
Behind the scenes, ReadLines is implemented as an iterator that reads one line at a time from a StreamReader as you enumerate it.
Also, parallel processing is recommended to improve performance when you have multiple files.
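To illustrate both points, here is a simplified sketch. It is not the framework's actual source, only an iterator written in the same spirit, followed by one possible way to process several files in parallel with PLINQ. The names ReadLinesLazily and MaxAcrossFiles are made up for this example, and the sample files are created in the temp directory purely so it runs on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LazyLines
{
    // Simplified illustration of deferred reading; NOT the actual framework source.
    // Nothing is read until the caller starts enumerating, and only the
    // current line is handed out at any time.
    public static IEnumerable<string> ReadLinesLazily(string path)
    {
        using (StreamReader reader = File.OpenText(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        } // leaving the using block disposes the reader
    }

    // Computes each file's maximum on worker threads, then combines the results.
    public static int MaxAcrossFiles(IEnumerable<string> paths) =>
        paths.AsParallel()
             .Select(p => ReadLinesLazily(p).Max(s => int.Parse(s)))
             .Max();

    public static void Main()
    {
        // Small sample files so the sketch runs on its own.
        var paths = new List<string>();
        for (int i = 0; i < 4; i++)
        {
            string path = Path.Combine(Path.GetTempPath(), $"lazy_{i}.txt");
            File.WriteAllLines(path, new[] { "3", (i * 7).ToString(), "-1" });
            paths.Add(path);
        }

        Console.WriteLine(MaxAcrossFiles(paths)); // prints 21
    }
}
```

Whether the parallel version is actually faster depends on where the time goes: it tends to help when parsing dominates, and much less when all files sit on one disk and reading is the bottleneck.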