I am working on a C# program that reads very large files and checks them for different attributes and fields. I had been testing with files of under 1 million lines, and it was performing as expected. I recently tested it on a file with 2.5 million lines, and it took 4 hours to run.
I am using a custom reading function to read each character so that I can find every CR and LF, because it is very important that every line contains them. I have tested the reading function separately and it took about 14 minutes to read the file, which I find reasonable for reading every character of 2.5 million lines of 1500 characters each. I will include my reading function, although it doesn’t seem to be causing the issue.
My reading function adds each character to a string, and then I check different values in the string: for example, whether the line length is correct, whether the file contains a header, and whether the header contains the correct values, as well as specific checks such as whether character positions 403-404 are a number, whether field 1250-1300 is not null, etc.
My question is: what can I do to figure out what is causing the slowdown and make my program more efficient? I have tried checking the time at the beginning and end of each line loop, and it doesn’t seem to change. However, every 100,000 lines takes significantly longer than the previous. As an example, processing lines 10,000 to 20,000 took less than 3 seconds, while lines 830,000 to 840,000 took about 35 seconds. I have considered using multiple threads, but I don’t think that will help in my case with reading lines from a file. Thoughts? Thanks for the help!
    static void ReadMyLine(ref string currentLine, string filePath, ref int asciiValue, ref Boolean isMissingCR, ref Boolean isMissingLF, ref Boolean isReversed, ref StreamReader file)
    {
        Boolean endOfRow = false;
        isMissingCR = false;
        isMissingLF = false;
        isReversed = false;
        currentLine = "";
        while (endOfRow == false)
        {
            asciiValue = file.Read();
            if (asciiValue == 10 || asciiValue == 13)
            {
                int asciiValueTemp = file.Peek();
                if (asciiValue == 13 && asciiValueTemp == 10)
                {
                    endOfRow = true;
                    asciiValue = file.Read();
                }
                else if (asciiValue == 10 && asciiValueTemp == 13) // CRLF Reversed
                {
                    asciiValue = file.Read();
                    endOfRow = true;
                    isReversed = true;
                }
                else if (asciiValue == 10) // Missing CR
                {
                    isMissingCR = true;
                    endOfRow = true;
                }
                else if (asciiValue == 13) // Missing LF
                {
                    isMissingLF = true;
                    endOfRow = true;
                }
                else
                    endOfRow = true;
            }
            else if (asciiValue != -1)
                currentLine += char.ConvertFromUtf32(asciiValue);
            else
                endOfRow = true;
        }
    }
Here’s the first thing I looked for, and the first thing I’d change:
    currentLine += char.ConvertFromUtf32(asciiValue);
Don’t do that. Using string concatenation in a loop can kill performance: every concatenation copies the whole string built so far, so you get O(N²) complexity. Use a StringBuilder instead. See my article on when to use StringBuilder for more explanation.
There may well be more you can do, but just changing to use StringBuilder will be a huge improvement.
It’s also unclear why you have so many ref parameters. Why are you passing in asciiValue at all? Why are you passing the StreamReader by reference? Anything using this many ref parameters makes me very nervous. Why don’t you have a type which encapsulates everything you really want to return from the method?
You may want to read my article on parameter passing to get a better understanding of ref.
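To make both suggestions concrete, here is a rough sketch of how the method might look with a StringBuilder and a single return value instead of the ref parameters. The names LineEnding, LineReadResult and LineReader are hypothetical and chosen only for illustration; nothing here is taken from your real code beyond the CR/LF rules in ReadMyLine. It also assumes that a plain cast to char is acceptable in place of char.ConvertFromUtf32, since Read() returns a single UTF-16 code unit.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical type: how the line that was just read was terminated.
public enum LineEnding
{
    CrLf,      // correct terminator: CR followed by LF
    LfCr,      // CR and LF present but reversed
    LfOnly,    // LF with no CR (missing CR)
    CrOnly,    // CR with no LF (missing LF)
    EndOfFile  // input ended before any terminator was seen
}

// Hypothetical type: bundles everything the method needs to hand back,
// replacing the ref parameters of the original signature.
public sealed class LineReadResult
{
    public LineReadResult(string text, LineEnding ending)
    {
        Text = text;
        Ending = ending;
    }

    public string Text { get; }
    public LineEnding Ending { get; }
}

public static class LineReader
{
    // Reads characters up to the next terminator (or the end of input),
    // accumulating them in a StringBuilder rather than a string.
    public static LineReadResult ReadLine(TextReader reader)
    {
        var builder = new StringBuilder();
        while (true)
        {
            int value = reader.Read();
            if (value == -1)
            {
                return new LineReadResult(builder.ToString(), LineEnding.EndOfFile);
            }
            if (value == '\r')
            {
                if (reader.Peek() == '\n')
                {
                    reader.Read(); // consume the LF of the CR LF pair
                    return new LineReadResult(builder.ToString(), LineEnding.CrLf);
                }
                return new LineReadResult(builder.ToString(), LineEnding.CrOnly);
            }
            if (value == '\n')
            {
                if (reader.Peek() == '\r')
                {
                    reader.Read(); // consume the CR of the reversed pair
                    return new LineReadResult(builder.ToString(), LineEnding.LfCr);
                }
                return new LineReadResult(builder.ToString(), LineEnding.LfOnly);
            }
            builder.Append((char)value); // amortised O(1), no copy of the whole line
        }
    }
}
```

A caller would invoke LineReader.ReadLine repeatedly on the same StreamReader, run the per-line checks on Text, and stop once Ending is EndOfFile. Accepting a TextReader rather than a StreamReader is a design choice in this sketch: it lets the same method be exercised against a StringReader in tests.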