I have the following code to optimize. As I expect the file to be large, I did not use a HashMap to store the lines but opted instead for a String array. I tried testing the logic with n of about 500,000 and it ran for approximately 14 minutes. I would definitely like to make it a lot faster than that and would appreciate any help or suggestions.
    public static void RemoveDuplicateEntriesinFile(string filepath)
    {
        if (filepath == null)
            throw new ArgumentException("Please provide a valid FilePath");
        String[] lines = File.ReadAllLines(filepath);
        for (int i = 0; i < lines.Length; i++)
        {
            for (int j = (i + 1); j < lines.Length; j++)
            {
                if ((lines[i] != null) && (lines[j] != null) && lines[i].Equals(lines[j]))
                {
                    // replace duplicates with null
                    lines[j] = null;
                }
            }
        }
        File.WriteAllLines(filepath, lines);
    }
Thanks in Advance!
“As I expect the file to be large, I did not use a HashMap to store the lines but opted instead for a String array.”
I don’t agree with your reasoning; the larger the file, the more of a performance benefit you’ll get from hashing. In your code, you’re comparing each line with all succeeding lines, which gives O(n²) computational complexity for the whole file. For n = 500,000, that is up to n(n − 1)/2 ≈ 1.25 × 10¹¹ line comparisons.
On the other hand, if you were to use an efficient hashing algorithm, each hash lookup would complete in O(1) on average, so the computational complexity of processing your entire file becomes O(n).
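To make the complexity argument concrete, here is a minimal sketch of the hash-based approach. It is an illustration under assumptions, not a drop-in replacement: the class name DuplicateRemover and the helper RemoveDuplicates are hypothetical, and unlike the code in the question it drops duplicate lines entirely instead of replacing them with null.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DuplicateRemover // hypothetical name, for illustration only
{
    // Returns the lines with duplicates removed, keeping the first
    // occurrence of each line in its original order.
    public static List<string> RemoveDuplicates(IEnumerable<string> lines)
    {
        // Ordinal comparison matches the behaviour of string.Equals(string).
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var unique = new List<string>();
        foreach (string line in lines)
        {
            // HashSet<T>.Add returns false if the item is already in the set,
            // so each line costs one hash lookup instead of a scan.
            if (seen.Add(line))
                unique.Add(line);
        }
        return unique;
    }

    public static void RemoveDuplicateEntriesInFile(string filepath)
    {
        if (filepath == null)
            throw new ArgumentNullException(nameof(filepath));

        string[] lines = File.ReadAllLines(filepath);
        File.WriteAllLines(filepath, RemoveDuplicates(lines));
    }
}
```

Each line is hashed once and checked against the set once, so the work grows linearly with the number of lines rather than quadratically.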
Try using a HashSet<string> and observe the difference in the processing time:
Edit: Could you try the following version of the algorithm and check how long it takes? It’s optimized to minimize memory consumption:
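A minimal sketch of what a memory-conscious variant could look like, offered as an assumption rather than as the code originally posted here. It streams the input with File.ReadLines instead of loading the whole file into an array, and writes unique lines to a temporary file as it goes. The class name StreamingDuplicateRemover and the ".tmp" suffix are hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StreamingDuplicateRemover // hypothetical name
{
    public static void RemoveDuplicateEntriesInFile(string filepath)
    {
        if (filepath == null)
            throw new ArgumentNullException(nameof(filepath));

        // Assumed temp-file location: next to the original file.
        string tempPath = filepath + ".tmp";
        var seen = new HashSet<string>(StringComparer.Ordinal);

        using (var writer = new StreamWriter(tempPath))
        {
            // File.ReadLines enumerates lazily, so only one line of the
            // input (plus the set of distinct lines) is held at a time.
            foreach (string line in File.ReadLines(filepath))
            {
                // Add returns false for a line that was already seen.
                if (seen.Add(line))
                    writer.WriteLine(line);
            }
        }

        // Replace the original file with the de-duplicated copy.
        File.Delete(filepath);
        File.Move(tempPath, filepath);
    }
}
```

Note that the set still holds every distinct line, so the saving is largest when the file contains many duplicates; what this version avoids is keeping the full array of lines in memory alongside the set.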