I have the following code to optimize. As I expect the file to be

Question

Asked: June 3, 2026 (2026-06-03T17:59:17+00:00)

I have the following code to optimize. As I expect the file to be large, I did not use a HashMap to store the lines but opted instead for a String array. I tried testing the logic with n of about 500,000 and it ran for approximately 14 minutes. I would definitely like to make it a lot faster than that and would appreciate any help or suggestions.

public static void RemoveDuplicateEntriesinFile(string filepath)
{
    if (filepath == null)
        throw new ArgumentException("Please provide a valid FilePath");
    String[] lines = File.ReadAllLines(filepath);
    for (int i = 0; i < lines.Length; i++)
    {
        for (int j = (i + 1); j < lines.Length; j++)
        {
            if ((lines[i] != null) && (lines[j] != null) && lines[i].Equals(lines[j]))
            {
                // replace duplicates with null
                lines[j] = null;
            }
        }
    }

    File.WriteAllLines(filepath, lines);
}

Thanks in Advance!


1 Answer

Editorial Team · Answer 1 · 2026-06-03T17:59:27+00:00

“As I expect the file to be large, I did not use a HashMap to store the lines but opted instead for a String array.”

I don’t agree with your reasoning; the larger the file, the more of a performance benefit you’ll get from hashing. In your code, you’re comparing each line with all succeeding lines, which takes O(n²) comparisons for the whole file.

On the other hand, if you were to use a hash-based collection, each lookup would complete in O(1) on average, so the computational complexity of processing your entire file becomes O(n).
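To put rough numbers on this for the n of about 500,000 mentioned in the question, here is a back-of-the-envelope count. It counts inner-loop iterations only and ignores the cost of each string comparison, so treat it as an order-of-magnitude estimate rather than a timing prediction:

```latex
\[
\underbrace{\frac{n(n-1)}{2}}_{\text{nested loops}}
  = \frac{500{,}000 \times 499{,}999}{2}
  \approx 1.25 \times 10^{11}
\qquad\text{versus}\qquad
\underbrace{n}_{\text{hash lookups}} = 5 \times 10^{5}
\]
```

That is a difference of roughly five orders of magnitude in the number of basic operations.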

Try using a HashSet<string> and observe the difference in the processing time:

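// Requires: using System; using System.Collections.Generic; using System.IO;
public static void RemoveDuplicateEntriesinFile(string filepath)
{
    if (filepath == null)
        throw new ArgumentException("Please provide a valid FilePath");

    // The constructor reads the whole file and drops duplicates as it goes,
    // so reading has finished before WriteAllLines reopens the same path.
    HashSet<string> hashSet = new HashSet<string>(File.ReadLines(filepath));
    File.WriteAllLines(filepath, hashSet);
}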
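One caveat with the constructor-based version: HashSet<string> is documented as an unordered collection, so it makes no promise to enumerate lines in file order. If order matters, a minimal sketch along these lines keeps the first occurrence of each line in input order. The names DedupeDemo and DistinctInOrder are hypothetical and not part of the original answer:

```csharp
using System;
using System.Collections.Generic;

public static class DedupeDemo
{
    // Hypothetical helper: yields each line the first time it is seen,
    // in the same order as the input.
    public static IEnumerable<string> DistinctInOrder(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        foreach (string line in lines)
        {
            // HashSet<T>.Add returns false when the item is already present.
            if (seen.Add(line))
                yield return line;
        }
    }

    public static void Main()
    {
        string[] input = { "b", "a", "b", "c", "a" };
        // prints b,a,c
        Console.WriteLine(string.Join(",", DistinctInOrder(input)));
    }
}
```

Because the helper is lazy, its result should be materialized (for example into a List<string>) before being written back to the same path that is being read.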

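Edit: Could you try the following version of the algorithm and check how long it takes? It streams the file line by line and keeps only a fixed-size SHA-256 hash of each line in memory, which reduces memory consumption when the lines are long: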

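// Requires: using System; using System.Collections.Generic; using System.IO;
//           using System.Security.Cryptography; using System.Text;
HashSet<string> hashSet = new HashSet<string>();
string tempFilePath = filepath + ".tmp";

// SHA256.Create() is used instead of SHA256Managed, which is marked obsolete in .NET 6 and later.
using (HashAlgorithm hashAlgorithm = SHA256.Create())
using (var fs = new FileStream(tempFilePath, FileMode.Create, FileAccess.Write))
using (var sw = new StreamWriter(fs))
{
    foreach (string line in File.ReadLines(filepath))
    {
        // Store a fixed-size hash of each line instead of the line itself.
        byte[] lineBytes = Encoding.UTF8.GetBytes(line);
        byte[] hashBytes = hashAlgorithm.ComputeHash(lineBytes);
        string hash = Convert.ToBase64String(hashBytes);

        // Add returns false for a hash that was already seen, so only the
        // first occurrence of each line is written.
        if (hashSet.Add(hash))
            sw.WriteLine(line);
    }
}

// Replace the original file with the de-duplicated copy.
File.Delete(filepath);
File.Move(tempFilePath, filepath);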
