I have two text files (TXT) which contain over 2 million distinct file names. I want to loop through all the names in the first file and find those that are also present in the second text file.
I have tried looping through the StreamReader but it takes a lot of time. I also tried the code below, but it still takes too much time.
    StreamReader first = new StreamReader(path);
    string strFirst = first.ReadToEnd();
    string[] strarrFirst = strFirst.Split('\n');
    bool found = false;
    StreamReader second = new StreamReader(path2);
    string str = second.ReadToEnd();
    string[] strarrSecond = str.Split('\n');
    for (int j = 0; j < strarrFirst.Length; j++)
    {
        found = false;
        for (int i = 0; i < strarrSecond.Length; i++)
        {
            if (strarrFirst[j] == strarrSecond[i])
            {
                found = true;
                break;
            }
        }
        if (!found)
        {
            Console.WriteLine(strarrFirst[j]);
        }
    }
What is a good way to compare the files?
How about this:
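A minimal sketch of the set-based approach, assuming .NET 4 or later. The class name FileNameComparer and the method name CommonLines are illustrative, and the exact shape of the code may differ from what was originally posted here:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FileNameComparer
{
    // Returns the lines of secondPath that also occur in firstPath.
    // (Hypothetical helper name; not part of any library.)
    public static IEnumerable<string> CommonLines(string firstPath, string secondPath)
    {
        // Build the set once: one pass over the first file,
        // after which each lookup is O(1) on average.
        var firstNames = new HashSet<string>(File.ReadLines(firstPath));

        // Stream the second file lazily and keep only the lines
        // that were seen in the first file.
        return File.ReadLines(secondPath)
                   .Where(line => firstNames.Contains(line));
    }
}
```

You would call it as foreach (string name in FileNameComparer.CommonLines(path, path2)) Console.WriteLine(name);. Note that the loop in the question prints the names that are missing from the second file; if that is what you actually want, build the set from the second file and negate the Contains test.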
That’s O(N + M), whereas your current solution tests every line in the first file against every line in the second file, which is O(N * M). With 2 million names per file, that is up to roughly 4 trillion string comparisons.
That’s assuming you’re using .NET 4. Otherwise, you could use File.ReadAllLines, but that will read the whole file into memory. Or you could write the equivalent of File.ReadLines yourself; it’s not terribly hard.
Ultimately you’re likely to be limited by file IO by the time you’ve got rid of the O(N * M) problem in your current code, and there’s not much way to get round that.
EDIT: For .NET 2, first let’s implement something like ReadLines.
Now we really want to use a HashSet<T>, but that wasn’t in .NET 2, so let’s use Dictionary<TKey, TValue> instead:
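A minimal sketch covering both steps, assuming only features available in .NET 2 (generics and iterator blocks, no LINQ). The class name Net2FileNameComparer and the method name CommonLines are illustrative, and the bool stored in the dictionary is an unused placeholder, since only the keys matter:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class Net2FileNameComparer
{
    // Lazily yields one line at a time, similar in spirit to
    // File.ReadLines in .NET 4.
    public static IEnumerable<string> ReadLines(string path)
    {
        using (TextReader reader = File.OpenText(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    // Returns the lines of secondPath that also occur in firstPath.
    // (Hypothetical helper name; not part of any library.)
    public static IEnumerable<string> CommonLines(string firstPath, string secondPath)
    {
        // Dictionary stands in for HashSet<T>: the keys are the names,
        // the bool values are never read.
        Dictionary<string, bool> firstNames = new Dictionary<string, bool>();
        foreach (string line in ReadLines(firstPath))
        {
            firstNames[line] = true; // the indexer tolerates repeated lines
        }

        // Stream the second file and yield only names seen in the first.
        foreach (string line in ReadLines(secondPath))
        {
            if (firstNames.ContainsKey(line))
            {
                yield return line;
            }
        }
    }
}
```

As with the .NET 4 version, only the first file’s names are held in memory; the second file is read one line at a time.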