I have a simple method that compares an array of FileInfo objects against a list of filenames to check which files have already been processed. The unprocessed files are then returned.
The loop in this method iterates over about 250,000 FileInfo objects. This is taking an obscene amount of time to complete.
The inefficiency is obviously the Contains method call on the processedFiles collection.
First, how can I check that my suspicion about the cause is correct? Second, how can I improve the method to speed it up?
    public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, List<string> processedFiles)
    {
        List<FileInfo> unprocessedFiles = new List<FileInfo>();
        foreach (FileInfo fileInfo in allFiles)
        {
            if (!processedFiles.Contains(fileInfo.Name))
            {
                unprocessedFiles.Add(fileInfo);
            }
        }
        return unprocessedFiles;
    }
Answer:
List<T>'s Contains method runs in linear time, since it potentially has to enumerate the entire list to prove the existence or non-existence of an item. With n files to check against m processed names, the loop therefore costs on the order of n × m comparisons. I would suggest you use a HashSet<string> or similar instead. A HashSet<T>'s Contains method is designed to run in constant O(1) time, i.e. it shouldn't depend on the number of items in the set. This small change should make the entire method run in linear time.
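A minimal sketch of that change, assuming the original signature is kept and the set is rebuilt on each call. The class name FileFilter and the local processedFileSet are illustrative names, not part of the question:

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical container class, only here so the sketch compiles on its own.
public static class FileFilter
{
    public static List<FileInfo> GetUnprocessedFiles(FileInfo[] allFiles, List<string> processedFiles)
    {
        // Build the set once per call; this costs time proportional to processedFiles.Count.
        var processedFileSet = new HashSet<string>(processedFiles);

        var unprocessedFiles = new List<FileInfo>();
        foreach (FileInfo fileInfo in allFiles)
        {
            // Hash lookup (constant time on average) instead of a linear scan of the list.
            if (!processedFileSet.Contains(fileInfo.Name))
            {
                unprocessedFiles.Add(fileInfo);
            }
        }
        return unprocessedFiles;
    }
}
```

The default comparer here is ordinal and case-sensitive, which matches what List<string>.Contains does. If names should match regardless of case, passing StringComparer.OrdinalIgnoreCase to the HashSet<string> constructor is one option, but that changes the behaviour of the original method.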
I would suggest 3 improvements, if possible:
1. Accept an ISet<T> as a parameter. This way, you won't have to reconstruct the set every time.
2. Avoid mixing different representations of the same thing (string and FileInfo) in this fashion. Pick one and go with it.
3. Consider the HashSet<T>.ExceptWith method instead of doing the looping yourself. Bear in mind that this will mutate the collection.

If you can use LINQ, and you can afford to build up a set on every call, a LINQ query is another option.
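The three suggestions and the LINQ route could be sketched roughly as follows. The class and method names are hypothetical, and the choice of Enumerable.Except for the LINQ version is an assumption about what fits the description (it builds a set from its second argument on every call):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical container class; all method names below are illustrative.
public static class FileFilterSketches
{
    // Suggestion 1: the caller owns the set, so it is not rebuilt per call.
    public static List<FileInfo> GetUnprocessedFiles(IEnumerable<FileInfo> allFiles, ISet<string> processedFiles)
    {
        return allFiles.Where(f => !processedFiles.Contains(f.Name)).ToList();
    }

    // Suggestion 3: ExceptWith removes the processed names in place.
    // Note that this mutates the set passed in as allFiles.
    public static HashSet<string> RemoveProcessed(HashSet<string> allFiles, IEnumerable<string> processedFiles)
    {
        allFiles.ExceptWith(processedFiles);
        return allFiles;
    }

    // LINQ version working on names only (suggestion 2: one representation).
    // Except builds a set from processedFiles internally, then streams allFiles.
    public static IEnumerable<string> GetUnprocessedNames(IEnumerable<string> allFiles, IEnumerable<string> processedFiles)
    {
        return allFiles.Except(processedFiles);
    }
}
```

Two behaviours of Except are worth keeping in mind: it is lazily evaluated, and it returns distinct elements, so duplicate names in allFiles are collapsed into one.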