I have implemented the following algorithm in C# to list the files inside a folder recursively and find duplicates by name.
- Begin iterating through the list of files in the directory & its subdirectories.
- Store file name & path in a list.
- If the current file matches any other file in the list, mark both files as duplicate.
- Fetch all files from the list which were marked duplicate.
- Group them by name & return.
The implementation is very slow on a folder containing 50,000 files and 12,000 subdirectories, since disk reads are time-consuming. Even LINQ’s AsParallel() doesn’t help much.
Implementation:
class FileTuple
{
    public string FileName { set; get; }
    public string ContainingFolder { set; get; }
    public bool HasDuplicate { set; get; }

    public override bool Equals(object obj)
    {
        if (this.FileName == (obj as FileTuple).FileName)
            return true;
        return false;
    }
}
- The FileTuple class keeps track of file names & containing directory; the HasDuplicate flag keeps track of duplicate status.
- I have overridden the Equals method to compare only file names in the collection of FileTuples.
The following method finds the duplicate files and returns them as a list.
private List<FileTuple> FindDuplicates()
{
    List<FileTuple> fileTuples = new List<FileTuple>();

    // Read all files from the given path
    List<string> enumeratedFiles = Directory
        .EnumerateFiles(txtFolderPath.Text, "*.*", SearchOption.AllDirectories)
        .Where(str => str.Contains(".exe") || str.Contains(".zip"))
        .AsParallel()
        .ToList();

    foreach (string filePath in enumeratedFiles)
    {
        var name = Path.GetFileName(filePath);
        var folder = Path.GetDirectoryName(filePath);
        var currentFile = new FileTuple { FileName = name, ContainingFolder = folder, HasDuplicate = false, };

        int foundIndex = fileTuples.IndexOf(currentFile);

        // mark both files as duplicate, if found in list
        // assuming only two duplicate files
        if (foundIndex != -1)
        {
            currentFile.HasDuplicate = true;
            fileTuples[foundIndex].HasDuplicate = true;
        }

        // keep track of the files visited
        fileTuples.Add(currentFile);
    }

    List<FileTuple> duplicateFiles = fileTuples
        .Where(fileTuple => fileTuple.HasDuplicate)
        .Select(fileTuple => fileTuple)
        .OrderBy(fileTuple => fileTuple.FileName)
        .AsParallel()
        .ToList();

    return duplicateFiles;
}
Can you please suggest a way to improve the performance?
Thank you for your help.
Well, one obvious improvement would be to use a Dictionary<FileTuple, FileTuple> as well as a List<FileTuple>. That way you wouldn’t have an O(N) IndexOf operation on each check. Note that you’ll also need to override GetHashCode() – you should already have a warning about this.
I doubt that it’ll make very much difference though – I’d expect this to be mostly IO-bound.
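As a rough illustration of the dictionary suggestion, here is a minimal sketch. It assumes the same FileTuple shape as in the question and adds a GetHashCode() that agrees with Equals. The DuplicateFinder class and its path-list parameter are hypothetical, introduced only so the logic can be tried without touching the disk. The file-name key should not be changed once a tuple is in the dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class FileTuple
{
    public string FileName { get; set; }
    public string ContainingFolder { get; set; }
    public bool HasDuplicate { get; set; }

    // Two tuples are equal when their file names match (same rule as in the question).
    public override bool Equals(object obj)
    {
        return obj is FileTuple other && FileName == other.FileName;
    }

    // Must agree with Equals: equal file names give equal hash codes.
    public override int GetHashCode()
    {
        return FileName == null ? 0 : FileName.GetHashCode();
    }
}

// Hypothetical helper, not part of the original code.
static class DuplicateFinder
{
    public static List<FileTuple> FindDuplicates(IEnumerable<string> filePaths)
    {
        var fileTuples = new List<FileTuple>();
        // Maps each tuple to the first tuple seen with the same file name.
        var seen = new Dictionary<FileTuple, FileTuple>();

        foreach (string filePath in filePaths)
        {
            var currentFile = new FileTuple
            {
                FileName = Path.GetFileName(filePath),
                ContainingFolder = Path.GetDirectoryName(filePath)
            };

            // Hash lookup replaces the linear IndexOf scan.
            if (seen.TryGetValue(currentFile, out FileTuple first))
            {
                currentFile.HasDuplicate = true;
                first.HasDuplicate = true;
            }
            else
            {
                seen.Add(currentFile, currentFile);
            }

            fileTuples.Add(currentFile);
        }

        return fileTuples
            .Where(f => f.HasDuplicate)
            .OrderBy(f => f.FileName)
            .ToList();
    }
}
```

With a real folder, this could be called as DuplicateFinder.FindDuplicates(Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)), where path is whatever folder you want to scan. The lookup cost drops, but the directory walk itself is unchanged, so the overall gain may well be small.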
Additionally, I doubt that the filtering and ordering at the end is going to be a significant bottleneck, so using AsParallel in the final step isn’t likely to do much. Of course, you should measure all of this.
Finally, the whole method can be made rather simpler, without even needing the HasDuplicate flag or any overriding of Equals/GetHashCode: