I have an iTunes library XML backup file – about 15 MB.
I have 20K music files on my C drive and about 25K files on my E drive, under identical folder structures.
I am traversing the first location file by file and checking whether each file exists in the second location. That part works for me.
Now, for all such duplicate files, if the file path from E drive exists in the XML, but the C drive path does not exist in the XML, then I want to delete the file from the C drive.
What is the best way to check whether a string exists in the XML file (I have to do this at least 20K times)?
Sort the list of strings you’re matching on alphabetically, then build an index array that records, for each character that starts at least one of the strings, where that character’s block begins in the sorted list. Depending on how varied the strings are and whether your match is case-sensitive, you may want to index on the second character as well.
Read the file character by character with a stream to minimize the memory footprint. For each character, look it up in the index array to see where its block starts and ends in the sorted list, so you can pull out that character’s page, if any strings start with that character combination. Then keep filtering inside the page as you read further characters, until you have one match left and the next character reduces the matches to zero.
Remove that string from the list of strings to match, and put it in another list if you want to keep track of it. Then start checking the index again on the next character, and do the same each time you run out of matches.
The index narrows each lookup to a small page of the sorted list, which minimizes the number of strings you have to iterate over.
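The steps above can be sketched roughly as follows in Python. This is only an illustration under some assumptions: the names `build_first_char_index` and `find_present` are made up, the index is one character deep, and the paths are assumed to have already been converted to the same form the XML uses (for example, URL-encoded if the file stores them that way). Instead of tracking a single page at a time, it keeps every partial match alive so that overlapping candidates are not missed.

```python
import io


def build_first_char_index(sorted_patterns):
    """Map each first character to its (start, end) slice in the sorted list."""
    index = {}
    for i, pattern in enumerate(sorted_patterns):
        start, _ = index.get(pattern[0], (i, i))
        index[pattern[0]] = (start, i + 1)
    return index


def find_present(stream, patterns):
    """Return the subset of patterns that occur somewhere in the text stream."""
    sorted_patterns = sorted({p for p in patterns if p})
    index = build_first_char_index(sorted_patterns)
    found = set()
    active = []  # (pattern, number of characters matched so far)

    while True:
        ch = stream.read(1)  # one character at a time keeps memory use small
        if not ch:
            break
        next_active = []
        # Advance the partial matches that agree with this character.
        for pattern, pos in active:
            if pattern[pos] == ch:
                if pos + 1 == len(pattern):
                    found.add(pattern)
                else:
                    next_active.append((pattern, pos + 1))
        # Start new candidates from this character's page of the index.
        start, end = index.get(ch, (0, 0))
        for pattern in sorted_patterns[start:end]:
            if pattern in found:
                continue  # already matched, no need to look for it again
            if len(pattern) == 1:
                found.add(pattern)
            else:
                next_active.append((pattern, 1))
        active = next_active
    return found


# Hypothetical sample data, not taken from a real library file.
sample_xml = io.StringIO("<string>file://localhost/E:/Music/a.mp3</string>")
wanted = ["E:/Music/a.mp3", "C:/Music/a.mp3", "E:/Music/b.mp3"]
present = find_present(sample_xml, wanted)
```

For a real file you would pass an open text file handle instead of the `io.StringIO` object; Python buffers the underlying reads, so `read(1)` does not hit the disk once per character.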
This could give you a two character depth index:
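As a minimal sketch in Python (the function name `build_two_char_index` and the sample strings are made up for illustration), a two-character index over a sorted list might be built like this. It assumes a case-sensitive match and a plain dictionary keyed on the first two characters.

```python
def build_two_char_index(sorted_strings):
    """Map each two-character prefix to the position of its first string."""
    index = {}
    for position, value in enumerate(sorted_strings):
        key = value[:2]  # strings shorter than two characters use what they have
        if key not in index:
            index[key] = position
    return index


# Hypothetical sample data; sorting first keeps equal prefixes adjacent.
strings = sorted(["abba", "acdc", "abc", "beatles", "bee gees"])
two_char_index = build_two_char_index(strings)
```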
Then to find the starting index in your list you just access:
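A hedged Python sketch of that lookup, assuming the index is a dictionary keyed on two-character prefixes over an already sorted list (the literal data and the helper name `page_for` are hypothetical):

```python
# Hypothetical sorted list and its two-character index, written out by hand.
strings = ["abba", "abc", "acdc", "beatles", "bee gees"]
index = {"ab": 0, "ac": 2, "be": 3}


def page_for(prefix, strings, index):
    """Return the run of strings that share the prefix's first two characters."""
    key = prefix[:2]
    start = index.get(key)  # starting position in the sorted list, or None
    if start is None:
        return []
    page = []
    for value in strings[start:]:
        if not value.startswith(key):
            break  # sorted order means the page ends at the first mismatch
        page.append(value)
    return page


start_of_be = index["be"]
be_page = page_for("be", strings, index)
```

The direct access `index["be"]` is the starting position; `page_for` just walks forward from there until the prefix changes.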