What is the most memory-efficient way to remove duplicate lines in a large text file using C++?
Let me clarify: I'm not asking for code, just the best method. The duplicate lines are not guaranteed to be adjacent. I realize that an approach optimized for minimal memory usage will be slower; however, that is my restriction, as the files are far too large to hold in memory.
I would hash each line, then seek back to the lines whose hashes are not unique and compare them individually (or in a buffered fashion). This works well on files with a relatively low occurrence of duplicates.
With a hash table you can cap the memory used at a constant amount: the table itself can be tiny (say, 256 slots) or larger, but either way its size is a constant you choose. The values stored in the table are the file offsets of the lines with that hash, so you only need about line_count * sizeof(offset) plus a constant to maintain the table.
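To make that concrete, here is a rough sketch of the scheme in C++. It is a sketch only: the file names, the 256-slot table size, and the read_line_at helper are my own illustrative choices, not something from the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

constexpr std::size_t BUCKETS = 256;   // constant-size hash table

// Re-read the line that starts at a previously stored byte offset.
static std::string read_line_at(std::ifstream& in, std::uint64_t offset) {
    in.clear();                        // clear any EOF flag from earlier reads
    in.seekg(static_cast<std::streamoff>(offset));
    std::string line;
    std::getline(in, line);
    return line;
}

int main() {
    std::ifstream in("input.txt");     // illustrative file names
    std::ifstream lookup("input.txt"); // second handle for seek-back comparisons
    std::ofstream out("deduped.txt");

    // table[h] holds the byte offsets of every kept line whose hash is h, so
    // memory use is roughly line_count * sizeof(offset) plus the table itself.
    std::vector<std::vector<std::uint64_t>> table(BUCKETS);

    std::string line;
    std::uint64_t offset = 0;
    while (std::getline(in, line)) {
        std::size_t h = std::hash<std::string>{}(line) % BUCKETS;

        bool duplicate = false;
        for (std::uint64_t prev : table[h]) {
            // Same bucket: seek back and compare the actual line contents.
            if (read_line_at(lookup, prev) == line) { duplicate = true; break; }
        }

        if (!duplicate) {
            table[h].push_back(offset);  // remember where this unique line starts
            out << line << '\n';
        }
        offset = static_cast<std::uint64_t>(in.tellg()); // start of the next line
    }
}
```

Each bucket is just a list of offsets, so the table size only controls how many seek-back comparisons a collision costs, not how much memory the offsets themselves take.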
An even simpler (but much slower) option would be to scan the entire file for every line. That is the most memory-efficient approach possible: you only need to hold two offsets and two bytes to do the comparison. But I prefer the first option.
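For completeness, a rough sketch of that exhaustive scan, again with illustrative file names. Apart from the stream buffers it only ever holds two offsets and reads one byte from each position at a time, at the cost of a quadratic number of comparisons.

```cpp
#include <fstream>
#include <ios>

// Compare the lines that start at offsets a and b, reading one byte from each.
static bool lines_equal(std::istream& sa, std::istream& sb,
                        std::streamoff a, std::streamoff b) {
    sa.clear(); sa.seekg(a);
    sb.clear(); sb.seekg(b);
    for (;;) {
        int ca = sa.get(), cb = sb.get();
        bool end_a = (ca == '\n' || ca < 0);   // newline or EOF ends line a
        bool end_b = (cb == '\n' || cb < 0);
        if (end_a || end_b) return end_a && end_b;
        if (ca != cb) return false;
    }
}

// Return the offset of the line after the one starting at off, or -1 at EOF.
static std::streamoff next_line(std::istream& s, std::streamoff off) {
    s.clear(); s.seekg(off);
    int c = s.get();
    if (c < 0) return -1;                      // already at end of file
    while (c >= 0 && c != '\n') c = s.get();
    return s.fail() ? -1 : static_cast<std::streamoff>(s.tellg());
}

int main() {
    // Three read handles on the same file: one walks line starts, two do the
    // byte-by-byte comparisons.
    std::ifstream walker("input.txt"), sa("input.txt"), sb("input.txt");
    std::ofstream out("deduped.txt");

    for (std::streamoff cur = 0; cur >= 0; cur = next_line(walker, cur)) {
        walker.clear(); walker.seekg(cur);
        if (walker.peek() < 0) break;          // cur points past the last line

        // Compare the current line against every line that precedes it.
        bool duplicate = false;
        for (std::streamoff prev = 0; prev >= 0 && prev < cur;
             prev = next_line(walker, prev))
            if (lines_equal(sa, sb, prev, cur)) { duplicate = true; break; }

        if (!duplicate) {                      // copy the unique line to the output
            sa.clear(); sa.seekg(cur);
            for (int c = sa.get(); c >= 0 && c != '\n'; c = sa.get())
                out.put(static_cast<char>(c));
            out.put('\n');
        }
    }
}
```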