I am trying to efficiently remove duplicate rows from relatively large (several hundred MB) CSV files that are not ordered in any meaningful way. Although I have a technique to do this, it is very brute force, and I am certain there is a more elegant and more efficient way.
To remove duplicates you need some sort of memory that tells you whether you have seen a line before. You can remember either the lines themselves or perhaps a checksum of each one (which is almost safe, since two different lines could in principle share a checksum).
Any solution like that will probably have a “brute force” feel to it.
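As a rough illustration of the "memory" idea, here is a minimal Python sketch. It assumes every CSV row sits on exactly one line (no quoted fields containing newlines) and that the set of digests fits in RAM. The function names `dedupe_lines` and `dedupe_file`, and the choice of SHA-256 as the checksum, are my own assumptions and not something from the question.

```python
import hashlib


def dedupe_lines(lines):
    """Yield each line the first time it appears, keeping input order."""
    seen = set()  # digests of every line seen so far
    for line in lines:
        # Store a fixed-size digest instead of the whole line to save memory.
        digest = hashlib.sha256(line.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


def dedupe_file(src_path, dst_path):
    """Hypothetical helper: copy src_path to dst_path without duplicate lines."""
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.writelines(dedupe_lines(src))
```

Something like `dedupe_file("input.csv", "output.csv")` (placeholder file names) would then stream the file once. If memory is not a concern, storing the lines themselves in the set removes even the small collision risk.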
If you could have the lines sorted before processing them, then the task is fairly easy as duplicates would be next to each other.
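A sketch of the sorted variant, again in Python and again assuming one row per line. Once the input is sorted, only the previous line has to be remembered, which `itertools.groupby` handles by collapsing runs of equal adjacent items. The name `dedupe_sorted` is made up for this example.

```python
from itertools import groupby


def dedupe_sorted(sorted_lines):
    """Yield one copy of each run of identical adjacent lines."""
    # groupby groups consecutive equal items, so sorted input is required
    # for all duplicates to be collapsed.
    for line, _run in groupby(sorted_lines):
        yield line


# Small in-memory demonstration; a file too large for RAM would first need
# to be sorted on disk by some external tool.
rows = ["b,2", "a,1", "b,2", "c,3", "a,1"]
unique_rows = list(dedupe_sorted(sorted(rows)))
```

Note that this approach loses the original row order, which may or may not matter for your data.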