I have a 56 GB wordlist and I would like to remove the duplicates.
I’ve tried to do this in Java, but I run out of space on my laptop after 2.5M words.
So I’m looking for an (online) program or an algorithm that would let me remove all duplicates.
Thanks in advance,
Sir Troll
edit:
What I did in Java was put the words in a TreeSet so that they would be sorted and the duplicates removed.
I think the problem here is the huge amount of data. As a first step I would try to split the data into several files, e.g. one file per first character: words whose first character is ‘a’ go into a.txt, words whose first character is ‘b’ go into b.txt, and so on.
Afterwards I would try a standard sorting algorithm on each file and check whether it copes with the file sizes. Once a file is sorted, removing the duplicates should be easy, because identical words end up next to each other.
If the files are still too big, you can also split on more than one character
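A minimal Java sketch of the split-then-sort idea, in case it helps. The class name `BucketDedup`, the method names, the `prefixLen` parameter and the hex-encoded bucket file names are all made up for illustration. It assumes one word per line, UTF-8 input, and that each single bucket fits in memory. It uses a TreeSet per bucket as in the question's edit, which sorts and drops duplicates in one go.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Stream;

class BucketDedup {

    // Pass 1: write every word into a bucket file chosen by its first prefixLen characters.
    public static void splitIntoBuckets(Path input, Path bucketDir, int prefixLen) throws IOException {
        Files.createDirectories(bucketDir);
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String word;
            while ((word = in.readLine()) != null) {
                if (word.isEmpty()) continue; // skip blank lines
                String prefix = word.substring(0, Math.min(prefixLen, word.length()));
                // Hex-encode the prefix so that '/', upper/lower case etc. are safe as file names.
                StringBuilder name = new StringBuilder();
                for (char c : prefix.toCharArray()) name.append(Integer.toHexString(c)).append('_');
                BufferedWriter w = writers.get(name.toString());
                if (w == null) {
                    w = Files.newBufferedWriter(bucketDir.resolve(name + ".txt"), StandardCharsets.UTF_8);
                    writers.put(name.toString(), w);
                }
                w.write(word);
                w.newLine();
            }
        } finally {
            for (BufferedWriter w : writers.values()) w.close();
        }
    }

    // Pass 2: load one bucket at a time into a TreeSet and append its unique words to the output.
    // The output file must live outside bucketDir.
    public static void dedupBuckets(Path bucketDir, Path output) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8);
             DirectoryStream<Path> buckets = Files.newDirectoryStream(bucketDir, "*.txt")) {
            for (Path bucket : buckets) {
                TreeSet<String> unique = new TreeSet<>();
                try (Stream<String> lines = Files.lines(bucket, StandardCharsets.UTF_8)) {
                    lines.forEach(unique::add);
                }
                for (String word : unique) {
                    out.write(word);
                    out.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java BucketDedup words.txt buckets unique.txt
        Path bucketDir = Paths.get(args[1]);
        splitIntoBuckets(Paths.get(args[0]), bucketDir, 1);
        dedupBuckets(bucketDir, Paths.get(args[2]));
    }
}
```

The same word always lands in the same bucket, so deduplicating each bucket on its own is enough. The result is sorted within each bucket but not across the whole output. The sketch keeps one writer open per bucket, so a large `prefixLen` can run into the operating system's open-file limit.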
e.g: