I got a wordlist which is 56GB and I would like to remove doubles.

Asked: May 23, 2026

I have a wordlist which is 56 GB and I would like to remove the duplicates.
I’ve tried to do this in Java, but I run out of space on my laptop after 2.5M words.
So I’m looking for an (online) program or algorithm that would allow me to remove all duplicates.

Thanks in advance,
Sir Troll

Edit:
What I did in Java was put the words in a TreeSet, so that they would be sorted and the duplicates removed.

1 Answer

Editorial Team · Answer 1 · 2026-05-23T06:19:55+00:00

I think the problem here is the huge amount of data. As a first step I would try to split the data into several files, e.g. one file per first character: words whose first character is ‘a’ go into a.txt, words whose first character is ‘b’ go into b.txt, and so on.

a.txt
b.txt
c.txt
…
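
The splitting step could look roughly like the following Python sketch. The function name `partition_by_first_byte` is made up for illustration. It assumes one word per line, works on raw bytes so the encoding does not matter, and names each file after the hex value of the first byte (`61.txt` instead of a.txt) so that characters like ‘/’ cannot break the filename.

```python
from pathlib import Path


def partition_by_first_byte(source: Path, out_dir: Path) -> list[Path]:
    """Stream the wordlist once and write each word to a bucket file
    chosen by its first byte. Only one line is held in memory at a time."""
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}
    try:
        with open(source, "rb") as src:
            for line in src:
                word = line.rstrip(b"\r\n")
                if not word:
                    continue  # skip empty lines
                # Hex-encode the first byte, e.g. b"a" -> "61".
                key = f"{word[0]:02x}"
                if key not in handles:
                    handles[key] = open(out_dir / f"{key}.txt", "wb")
                handles[key].write(word + b"\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(out_dir / f"{key}.txt" for key in handles)
```

Identical words always share a first byte, so every duplicate ends up in the same file as its original. The sketch keeps one open file per distinct first byte (at most 256), which may hit the open-file limit on some systems.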

Afterwards I would try a standard sorting algorithm on each file and check whether it copes with the file sizes. Once a file is sorted, duplicates sit right next to each other, so removing them should be easy.
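
As a sketch of the sort-then-clean step for a single file, assuming that file now fits in memory (the name `dedupe_bucket` and the byte-wise ordering are choices made for this example):

```python
from pathlib import Path


def dedupe_bucket(bucket: Path, dest: Path) -> int:
    """Sort one bucket in memory and write each distinct word once.
    Returns the number of words written."""
    with open(bucket, "rb") as f:
        words = f.read().splitlines()
    words.sort()  # byte-wise order; equal words become adjacent

    written = 0
    previous = None
    with open(dest, "wb") as out:
        for word in words:
            if word != previous:  # first occurrence in a run of equal words
                out.write(word + b"\n")
                written += 1
                previous = word
    return written
```

Running this over every file and concatenating the outputs gives the wordlist without duplicates. If the files are concatenated in filename order, the result is also sorted byte-wise.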

If the files are still too big, you can also split on more than one character, e.g.:

aa.txt
ab.txt
ac.txt
…
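
A longer prefix means many more files, so keeping one open handle per file may no longer work. The sketch below is one possible way around that: it buffers words per prefix and appends them in batches. The name `resplit_bucket` and the parameters `prefix_len` and `flush_every` are made up for this example, and the output directory is assumed to start empty.

```python
from collections import defaultdict
from pathlib import Path


def resplit_bucket(bucket: Path, out_dir: Path, prefix_len: int = 2,
                   flush_every: int = 100_000) -> list[Path]:
    """Split an oversized bucket by the first prefix_len bytes of each word.
    Words shorter than prefix_len get a bucket named after the whole word."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pending = defaultdict(list)  # prefix (hex) -> words not yet written
    seen_keys = set()
    buffered = 0

    def flush():
        # Append each batch, so only one file is open at any moment.
        for key, words in pending.items():
            with open(out_dir / f"{key}.txt", "ab") as out:
                out.write(b"".join(w + b"\n" for w in words))
        pending.clear()

    with open(bucket, "rb") as src:
        for line in src:
            word = line.rstrip(b"\r\n")
            if not word:
                continue
            key = word[:prefix_len].hex()  # b"ab" -> "6162"
            pending[key].append(word)
            seen_keys.add(key)
            buffered += 1
            if buffered >= flush_every:
                flush()
                buffered = 0
    flush()
    return sorted(out_dir / f"{key}.txt" for key in seen_keys)
```

For example, `resplit_bucket(Path("61.txt"), Path("split_61"))` would produce files such as `6161.txt` (aa…) and `6162.txt` (ab…). Each of these can then be sorted and cleaned on its own.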
