I have to remove duplicate strings from an extremely big text file (100 GB+)

Question

Asked: June 1, 2026

I have to remove duplicate strings from an extremely big text file (100 GB+).

Removing duplicates in memory is hopeless given the size of the data. I tried a Bloom filter, but it was of no use beyond roughly 50 million strings.

In total there are something like 1 trillion+ strings.

I would like to know what ways there are to solve this problem.

My initial idea is to divide the file into a number of sub-files, sort each one, and then merge all the files together.

If you have a better solution than this, please let me know.

Thanks.

1 Answer

Editorial Team · Answer 1 · 2026-06-01T00:43:17+00:00

The key concept you are looking for here is external sorting. You should be able to merge-sort the whole file using the techniques described in that article. Once the file is sorted, identical strings sit next to each other, so a single sequential pass that compares each line with the previous one is enough to remove the duplicates.

If the article is not clear enough, have a look at the referenced implementations, such as this one.
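
To make the idea concrete, here is a minimal sketch of an external sort followed by a de-duplicating merge. It is not taken from the article or from the referenced implementation; the function names (`dedupe_large_file`, `_write_sorted_run`), the `lines_per_chunk` parameter, and the assumption that the file is UTF-8 text with one string per line are all illustrative choices.

```python
import heapq
import itertools
import os
import tempfile


def _write_sorted_run(lines, tmp_dir):
    """Sort one in-memory chunk, drop its duplicates, write it to a temp file."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run")
    with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as run:
        run.writelines(sorted(set(lines)))
    return path


def dedupe_large_file(src_path, dst_path, lines_per_chunk=1_000_000, tmp_dir=None):
    """Write the distinct lines of src_path to dst_path, in sorted order."""
    run_paths = []
    try:
        # Phase 1: cut the input into sorted runs that each fit in memory.
        with open(src_path, encoding="utf-8", newline="\n") as src:
            while True:
                chunk = [
                    # Make sure the last line of the file also ends with "\n".
                    line if line.endswith("\n") else line + "\n"
                    for line in itertools.islice(src, lines_per_chunk)
                ]
                if not chunk:
                    break
                run_paths.append(_write_sorted_run(chunk, tmp_dir))

        # Phase 2: k-way merge of the runs. Equal lines come out adjacent,
        # so comparing with the previous line is enough to drop duplicates.
        runs = [open(p, encoding="utf-8", newline="\n") for p in run_paths]
        try:
            with open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
                previous = None
                for line in heapq.merge(*runs):
                    if line != previous:
                        dst.write(line)
                        previous = line
        finally:
            for run in runs:
                run.close()
    finally:
        # Remove the temporary run files whether or not the merge succeeded.
        for p in run_paths:
            os.remove(p)
```

A call might look like `dedupe_large_file("input.txt", "unique.txt", lines_per_chunk=5_000_000)`, where both file names are placeholders. The output is sorted, so the original line order is not preserved. The sketch opens every run at once during the merge; for an input of this size the number of runs could exceed the operating system's open-file limit, in which case the merge would have to be done in several passes.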
