I have 200,000 strings. I need to find the similar strings among that set.

Asked: June 16, 2026 (2026-06-16T19:31:04+00:00)

I have 200,000 strings. I need to find the similar strings among that set. I expect the number of similar strings to be very low in the set. Please help out with an efficient data structure.

I can use a simple hash if I am looking for exactly matching strings. But "similarity" is custom-defined in my case: two strings are treated as similar if 80% of their characters are the same; order does not matter.

I don't want to call the similarity function ~(200k*100k) times. Any suggestions, such as techniques to preprocess the strings or efficient data structures, are welcome. Thanks.

1 Answer

Editorial Team · Answer 1 · 2026-06-16T19:31:06+00:00

I learnt that a distance ratio >= 0.85 is possible only if the lengths of the two strings differ by <= 3. That means we can group the strings by length and compare a string only against strings whose length differs from its own by <= 3.

This drastically reduced the number of strings in each group, so the overall number of comparisons dropped to slightly less than 50% (of 200k*100k) in my data set.
Moreover, dividing the data set into multiple small sets makes parallel processing possible, which further reduces the overall runtime.

The reduction percentage will vary with the sample data set; the worst case happens when all the strings have lengths within 3 of each other, because then no comparison can be skipped.
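The length-grouping idea above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the code I actually ran: `is_similar` is a hypothetical stand-in for the custom similarity (shared characters, ignoring order, divided by the longer length), the 0.8 threshold comes from the question, and the length bound of 3 is the value observed above. The bound has to be re-derived for whatever similarity measure and threshold you really use.

```python
from collections import Counter, defaultdict

MAX_LEN_DIFF = 3   # assumption: the length bound observed in this answer
THRESHOLD = 0.8    # assumption: the 80% figure from the question


def is_similar(a, b, threshold=THRESHOLD):
    """Hypothetical order-insensitive similarity.

    Counts the characters the two strings share (as multisets) and
    divides by the length of the longer string.
    """
    if not a or not b:
        return a == b
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(len(a), len(b)) >= threshold


def similar_pairs(strings, max_len_diff=MAX_LEN_DIFF, similar=is_similar):
    """Return index pairs (i, j), i < j, of similar strings.

    Only strings whose lengths differ by at most max_len_diff are compared.
    """
    # Bucket the indices of the strings by string length.
    by_len = defaultdict(list)
    for i, s in enumerate(strings):
        by_len[len(s)].append(i)

    pairs = []
    for length in sorted(by_len):
        bucket = by_len[length]
        # Compare strings of the same length with each other.
        for x in range(len(bucket)):
            for y in range(x + 1, len(bucket)):
                i, j = bucket[x], bucket[y]
                if similar(strings[i], strings[j]):
                    pairs.append((i, j))
        # Compare against the slightly longer buckets only, so that each
        # pair of buckets is visited exactly once.
        for other in range(length + 1, length + max_len_diff + 1):
            for i in bucket:
                for j in by_len.get(other, ()):
                    if similar(strings[i], strings[j]):
                        pairs.append((min(i, j), max(i, j)))
    return pairs
```

Each length bucket only needs itself and the next `max_len_diff` buckets, so the outer loop can be split across worker processes if the runtime is still too high.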

[Thanks to Inbar Rose for stimulating this thought]

In my case, the histogram looked as below:

[image: histogram]
