The Archive Base

Editorial Team
Asked: June 16, 2026

I have 200,000 strings. I need to find the similar strings among that set.


I have 200,000 strings and need to find the similar strings among them. I expect the number of similar strings in the set to be very low. Please suggest an efficient data structure for this.

I could use a simple hash if I were looking for exactly matching strings. But "similarity" is custom-defined in my case: two strings are treated as similar if 80% of their characters are the same; order does not matter.
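One possible reading of this definition, sketched in Python: count the characters the two strings share as a multiset (so order is ignored) and divide by the length of the longer string. The name `char_overlap`, the choice of the longer string as the denominator, and the default threshold are assumptions made for illustration; the question does not pin these details down.

```python
from collections import Counter


def char_overlap(a: str, b: str) -> float:
    """Fraction of characters shared by a and b, ignoring order.

    Assumption: the shared count is the multiset intersection of the two
    strings, normalised by the length of the longer string.
    """
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(len(a), len(b))


def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if the order-insensitive overlap reaches the threshold."""
    return char_overlap(a, b) >= threshold
```

With this reading, `is_similar("abcde", "edcba")` holds because the strings are anagrams, while two strings with no characters in common score 0.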

I don't want to call the similarity function ~(200k*100k) times, i.e. once for every pair of strings. Any suggestions, such as techniques to preprocess the strings or efficient data structures, are welcome. Thanks.

1 Answer

  1. Editorial Team
    Added an answer on June 16, 2026 at 7:31 pm

    I learned that a distance ratio of >= 0.85 is possible only if the lengths of the two strings differ by <= 3. That means we only need to compare strings whose lengths differ by <= 3, so the strings can be grouped by length.
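A minimal sketch of this length grouping in Python. The names `candidate_pairs` and `max_len_diff` are illustrative and not from the original answer, and the window of 3 is the value reported for this data set; other thresholds or string lengths may need a different window.

```python
from collections import defaultdict
from itertools import combinations, product


def candidate_pairs(strings, max_len_diff=3):
    """Yield pairs (a, b) whose lengths differ by at most max_len_diff.

    Strings are bucketed by length; each bucket is paired with itself and
    with the buckets of slightly longer strings, so every pair is yielded once.
    """
    buckets = defaultdict(list)
    for s in strings:
        buckets[len(s)].append(s)

    for length in sorted(buckets):
        group = buckets[length]
        # Pairs of strings with exactly the same length.
        yield from combinations(group, 2)
        # Pairs with strings that are 1..max_len_diff characters longer.
        for other in range(length + 1, length + max_len_diff + 1):
            if other in buckets:
                yield from product(group, buckets[other])
```

Only the pairs produced by `candidate_pairs` would then be passed to the expensive similarity function; all other pairs are skipped without being compared.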

    This drastically reduced the number of strings in each group. In my data set, the overall number of comparisons dropped to slightly less than 50% of the original 200k*100k.
    Moreover, dividing the data set into multiple small sets makes parallel processing possible, which further reduces the overall runtime.

    The reduction percentage will vary with the data set: the worst case occurs when all the strings have lengths within 3 of each other, in which case nothing is filtered out.
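To estimate the saving for a given data set before running any comparisons, the pair counts can be computed from the length histogram alone. This is a rough sketch; `comparison_counts` and `max_len_diff` are hypothetical names introduced here for illustration.

```python
from collections import Counter


def comparison_counts(lengths, max_len_diff=3):
    """Return (windowed, total) numbers of pairwise comparisons.

    `lengths` is an iterable of string lengths. `total` counts every pair;
    `windowed` counts only pairs whose lengths differ by <= max_len_diff.
    """
    hist = Counter(lengths)
    n = sum(hist.values())
    total = n * (n - 1) // 2

    windowed = 0
    for length, count in hist.items():
        # Pairs inside the same length bucket.
        windowed += count * (count - 1) // 2
        # Pairs with buckets that are 1..max_len_diff longer.
        for other in range(length + 1, length + max_len_diff + 1):
            windowed += count * hist.get(other, 0)
    return windowed, total
```

For example, `comparison_counts(len(s) for s in strings)` returns both counts, and their ratio is the fraction of comparisons that remain. When all lengths fall within the window, the two counts are equal, which is the worst case mentioned above.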

    [Thanks to Inbar Rose for stimulating this thought]

    In my case, the histogram looked as below:

    [Image: histogram]

