The Archive Base

Editorial Team
Asked: May 14, 2026

What algorithm would you suggest to score, as a float from 0 to 1, how close two texts are to being identical?

Note that I don’t mean similar in meaning (i.e., saying the same thing in a different way). I mean the exact same words, where one of the two texts may have extra words, slightly different words, extra newlines and the like.

A good example of what I want is the algorithm Google uses to identify duplicate content on websites ("X search results very similar to the ones shown have been omitted, click here to see them").

The reason I need it is that my website lets users post comments. Similar but different pages currently have their own comments, so many users ended up copying and pasting their comments onto all the similar pages. Now I want to merge them (all similar pages will “share” the comments, so a comment posted on page A will also appear on similar page B), and I would like to programmatically erase all those copy-pasted comments from the same user.

I have several million comments, but speed shouldn’t be an issue since this is a one-time job that will run in the background.

The programming language doesn’t really matter (as long as it can interface to a MySQL database), but I was thinking of doing it in C++.


1 Answer

    Editorial Team added an answer on May 14, 2026 at 3:32 am

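    Would the Longest Common Subsequence (LCS) algorithm fit the bill? It’s basically what diff uses. A standard dynamic programming algorithm solves it in time proportional to the product of the two sequence lengths, and comparing word by word rather than character by character makes extra whitespace and newlines irrelevant. The Wikipedia page on the longest common subsequence problem has all the information you need.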
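    As a rough illustration of the dynamic programming idea, here is a minimal sketch that scores two texts by their word-level LCS. The function names `lcs_length` and `lcs_similarity` are made up for this example, and splitting on whitespace is an assumption about what should count as a "word"; the scoring formula 2 * LCS / (total words) is one reasonable choice, not the only one.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # prev[j] is the LCS length of the part of `a` processed so far and b[:j].
    # Only the previous row of the DP table is needed, so memory is O(len(b)).
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching elements extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one element from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(text_a, text_b):
    """Score in [0, 1] based on the word-level LCS of two texts.

    Assumption: words are whatever str.split() yields, so runs of spaces
    and newlines do not affect the score.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    total = len(words_a) + len(words_b)
    if total == 0:
        return 1.0  # two empty texts are treated as identical
    return 2.0 * lcs_length(words_a, words_b) / total
```

    With this scoring, two texts with the same words in the same order get 1.0 and two texts sharing no words get 0.0; extra or changed words lower the score gradually.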

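    To experiment with this kind of matching in a friendly way, you can use the Python difflib module. Its difflib.SequenceMatcher class does not compute a strict LCS (it recursively finds the longest contiguous matching blocks, in the style of Ratcliff/Obershelp), but it produces the same kind of score through its ratio method, which, according to the documentation, will: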

    Return a measure of the sequences’ similarity as a float in the range [0, 1].

    Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.
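    To make that concrete for the comment clean-up, here is a hedged sketch that compares texts word by word with `difflib.SequenceMatcher` and flags near-duplicates within one user's comments. The helper names `similarity` and `find_duplicates`, the default threshold of 0.9, and the assumption that the comments have already been loaded from the database into a list of strings are all illustrative choices, not something the question specifies.

```python
from difflib import SequenceMatcher


def similarity(text_a, text_b):
    """Return SequenceMatcher.ratio() computed over whitespace-separated words."""
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False turns off difflib's heuristic that ignores very frequent
    # elements in long sequences, so common words still count as matches.
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


def find_duplicates(comments, threshold=0.9):
    """Return indices of comments that closely match an earlier kept comment.

    `comments` is assumed to be all comments by a single user, oldest first.
    `threshold` is a hypothetical cut-off that would need tuning on real data.
    """
    kept, duplicates = [], []
    for i, text in enumerate(comments):
        if any(similarity(text, comments[k]) >= threshold for k in kept):
            duplicates.append(i)
        else:
            kept.append(i)
    return duplicates
```

    Each comment is compared against every comment kept so far, which is quadratic in the number of comments per user. That should be acceptable for a one-off background job, but the threshold is worth checking by hand on a sample before deleting anything.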
