I’m developing a documents system that, each time that a new one is created,

Question

Asked: June 6, 2026 (2026-06-06T06:24:15+00:00)

I’m developing a document system that, each time a new document is created, has to detect and discard duplicates in a database of about 500,000 records.

For now, I’m using a search engine to retrieve the 20 most similar documents and compare them with the new one we’re trying to create. The problem is that I have to check whether the new document is similar to an existing one (that’s easy with similar_text), or whether it is contained inside another text, and both checks have to allow for the text having been partly changed by the user (this is the hard part). How can I do that?

For example:

<?php

$new = "the wild lion";

$candidates = array(
  // $new is contained in this one, but 'wild' was changed to 'dangerous'; it has to be detected as a duplicate
  'the dangerous lion lives in Africa',
  'rhinoceros are native to Africa and three to southern Asia.'
);

foreach ( $candidates as $candidate ) {
  // Pseudocode: the condition below is the check I don't know how to implement
  if( $candidate is similar or $new is contained in it) {
       // Duplicate!
  }
}

Of course, in my system the documents are longer than 3 words 🙂

1 Answer

Editorial Team · Answer 1 · 2026-06-06T06:24:17+00:00

This is the temporary solution I’m using:

function contained($text1, $text2, $factor = 0.9) {
    // Split into lowercase words, dropping punctuation at the ends and around whitespace
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $words1 = preg_split($pattern, mb_strtolower($text1), -1, PREG_SPLIT_NO_EMPTY);
    $words2 = preg_split($pattern, mb_strtolower($text2), -1, PREG_SPLIT_NO_EMPTY);

    // Decide which text is the long one and which is the short one
    if (count($words1) > count($words2)) {
        $long = $words1;
        $short = $words2;
    } else {
        $long = $words2;
        $short = $words1;
    }

    // Nothing to compare if the short text has no words (also avoids a division by zero)
    if (count($short) === 0) {
        return false;
    }

    // Count the words of the short text that also appear in the long one
    $count = 0;
    foreach ($short as $word) {
        if (in_array($word, $long)) {
            $count++;
        }
    }

    // Duplicate if the share of matching words exceeds the threshold
    return ($count / count($short)) > $factor;
}
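
To see how the threshold behaves on the example from the question, here is a minimal sketch that returns the overlap ratio itself instead of a boolean. The names `tokenize` and `overlapRatio` are hypothetical, introduced only for this illustration; the splitting pattern is the one used above, and the `array_flip` lookup is a swapped-in optimisation over `in_array`, not part of the original solution.

```php
<?php
// Hypothetical helper: lowercase the text and split it into words,
// using the same punctuation/whitespace pattern as contained().
function tokenize(string $text): array {
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $words = preg_split($pattern, mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? [] : $words;
}

// Hypothetical helper: fraction of the shorter text's words
// that also occur somewhere in the longer text.
function overlapRatio(string $a, string $b): float {
    $wa = tokenize($a);
    $wb = tokenize($b);
    [$short, $long] = count($wa) <= count($wb) ? [$wa, $wb] : [$wb, $wa];
    if (count($short) === 0) {
        return 0.0; // nothing to compare
    }
    $lookup = array_flip($long); // word => position, for fast membership checks
    $hits = 0;
    foreach ($short as $word) {
        if (isset($lookup[$word])) {
            $hits++;
        }
    }
    return $hits / count($short);
}

$new = 'the wild lion';
$candidates = [
    'the dangerous lion lives in Africa',
    'rhinoceros are native to Africa and three to southern Asia.',
];
foreach ($candidates as $candidate) {
    printf("%.2f  %s\n", overlapRatio($new, $candidate), $candidate);
}
```

For the three-word example, two of the three words match the first candidate, so the ratio is about 0.67, which is below the default factor of 0.9. On very short texts a single changed word moves the ratio a lot, so the threshold probably needs tuning against real documents.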
