I’m developing a document system that, each time a new document is created, has to detect and discard duplicates in a database of about 500,000 records.
For now, I’m using a search engine to retrieve the 20 most similar documents and compare them with the new one we’re trying to create. The problem is that I have to check whether the new document is similar to a candidate (that’s easy with similar_text), or whether it’s contained inside the candidate’s text. Both checks have to allow for the text having been partly changed by the user (this is the hard part). How can I do that?
For example:
<?php
$new = "the wild lion";
$candidates = array(
    // $new is contained in this one, but 'wild' has been changed to 'dangerous'; it has to be detected as a duplicate
    'the dangerous lion lives in Africa',
    'rhinoceros are native to Africa and three to southern Asia.'
);
foreach ($candidates as $candidate) {
    // Pseudocode: this condition is the part I don't know how to implement
    if ($candidate is similar or $new is contained in it) {
        // Duplicate!
    }
}
Of course, in my system the documents are longer than 3 words 🙂
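To make the "contained, but partly changed" requirement concrete, here is a minimal sketch of one possible check, not a tested solution: slide a window the size of $new over the words of each candidate and measure how many of $new's words appear in the best window. The function name bestWindowOverlap and the 0.6 threshold are assumptions chosen for illustration only.

```php
<?php
// Hypothetical helper: returns the best fraction (0.0 to 1.0) of $new's
// distinct words that appear in any window of $candidate with the same
// word count as $new.
function bestWindowOverlap(string $new, string $candidate): float
{
    // Split on anything that is not a letter or digit; lowercasing is ASCII-only.
    $needle = array_unique(preg_split('/[^\p{L}\p{N}]+/u', strtolower($new), -1, PREG_SPLIT_NO_EMPTY));
    $hay    = preg_split('/[^\p{L}\p{N}]+/u', strtolower($candidate), -1, PREG_SPLIT_NO_EMPTY);

    $size = count(preg_split('/[^\p{L}\p{N}]+/u', $new, -1, PREG_SPLIT_NO_EMPTY));
    if (count($needle) === 0 || count($hay) === 0) {
        return 0.0;
    }

    $best = 0.0;
    $last = max(0, count($hay) - $size); // last window start position
    for ($i = 0; $i <= $last; $i++) {
        $window = array_slice($hay, $i, $size);
        $shared = count(array_intersect($needle, $window));
        $best   = max($best, $shared / count($needle));
    }
    return $best;
}

$new = "the wild lion";
$candidates = array(
    'the dangerous lion lives in Africa',
    'rhinoceros are native to Africa and three to southern Asia.'
);
$threshold = 0.6; // assumed cut-off, would need tuning on real documents

foreach ($candidates as $candidate) {
    if (bestWindowOverlap($new, $candidate) >= $threshold) {
        echo "Duplicate: $candidate\n";
    }
}
```

With the example above, the first candidate shares two of three words in its best window and is flagged, while the second shares none. This ignores word order and inserted or deleted words, so it only illustrates the requirement.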
This is the temporary solution I’m using: