Is there a fast algorithm for finding the longest common substring of two strings, or is it an NP-complete problem?
In PHP I can find a needle in a haystack:
<?php if (strstr('there is a needle in a haystack', 'needle') !== false) { echo "found<br>\n"; } ?>
I guess I could do this in a loop over every substring of one of the strings, but that would be very expensive! Especially since my application of this is to search a database of emails and look for spam (i.e. similar emails sent by the same person).
Does anyone have any PHP code they can throw out there?
I have since found a relevant Wikipedia article. It is not an NP-complete problem: it can be done in O(mn) time using a dynamic programming algorithm, where m and n are the lengths of the two strings.
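For anyone who wants to see the dynamic programming idea in PHP, here is a minimal sketch. The function name longestCommonSubstring is my own (it is not a PHP built-in), and it compares bytes, so it assumes single-byte text. It keeps only the previous row of the table, so memory is O(n) while time stays O(mn).

```php
<?php
// Hypothetical helper (the name is illustrative, not a PHP built-in).
// Returns the longest substring that appears in both $a and $b.
// Assumption: byte-wise comparison, so multibyte text is not handled.
function longestCommonSubstring(string $a, string $b): string
{
    $m = strlen($a);
    $n = strlen($b);
    $best = 0;    // length of the longest common substring found so far
    $endInA = 0;  // offset in $a just past the end of that substring

    // $prev[$j] = length of the common suffix of $a[0..$i-2] and $b[0..$j-1]
    $prev = array_fill(0, $n + 1, 0);

    for ($i = 1; $i <= $m; $i++) {
        $curr = array_fill(0, $n + 1, 0);
        for ($j = 1; $j <= $n; $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                // Extend the match that ended at the previous characters.
                $curr[$j] = $prev[$j - 1] + 1;
                if ($curr[$j] > $best) {
                    $best = $curr[$j];
                    $endInA = $i;
                }
            }
        }
        $prev = $curr;
    }

    return substr($a, $endInA - $best, $best);
}

echo longestCommonSubstring('abcdef', 'zcdez'), "\n"; // prints "cde"
```

When several common substrings share the maximum length, this sketch returns the one that ends first in the first string.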
In PHP I found the similar_text function very useful. Here’s a code sample that retrieves a series of text emails, loops through them, and finds the ones that are 90% similar to each other. Note: something like this is NOT scalable, since it compares the emails against each other pair by pair: