I have a large database of sentences, and a problem where sentences like “i’m good” do not match “im good” (and vice versa), or “is that mine?” does not match “is that mine” (and vice versa), when I want them to be detected as a match.
I have written complicated, messy functions trying to do this with wildcards, and done a lot of research, but it is just a big mess. I'm sure there must be a way to search with this one-character leeway. If I can, I would like to control which characters get this leeway; in my examples the main culprits are the question mark and the apostrophe (? ‘).
I'm currently using a plain SELECT query with PHP and MySQL to do the matching.
I would love some help figuring this out so I can clean up the big mess of code that is currently doing the job inconsistently.
In case anyone wants to see the code, the query that checks for matches looks like this:
$checkqwry = "select * from `eng-jap` where (eng = '$eng' or english = '$oldeng' or english = '$oldeng2') and (jap = '$jap' or japanese = '$oldjap' or japanese = '$oldjap2');";
The purpose of the query is just to check whether there is already a translation with $eng and $jap in the DB. The reason you see $oldeng, $oldeng2, $oldeng3 and so on is, like I said, my messy attempts to match whether or not there is a question mark and so on: some of the $oldeng variables have question marks or apostrophes and the others don't. There is more code above that appends and removes question marks and the like. Yes, it's a big mess.
You want to use a string metric algorithm, as mentioned above. PHP has two built in: levenshtein() (http://php.net/manual/en/function.levenshtein.php) and similar_text() (http://www.php.net/manual/en/function.similar-text.php).
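As a rough illustration of the levenshtein() approach, here is a minimal sketch. The helper name isNearMatch and the default threshold of one edit are my own assumptions, not anything from the question's code. Note that levenshtein() counts bytes, not characters, so a curly apostrophe (three bytes in UTF-8) would count as three edits unless it is mapped to a plain one first; Japanese text has the same problem and would need a multibyte-aware comparison. Before PHP 8.0 the function also rejects strings longer than 255 characters.

```php
<?php
// Hypothetical helper: true when the two sentences differ by at most
// $maxEdits single-byte insertions, deletions, or substitutions.
function isNearMatch(string $a, string $b, int $maxEdits = 1): bool
{
    // levenshtein() is byte-based, so map the multi-byte curly apostrophe
    // (U+2019) to a plain ASCII one before comparing.
    $a = str_replace("\u{2019}", "'", strtolower(trim($a)));
    $b = str_replace("\u{2019}", "'", strtolower(trim($b)));

    return levenshtein($a, $b) <= $maxEdits;
}

var_dump(isNearMatch("i'm good", "im good"));             // bool(true)
var_dump(isNearMatch("is that mine?", "is that mine"));   // bool(true)
var_dump(isNearMatch("is that mine?", "is that yours?")); // bool(false)
```

The catch is that this runs in PHP, so you would still need to pull candidate rows from MySQL first (for example with a loose LIKE on the first few words) and then filter them with the helper.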
MySQL doesn’t implement this specific algorithm natively, but some people have gone ahead and written stored procedures that accomplish the same thing: http://www.artfulsoftware.com/infotree/queries.php#552
In my opinion, using a string metric that can handle arbitrary changes is better than stripping out punctuation, because it can also catch omissions, transpositions, etc.
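To make that comparison concrete, here is a small sketch that puts the two approaches side by side. The stripIgnored helper and the sample sentences are made up for illustration. One thing to keep in mind: plain Levenshtein counts a transposition as two edits, so the threshold has to allow for that.

```php
<?php
// Hypothetical helper: remove only the characters that should be ignored
// (question mark, ASCII apostrophe, curly apostrophe U+2019).
function stripIgnored(string $s): string
{
    return str_replace(["?", "'", "\u{2019}"], "", strtolower($s));
}

$stored     = "is that mine?";
$transposed = "is taht mine?"; // "ha" typed as "ah"
$omitted    = "is tht mine?";  // the "a" was left out

// Stripping punctuation does not help with typos inside a word.
var_dump(stripIgnored($stored) === stripIgnored($transposed)); // bool(false)

// The edit distance still sees both as close to the stored sentence.
var_dump(levenshtein($stored, $transposed)); // int(2)
var_dump(levenshtein($stored, $omitted));    // int(1)
```

If you only ever want to ignore a fixed set of characters, stripping them is simpler and easier to index; the string metric earns its keep once typos matter too.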