I am looking for an algorithm, preferably in Python, that would help me locate the substrings, N characters long, of existing strings that are closest to a target string N characters long.
Consider a target string that is, say, 4 characters long:
targetString -> '1111'
Assume this is the string I have available (I will generate substrings of it for “best alignment” matching):
nonEmptySubStrings -> ['110101']
Substrings of the above that are 4 characters long:
nGramsSubStrings -> ['0101', '1010', '1101']
I want to write/use a “Magic Function” that would select the substring(s) closest to targetString:
someMagicFunction -> ['1101']
Some more examples:
nonEmptySubStrings -> ['101011']
nGramsSubStrings -> ['0101', '1010', '1011']
someMagicFunction -> ['1011']
nonEmptySubStrings -> ['10101']
nGramsSubStrings -> ['0101', '1010']
someMagicFunction -> ['0101', '1010']
Is this “Magic Function” a well-known substring problem?
What I really want is the minimum number of changes to nonEmptySubStrings so that it contains targetString as a substring.
Based on the OP’s comment on the question, this is what is desired
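A minimal sketch of such a function, assuming a "change" means replacing a single character, so the distance between two equal-length strings is the Hamming distance. The names `hamming` and `min_changes` are made up for illustration and are not from the original answer.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_changes(source, target):
    """Smallest Hamming distance between target and any window of
    len(target) characters taken from source."""
    n = len(target)
    if n == 0 or len(source) < n:
        raise ValueError("source must be at least as long as a non-empty target")
    return min(hamming(source[i:i + n], target)
               for i in range(len(source) - n + 1))


print(min_changes("110101", "1111"))  # 1 ('1101' needs one change)
print(min_changes("10101", "1111"))   # 2 (both windows need two changes)
```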
This will return the minimum edit distance of any substring to the target string. It will not indicate which substring that is or what its index is. It could easily be modified to do so, though.
The naive way, which can also be the best way, is to compare every N-character substring against the target and keep the closest one.
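A minimal sketch of that naive scan, again assuming the distance is the number of differing characters (Hamming distance). The function name `closest_substrings` is hypothetical; it returns every distinct window that ties for the smallest distance, sorted to match the lists in the question.

```python
def closest_substrings(source, target):
    """Return the distinct len(target)-character substrings of source
    that differ from target in the fewest positions."""
    n = len(target)
    if n == 0 or len(source) < n:
        raise ValueError("source must be at least as long as a non-empty target")

    def distance(s):
        # Hamming distance: count of positions where s and target differ.
        return sum(a != b for a, b in zip(s, target))

    # Collect each window once; duplicates would only repeat work.
    windows = {source[i:i + n] for i in range(len(source) - n + 1)}
    best = min(distance(w) for w in windows)
    return sorted(w for w in windows if distance(w) == best)


print(closest_substrings("110101", "1111"))  # ['1101']
print(closest_substrings("101011", "1111"))  # ['1011']
print(closest_substrings("10101", "1111"))   # ['0101', '1010']
```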
This won’t return the index at which the substring occurs, though. Of course, you didn’t specify that you need it in your question 😉
If you want to do better than this, it will depend on how you’re measuring the distance, and will basically boil down to skipping some substrings by inferring that you would have to change at least x chars to get a better match than the one you already have. At that point, you might as well account for those x changes by jumping ahead x chars.
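As one illustration of that kind of skipping, here is a sketch for a special case only: Hamming distance with a target made of a single repeated character, such as the '1111' in the question. Under that assumption, sliding the window by one position changes the distance by at most 1, so a window at distance d cannot lead to a match better than the current best b in fewer than d - b + 1 shifts. The function name `min_changes_uniform` is hypothetical, and the shortcut does not hold for arbitrary targets or other distance measures.

```python
def min_changes_uniform(source, target):
    """Minimum Hamming distance from any window of source to target,
    ASSUMING target consists of one repeated character."""
    n = len(target)
    if n == 0 or len(source) < n:
        raise ValueError("source must be at least as long as a non-empty target")
    if len(set(target)) != 1:
        raise ValueError("this shortcut assumes the target repeats one character")

    c = target[0]
    best = n + 1               # larger than any possible distance
    last = len(source) - n     # index of the final window
    i = 0
    while i <= last:
        # Distance of the current window: characters that are not c.
        d = sum(ch != c for ch in source[i:i + n])
        if d < best:
            best = d
            if best == 0:
                break          # exact match, nothing can beat it
            i += 1
        else:
            # Each shift lowers the distance by at most 1, so the next
            # d - best positions cannot yield a strictly better match.
            i += d - best + 1
    return best


print(min_changes_uniform("110101", "1111"))      # 1
print(min_changes_uniform("1100001111", "1111"))  # 0
```

The skipped windows are still counted character by character here; keeping a running count as the window slides would be a natural refinement.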