I have run into the problem of matching a string within OCR-recognized text and finding its position, given an arbitrary tolerance for wrong, missing, or extra characters. The result should be the position of the best match, possibly (but not necessarily) with the length of the matching substring.
For example:
String: 9912, 1.What is your name?
Substring: 1. What is your name?
Tolerance: 1
Result: match on character 7
String: Where is our caat if any?
Substring: your cat
Tolerance: 2
Result: match on character 10
String: Tolerance is t0o h1gh.
Substring: Tolerance is too high;
Tolerance: 1
Result: no match
I have tried to adapt the Levenshtein algorithm, but it doesn’t work properly for substrings and doesn’t return a position.
An algorithm in Delphi would be preferred, but any implementation or pseudocode would do.
Here’s a recursive implementation that works, but might not be fast enough. The worst-case scenario is when no match can be found and all but the last character of “What” get matched at every index in “Where”. In that case the algorithm makes Length(What)-1 + Tolerance comparisons for each character in “Where”, plus one recursive call per unit of Tolerance. Since both Tolerance and the length of “What” are constants, I’d say the algorithm is O(n). Its performance will degrade linearly with the length of both “What” and “Where”.
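As a rough illustration of the recursive idea, here is a minimal sketch in Python rather than Delphi. It is not the original listing: the names `match_at` and `find_fuzzy`, the 1-based position, and the order in which the three kinds of error are tried are all my own assumptions. It scans left to right and returns the first position that fits within the tolerance, not necessarily the best one.

```python
from typing import Optional, Tuple


def match_at(what: str, where: str, wi: int, pos: int, tolerance: int) -> Optional[int]:
    """Try to match what[wi:] against `where` starting at index `pos`.

    Returns the number of characters of `where` consumed, or None if the
    remaining tolerance is not enough. (Hypothetical helper, for illustration.)
    """
    start = pos
    while wi < len(what):
        if pos < len(where) and what[wi] == where[pos]:
            wi += 1
            pos += 1
            continue
        if tolerance == 0:
            return None
        # Spend one error here and try each kind of OCR mistake in turn:
        # (1, 1) wrong character, (1, 0) character missing from `where`,
        # (0, 1) extra character in `where`.
        for step_what, step_where in ((1, 1), (1, 0), (0, 1)):
            if step_where and pos >= len(where):
                continue  # nothing left in `where` to replace or skip
            rest = match_at(what, where, wi + step_what, pos + step_where, tolerance - 1)
            if rest is not None:
                return (pos - start) + step_where + rest
        return None
    return pos - start


def find_fuzzy(what: str, where: str, tolerance: int) -> Optional[Tuple[int, int]]:
    """Return (1-based position, matched length) of the first fit, or None."""
    for start in range(len(where)):
        length = match_at(what, where, 0, start, tolerance)
        if length is not None:
            return start + 1, length
    return None


print(find_fuzzy("1. What is your name?", "9912, 1.What is your name?", 1))  # (7, 20)
print(find_fuzzy("Tolerance is too high;", "Tolerance is t0o h1gh.", 1))     # None
```

Because this sketch stops at the first fit, it reports position 9 (length 8) for the “your cat” example: “ our caa” is already within two errors of the pattern. Getting position 10 instead would need an extra tie-breaking rule, which this sketch does not attempt.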
I’ve used the following code to test the function:
For case:
it shows a match on character 9, of length 6. For the other two examples it gives the expected result.