Currently I’m trying to enhance my search algorithm.
For better understanding, here’s the current logic behind it:
We have objects with n keywords attached to each of them. In the database this is modeled with two tables (Object, Keyword), where the Keyword table has a foreign key to Object. When I build my search trees, I create a line value from all keywords of an object (i.e. remove umlauts, convert to lower case, …). The same conversion routine (NormalizeSearchPattern()) is applied to the search patterns. I only support AND-search, and only keywords with a minimum length of 2 characters!
The search-algorithm is currently a variant of fast-reverse-search (this example is not optimized):
bool IsMatch(string source, string searchPattern)
{
    // example:
    // source: "hello world"
    // searchPattern: "hello you freaky funky world"
    // patterns[]: { "hello", "you", "freaky", "funky", "world" }
    searchPattern = NormalizeSearchPattern(searchPattern);
    var patterns = MagicMethodToSplitPatternIntoPatterns(searchPattern);
    foreach (var pattern in patterns)
    {
        var success = false;
        var patternLength = pattern.Length;
        var firstChar = pattern[0];
        var secondChar = pattern[1];
        // start at the last position where the pattern still fits and walk backwards
        var lengthDifference = source.Length - patternLength;
        while (lengthDifference >= 0)
        {
            // the post-decrement already moves on to the next candidate,
            // so the position just tested is lengthDifference + 1 from here on
            if (source[lengthDifference--] != firstChar)
            {
                continue;
            }
            if (source[lengthDifference + 2] != secondChar)
            {
                continue;
            }
            // first two characters match, compare the rest of the pattern
            var l = lengthDifference + 3;
            var m = 2;
            while (m < patternLength)
            {
                if (source[l] != pattern[m])
                {
                    break;
                }
                l++;
                m++;
            }
            if (m == patternLength)
            {
                success = true;
            }
        }
        // AND-search: a single missing pattern rejects the source
        if (!success)
        {
            return false;
        }
    }
    return true;
}
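For checking the hand-rolled loop against something simpler, a baseline with the same AND semantics could look like the sketch below. It assumes both strings are already normalized; SplitPatterns is only a hypothetical stand-in for MagicMethodToSplitPatternIntoPatterns(), and the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;

static class AndSearchSketch
{
    // Hypothetical stand-in for MagicMethodToSplitPatternIntoPatterns():
    // splits an already-normalized pattern on spaces and drops empty entries.
    public static string[] SplitPatterns(string normalizedPattern) =>
        normalizedPattern.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Baseline AND-search: every pattern has to occur somewhere in source.
    // Ordinal comparison is enough because both sides are assumed normalized.
    public static bool IsMatchBaseline(string source, string normalizedPattern) =>
        SplitPatterns(normalizedPattern)
            .All(p => source.IndexOf(p, StringComparison.Ordinal) >= 0);
}
```

With this, IsMatchBaseline("hello world", "hello world") is true, while the pattern from the comment in IsMatch ("hello you freaky funky world") is rejected, because "you", "freaky" and "funky" do not occur in the source.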
Normalization is done with (this example is not optimized):
string RemoveTooShortKeywords(string keywords)
{
    while (Regex.IsMatch(keywords, TooShortKeywordPattern, RegexOptions.Compiled | RegexOptions.Singleline))
    {
        keywords = Regex.Replace(keywords, TooShortKeywordPattern, " ", RegexOptions.Compiled | RegexOptions.Singleline);
    }
    return keywords;
}
string RemoveNonAlphaDigits(string value)
{
    value = value.ToLower();
    value = value.Replace("ä", "ae");
    value = value.Replace("ö", "oe");
    value = value.Replace("ü", "ue");
    value = value.Replace("ß", "ss");
    return Regex.Replace(value, "[^a-z 0-9]", " ", RegexOptions.Compiled | RegexOptions.Singleline);
}
string NormalizeSearchPattern(string searchPattern)
{
    var resultNonAlphaDigits = RemoveNonAlphaDigits(searchPattern);
    var resultTrimmed = RemoveTooShortKeywords(resultNonAlphaDigits);
    return resultTrimmed;
}
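TooShortKeywordPattern is not shown here. As a rough sketch, and assuming that "too short" simply means a single character of [a-z0-9] standing alone, the removal could be done in one pass with lookarounds instead of the IsMatch/Replace loop. The class name, the LoneChar field and the Tokens helper are hypothetical and only there to make the example self-contained.

```csharp
using System;
using System.Text.RegularExpressions;

static class ShortKeywordSketch
{
    // Assumption: a too-short keyword is one lone character of [a-z0-9].
    // The lookarounds do not consume the neighbouring spaces, so adjacent
    // short keywords such as "a b c" are all caught in a single pass.
    static readonly Regex LoneChar =
        new Regex("(?<![a-z0-9])[a-z0-9](?![a-z0-9])", RegexOptions.Compiled);

    public static string RemoveTooShortKeywords(string keywords) =>
        LoneChar.Replace(keywords, " ");

    // Helper for inspecting the result: splits on spaces, drops empty entries.
    public static string[] Tokens(string normalized) =>
        normalized.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
}
```

For the input "a b cd e fg 7" only the tokens "cd" and "fg" remain after RemoveTooShortKeywords.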
So this is pretty straightforward, and it’s obvious that I can only cope with the variants of source and searchPattern that I’ve implemented in NormalizeSearchPattern() (as mentioned above: umlauts, case differences, …).
But how should I enhance the algorithm (or NormalizeSearchPattern()) so that it becomes insensitive to:
- singular/plural
- mistyping (e.g. “hauserr” <-> “hauser”)
- …
Just so you know more about the design:
This app is written in C#. It stores the search trees and objects in a static variable (to query the database only once at init), and the performance has to be outstanding (currently 500,000 line values are queried in less than 300 ms).
You might also be interested in a Trigram and Bigram search matching algorithm:
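To give an idea of what trigram matching means for the “hauserr” <-> “hauser” case, here is a minimal sketch, not a tuned implementation. It assumes Jaccard similarity over character trigram sets and pads each word with two leading spaces and one trailing space; both choices, as well as all names, are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

static class TrigramSketch
{
    // Collects the distinct character trigrams of a word. The padding
    // (an assumption) lets the start and end of the word contribute too.
    public static HashSet<string> Trigrams(string word)
    {
        var padded = "  " + word + " ";
        var set = new HashSet<string>();
        for (var i = 0; i + 3 <= padded.Length; i++)
        {
            set.Add(padded.Substring(i, 3));
        }
        return set;
    }

    // Jaccard similarity: shared trigrams divided by all distinct trigrams.
    public static double Similarity(string a, string b)
    {
        var ta = Trigrams(a);
        var tb = Trigrams(b);
        var shared = 0;
        foreach (var t in ta)
        {
            if (tb.Contains(t)) shared++;
        }
        var union = ta.Count + tb.Count - shared;
        return (double)shared / union;
    }
}
```

“hauserr” and “hauser” share 6 of 9 distinct trigrams here, so they score about 0.67, while “hauser” and “garten” share none and score 0. A threshold on that score would decide what still counts as a match.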