Hey, I’m using Levenshteins algorithm to get distance between source and target string.
also I have method which returns value from 0 to 1:
/// <summary> /// Gets the similarity between two strings. /// All relation scores are in the [0, 1] range, /// which means that if the score gets a maximum value (equal to 1) /// then the two string are absolutely similar /// </summary> /// <param name='string1'>The string1.</param> /// <param name='string2'>The string2.</param> /// <returns></returns> public static float CalculateSimilarity(String s1, String s2) { if ((s1 == null) || (s2 == null)) return 0.0f; float dis = LevenshteinDistance.Compute(s1, s2); float maxLen = s1.Length; if (maxLen < s2.Length) maxLen = s2.Length; if (maxLen == 0.0F) return 1.0F; else return 1.0F - dis / maxLen; }
but this for me is not enough. Because I need more complex way to match two sentences.
For example I want automatically tag some music, I have original song names, and i have songs with trash, like super, quality, years like 2007, 2008, etc..etc.. also some files have just http://trash..thash..song_name_mp3.mp3, other are normal. I want to create an algorithm which will work just more perfect than mine now.. Maybe anyone can help me?
here is my current algo:
/// <summary> /// if we need to ignore this target. /// </summary> /// <param name='targetString'>The target string.</param> /// <returns></returns> private bool doIgnore(String targetString) { if ((targetString != null) && (targetString != String.Empty)) { for (int i = 0; i < ignoreWordsList.Length; ++i) { //* if we found ignore word or target string matching some some special cases like years (Regex). if (targetString == ignoreWordsList[i] || (isMatchInSpecialCases(targetString))) return true; } } return false; } /// <summary> /// Removes the duplicates. /// </summary> /// <param name='list'>The list.</param> private void removeDuplicates(List<String> list) { if ((list != null) && (list.Count > 0)) { for (int i = 0; i < list.Count - 1; ++i) { if (list[i] == list[i + 1]) { list.RemoveAt(i); --i; } } } } /// <summary> /// Does the fuzzy match. /// </summary> /// <param name='targetTitle'>The target title.</param> /// <returns></returns> private TitleMatchResult doFuzzyMatch(String targetTitle) { TitleMatchResult matchResult = null; if (targetTitle != null && targetTitle != String.Empty) { try { //* change target title (string) to lower case. targetTitle = targetTitle.ToLower(); //* scores, we will select higher score at the end. Dictionary<Title, float> scores = new Dictionary<Title, float>(); //* do split special chars: '-', ' ', '.', ',', '?', '/', ':', ';', '%', '(', ')', '#', '\'', '\'', '!', '|', '^', '*', '[', ']', '{', '}', '=', '!', '+', '_' List<String> targetKeywords = new List<string>(targetTitle.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries)); //* remove all trash from keywords, like super, quality, etc.. targetKeywords.RemoveAll(delegate(String x) { return doIgnore(x); }); //* sort keywords. targetKeywords.Sort(); //* remove some duplicates. removeDuplicates(targetKeywords); //* go through all original titles. foreach (Title sourceTitle in titles) { float tempScore = 0f; //* split orig. title to keywords list. List<String> sourceKeywords = new List<string>(sourceTitle.Name.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries)); sourceKeywords.Sort(); removeDuplicates(sourceKeywords); //* go through all source ttl keywords. foreach (String keyw1 in sourceKeywords) { float max = float.MinValue; foreach (String keyw2 in targetKeywords) { float currentScore = StringMatching.StringMatching.CalculateSimilarity(keyw1.ToLower(), keyw2); if (currentScore > max) { max = currentScore; } } tempScore += max; } //* calculate average score. float averageScore = (tempScore / Math.Max(targetKeywords.Count, sourceKeywords.Count)); //* if average score is bigger than minimal score and target title is not in this source title ignore list. if (averageScore >= minimalScore && !sourceTitle.doIgnore(targetTitle)) { //* add score. scores.Add(sourceTitle, averageScore); } } //* choose biggest score. float maxi = float.MinValue; foreach (KeyValuePair<Title, float> kvp in scores) { if (kvp.Value > maxi) { maxi = kvp.Value; matchResult = new TitleMatchResult(maxi, kvp.Key, MatchTechnique.FuzzyLogic); } } } catch { } } //* return result. return matchResult; }
This works normally but just in some cases, a lot of titles which should match, does not match… I think I need some kind of formula to play with weights and etc, but i can’t think of one..
Ideas? Suggestions? Algos?
by the way I already know this topic (My colleague already posted it but we cannot come with a proper solution for this problem.): Approximate string matching algorithms
Your problem here may be distinguishing between noise words and useful data:
You may need to produce a dictionary of noise words to ignore. That seems clunky, but I’m not sure there’s an algorithm that can distinguish between band/album names and noise.