I have a message made up of closely matching strings, and I want to remove the near-duplicates from it. By "close" I mean that if two strings match over 80% of their total length, one of them should be treated as a duplicate and removed.
The Distinct() method from System.Linq, or a similar strategy like the one I have implemented below, won't work, because even a single non-matching character makes two strings distinct.
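string[] masg = { "Hello World", "Hello World One", "Hello-World", "How are you", "How are u" };
// Distinct() only removes exact duplicates, so all five strings survive here.
var distinctStr = masg.Distinct();
// Join into a separate string; a string cannot be assigned back to the string[] masg.
string result = string.Join("~", distinctStr);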
Desired Output
Hello World~How are you
How can I do this? Please suggest ideas or concepts that I should read up on. Thanks.
The first thing you need to do is define a distance between two strings, for example the Levenshtein distance. After that, go through the strings one by one, adding each to a set as long as the set does not already contain a string whose distance to it is below your chosen threshold.
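Here is a minimal sketch of that idea, not a definitive implementation. The class and method names (NearDuplicateFilter, Levenshtein, Similarity, RemoveNearDuplicates) are made up for illustration. It also assumes that "matching up to N%" means one minus the Levenshtein distance divided by the length of the longer string; other normalizations are possible and will give different results.

```csharp
using System;
using System.Collections.Generic;

static class NearDuplicateFilter
{
    // Levenshtein distance via dynamic programming, keeping only two rows.
    // Insertions, deletions and substitutions each cost 1.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev); // the row just filled becomes "prev"
        }
        return prev[b.Length];
    }

    // Assumed similarity measure: 1.0 for identical strings, 0.0 for nothing in common.
    public static double Similarity(string a, string b)
    {
        int maxLen = Math.Max(a.Length, b.Length);
        return maxLen == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / maxLen;
    }

    // Keeps a string only if no previously kept string is at least `threshold` similar.
    public static List<string> RemoveNearDuplicates(IEnumerable<string> items, double threshold)
    {
        var kept = new List<string>();
        foreach (string s in items)
        {
            bool isDuplicate = false;
            foreach (string k in kept)
            {
                if (Similarity(s, k) >= threshold) { isDuplicate = true; break; }
            }
            if (!isDuplicate) kept.Add(s);
        }
        return kept;
    }
}
```

With the array from the question, string.Join("~", NearDuplicateFilter.RemoveNearDuplicates(masg, 0.7)) gives Hello World~How are you. Under this particular measure, "Hello World One" is only about 73% similar to "Hello World" (distance 4 over length 15), so a strict 0.8 threshold keeps it as well. The comparison is case-sensitive and the result depends on input order, since the first string of each near-duplicate group is the one that survives.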