I have an array of news headlines (just strings) that I have retrieved from multiple news sources (some of which my company pays for). The headlines are often similar but do not match word for word. I would like to bucket them the way Google News does.
Is there an algorithm for this? I can use Ruby or Python for this script.
Thanks!
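For Ruby, look at the text gem, specifically its Levenshtein distance, which counts the single-character insertions, deletions, and substitutions needed to turn one string into another. Headlines whose distance is small relative to their length are candidates for the same bucket.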
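Since Python is also an option, here is a rough sketch of the same idea in plain Python with no third-party libraries. The function names (levenshtein, similarity, bucket_headlines) and the 0.7 threshold are my own assumptions for illustration, not part of any library, and the greedy grouping is just one simple way to turn pairwise distances into buckets.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def bucket_headlines(headlines, threshold=0.7):
    """Greedy bucketing: join the first bucket whose first headline is
    similar enough, otherwise start a new bucket.
    The threshold is an assumed starting point, not a tuned value."""
    buckets = []
    for headline in headlines:
        for bucket in buckets:
            if similarity(headline, bucket[0]) >= threshold:
                bucket.append(headline)
                break
        else:
            buckets.append([headline])
    return buckets


if __name__ == "__main__":
    sample = [
        "Apple unveils new iPhone",
        "Apple unveils the new iPhone",
        "Stocks fall on rate fears",
    ]
    for group in bucket_headlines(sample):
        print(group)
```

Each headline is compared only with the first member of each bucket, so the result depends on input order, and the cost grows with the number of headlines times the number of buckets. Character-level distance also misses headlines that reword the same story, so you will probably need to experiment with the threshold on your own data.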