So I’m building a collaborative filtering system using the Java API of Weka’s machine learning library.
I use the StringToWordVector filter to convert string objects into their word-occurrence decomposition.
Now I’m using the kNN algorithm to find the nearest neighbors of a target object.
My question is: what distance function should I use to compute the distance between two objects that have been filtered by the StringToWordVector filter? Which one would be most effective for this scenario?
The available options in Weka are:
AbstractStringDistanceFunction, ChebyshevDistance, EditDistance, EuclideanDistance, ManhattanDistance, NormalizableDistance
Yes, similarity metrics are good times. The short answer is that you should try them all and optimize with respect to RMSE, MAE, breadth of the return set, etc.
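To make "try them all" a bit more concrete, here is a minimal sketch of the two error measures mentioned above. It assumes you have already run kNN with one distance function and collected predicted and actual ratings into two arrays. The class and method names are hypothetical and are not part of Weka.

```java
// Hypothetical helper (not part of Weka): error metrics for comparing kNN runs
// that differ only in the distance function used.
final class PredictionError {

    // Root mean squared error: penalizes large misses more heavily than small ones.
    static double rmse(double[] predicted, double[] actual) {
        checkSameLength(predicted, actual);
        double sumSquared = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sumSquared += diff * diff;
        }
        return Math.sqrt(sumSquared / predicted.length);
    }

    // Mean absolute error: the average size of a miss, in rating units.
    static double mae(double[] predicted, double[] actual) {
        checkSameLength(predicted, actual);
        double sumAbsolute = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            sumAbsolute += Math.abs(predicted[i] - actual[i]);
        }
        return sumAbsolute / predicted.length;
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }
}
```

Compute both numbers once per candidate distance function on the same held-out data, and keep the function with the lowest error.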
There seems to be a distinction between EditDistance and the rest of these metrics: I would expect an EditDistance algorithm to work on the strings themselves, not on the numeric word-occurrence vectors your filter produces.
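To illustrate why edit distance is the odd one out, here is a rough sketch of it, assuming the usual Levenshtein definition (minimum number of single-character insertions, deletions, and substitutions). It is a hypothetical stand-alone helper, not Weka's EditDistance class. Note that its inputs are raw strings rather than word-count vectors.

```java
// Hypothetical helper (not Weka's EditDistance class): plain Levenshtein distance.
final class LevenshteinSketch {

    // Returns the minimum number of single-character edits turning a into b.
    static int distance(String a, String b) {
        // prev holds the distances for the previous row of the DP table.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + substitution);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }
}
```

Once your text has gone through StringToWordVector there are no strings left to compare this way, which is why the vector-based distances are the natural candidates.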
How does your StringToWordVector work? Answer this question first, and then use that answer to fuel thoughts like: what do I want similarity between two words to mean in my application (does semantic meaning outweigh word length, for instance)?
And as long as you’re using StringToWordVector, it would seem you’re free to consider more mainstream similarity metrics like LogLikelihood, Pearson, and Cosine. I think this is worth doing as, to my knowledge, none of the similarity metrics you’ve listed is widely used or studied seriously in the literature.
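As a rough sketch of one of those metrics, here is cosine similarity next to Euclidean distance on plain word-count arrays. It assumes each object has already been turned into a dense double[] of equal length, one slot per word. The class and method names are hypothetical and this is not Weka code.

```java
// Hypothetical helper (not part of Weka): compares two word-count vectors.
final class VectorSimilarity {

    // Cosine similarity: 1.0 for vectors pointing the same way, 0.0 for no shared words.
    // Returns 0.0 if either vector is all zeros, which is a convention chosen for this sketch.
    static double cosine(double[] a, double[] b) {
        checkSameLength(a, b);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Euclidean distance: sensitive to how many words each object contains.
    static double euclidean(double[] a, double[] b) {
        checkSameLength(a, b);
        double sumSquared = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumSquared += diff * diff;
        }
        return Math.sqrt(sumSquared);
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
    }
}
```

One design point worth noticing: a text and the same text repeated twice have a cosine similarity of 1 but a nonzero Euclidean distance, so cosine ignores length while Euclidean does not. Which behavior you want depends on what similarity should mean in your application.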
May the similarity be with you!