After finishing two CS classes, I’ve started working on a personal project in Java. I am writing a program that will look through a music collection and attempt to set the ‘Composer’ tag by looking at the filename and meta tags. I am comparing these against a composer list I have created as a simple text file. My question is this:
What is a good method for comparing two strings to find the best match? For example, in my case suppose I have a file called ‘Pulenc – Gloria in excelsis Deo.flac’. In my composer list I have ‘Poulenc, Francis’. I want to be able to read ‘Pulenc’ and see that it is very close to ‘Poulenc’, so that I can set the composer tag correctly. A friend suggested I look into using Cosine Distance (which I’d never heard of before), and another recommended Levenshtein Distance. Is either of these a good approach, or are there other methods that may work better?
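It sounds like the Levenshtein Distance is exactly what you need. It counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, so ‘Pulenc’ and ‘Poulenc’ are at distance 1 (one inserted ‘o’). The Cosine Distance is usually applied to longer texts, where each text is represented as a vector of word or term counts, so it is a poor fit for comparing two short names. Phonetic algorithms like Soundex will probably yield poor results for composer names, most of which are not meant to be pronounced using English pronunciation rules.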
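To make that concrete, here is a minimal sketch of how the lookup could work in Java. It is only an illustration, not a finished solution: the class name `ComposerMatcher` and the method `bestMatch` are made up for this example, and it assumes every line in your composer list has the form ‘Surname, Given name’ as in ‘Poulenc, Francis’, so only the part before the comma is compared.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; the names are not from any library.
class ComposerMatcher {

    // Classic dynamic-programming Levenshtein distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty string into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i characters of a into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap the rows so that curr becomes prev for the next iteration.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the composer entry whose surname is closest to the candidate,
    // or null if the list is empty. Assumes entries look like "Surname, Given".
    static String bestMatch(String candidate, List<String> composers) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        String needle = candidate.trim().toLowerCase();

        for (String composer : composers) {
            String surname = composer.split(",")[0].trim().toLowerCase();
            int distance = levenshtein(needle, surname);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = composer;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> composers = Arrays.asList(
                "Poulenc, Francis", "Puccini, Giacomo", "Purcell, Henry");
        System.out.println(bestMatch("Pulenc", composers)); // prints "Poulenc, Francis"
    }
}
```

In a real run you would probably also want a cutoff, so that a filename with no composer in it is not forced onto whichever name happens to be nearest. What threshold works (an absolute distance, or a distance relative to the name length) is something you would have to tune against your own collection.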