I am currently trying to build a small system that reads in a bunch of file names (at the moment, only a few hundred) and then allows the user to search them. The end goal is to find duplicates: files that will not necessarily have exactly the same name, but will share common words. I would eventually like to add a feature that suggests possible duplicates as well.
Currently I add each file path to an ArrayList, and then pass each word of the file name to a Hashtable which uses chaining. All non-alphanumeric characters are converted into white space, and the words are then created using String.split(). This part works fine, and you can search for single words, no worries.
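For reference, the setup described above might look roughly like the sketch below. The class name FileNameIndex and its methods are made up for illustration, it uses java.util.HashMap in place of the hand-rolled chained Hashtable, and lower-casing the words and keeping only ASCII letters and digits are assumptions rather than something stated above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a list of paths plus a word -> path-id index.
class FileNameIndex {
    private final List<String> paths = new ArrayList<>();
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Replace runs of non-alphanumeric characters with a space, then split.
    public static List<String> tokenize(String fileName) {
        List<String> words = new ArrayList<>();
        String cleaned = fileName.toLowerCase().replaceAll("[^a-z0-9]+", " ").trim();
        if (cleaned.isEmpty()) {
            return words;
        }
        for (String word : cleaned.split(" ")) {
            words.add(word);
        }
        return words;
    }

    // Store the path and index every word of its file name (the last path segment).
    public void add(String path) {
        int id = paths.size();
        paths.add(path);
        int lastSeparator = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String name = path.substring(lastSeparator + 1);
        for (String word : tokenize(name)) {
            index.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(id);
        }
    }

    // Exact single-word lookup; returns the matching paths in insertion order.
    public List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        for (int id : index.getOrDefault(word.toLowerCase(), Collections.emptySet())) {
            hits.add(paths.get(id));
        }
        return hits;
    }
}
```

With this layout, a file named "My Best-File.txt" is found by searching for "best", while a file named "mybestfile.txt" is only found by searching for "mybestfile".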
I know the theory behind searching for multiple terms: collect the results for each term, then build a basic relevance score from how many times each document is selected.
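A minimal sketch of that counting idea, assuming the index is available as a map from word to the set of documents containing it. The class name RelevanceSearch, the method rank, and the alphabetical tie-break are illustrative choices, not part of the program described here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: rank documents by how many distinct query terms hit them.
class RelevanceSearch {
    public static List<String> rank(Map<String, Set<String>> index, String query) {
        // Split the query the same way file names are split, dropping repeated terms.
        Set<String> terms = new LinkedHashSet<>();
        for (String term : query.toLowerCase().split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }

        // Count one hit per (term, document) pair.
        Map<String, Integer> hits = new HashMap<>();
        for (String term : terms) {
            for (String doc : index.getOrDefault(term, Collections.emptySet())) {
                hits.merge(doc, 1, Integer::sum);
            }
        }

        // Highest hit count first; ties broken alphabetically for a stable order.
        List<String> ranked = new ArrayList<>(hits.keySet());
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(hits.get(b), hits.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ranked;
    }
}
```

Documents that match none of the terms never enter the hit map, so they are left out of the result entirely.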
My current issue is with file names like ‘mybestfile’. My program can only handle them as a single word, so unless you search for ‘mybestfile’ you will find nothing.
Can anyone suggest a design path that I should head down from here? I know I could parse in an entire dictionary and then try to pull words out by matching substrings, but to be honest, this is just meant to be a simplistic program and I’d rather avoid that kind of thing.
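For context, the dictionary approach mentioned above amounts to something like the following sketch. It assumes the dictionary is already loaded into a Set of lower-case words; the class name WordSplitter and the tiny word list in the usage note are invented for illustration.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: split a run-together name into dictionary words.
class WordSplitter {
    // Returns one valid split of text into dictionary words, or an empty list if none exists.
    public static List<String> split(String text, Set<String> dictionary) {
        int n = text.length();
        // prev[end] = start index of the last word in a valid split of text[0..end), or -1.
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0; // the empty prefix is trivially splittable
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (prev[start] != -1 && dictionary.contains(text.substring(start, end))) {
                    prev[end] = start;
                    break;
                }
            }
        }

        LinkedList<String> words = new LinkedList<>();
        if (prev[n] == -1) {
            return words; // no full split exists
        }
        // Walk back from the end, collecting words in reverse.
        for (int end = n; end > 0; end = prev[end]) {
            words.addFirst(text.substring(prev[end], end));
        }
        return words;
    }
}
```

With a dictionary containing "my", "best", "file" and "be", splitting "mybestfile" yields my, best, file. The result is only as good as the word list, and a name can have more than one valid split, which is part of why this gets complicated quickly.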
Any help would be appreciated!!
(Also, the point of this is half learning, half proving I can do it, so I would like to know of solutions that already exist, but more for how they work than to use them instead.)
You could start by playing with the various “sounds like” and distance algorithms available in the language package of Apache Commons Codec. (I think the distance algorithm is in Commons Lang, not Codec.)
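Since you want to see how these work rather than just call them, here is a minimal hand-written sketch of one common distance algorithm, Levenshtein edit distance. It is plain Java and not the code from either library; the class and method names are made up.

```java
// Hypothetical sketch: Levenshtein distance with two rolling rows.
class EditDistance {
    // Number of single-character inserts, deletes and substitutions needed to turn a into b.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty string into the first j characters of b takes j inserts.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, levenshtein("kitten", "sitting") is 3. To suggest possible duplicates you could compare indexed words or whole file names pairwise and flag pairs whose distance falls below some threshold; a few hundred names is small enough for that brute-force comparison.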
SimMetrics is another. Can’t actually find the one I’m looking for, but here’s a list, too.