I have a list of strings (company names, in this case), and a Java program that extracts a list of things that look like company names from mostly unstructured text. I need to match each piece of extracted text to a string in the list. Caveat: the unstructured text has typos, “Blah, Inc.” gets referred to as just “Blah,” and so on. I’ve tried Levenshtein edit distance, but that fails for predictable reasons. Are there known best practices for tackling this problem? Or am I back to manual data entry?
This is not a simple problem, and there are entire companies built around trying to solve it (even for reduced matching sets like company names versus the general case).
If you can identify a small, finite set of patterns that valid company names fall into, and that noise does not fall into, then you could tackle this with a series of regular expression matches.
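As a rough illustration of the regular-expression approach, here is a minimal Java sketch. The class name, the method name, and both patterns are hypothetical, made up for this example; the real patterns would have to come from looking at your own data.

```java
import java.util.List;
import java.util.regex.Pattern;

class CompanyNamePatterns {
    // Hypothetical patterns, assumed for illustration only; tune them to your data.
    private static final List<Pattern> PATTERNS = List.of(
        // Capitalized words, an optional comma, then a legal suffix: "Blah, Inc."
        Pattern.compile("(?:[A-Z][\\w&'-]*\\s+)*[A-Z][\\w&'-]*,?\\s+(?:Inc|Corp|Co|Ltd|LLC)\\.?"),
        // Two groups of capitalized words joined by an ampersand: "Smith & Jones"
        Pattern.compile("[A-Z][\\w'-]*(?:\\s+[A-Z][\\w'-]*)*\\s+&\\s+[A-Z][\\w'-]*(?:\\s+[A-Z][\\w'-]*)*")
    );

    // Returns true if the whole candidate string matches any of the patterns.
    static boolean looksLikeCompanyName(String candidate) {
        String trimmed = candidate.trim();
        for (Pattern p : PATTERNS) {
            if (p.matcher(trimmed).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With these particular patterns, "Blah, Inc." and "Acme Widgets LLC" are accepted, while a bare "Blah" is rejected. That is exactly the kind of case from the question that patterns alone will not catch.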
If the patterns are too difficult to pin down or too numerous, then you could try developing a probabilistic model, perhaps something like a Bayesian network. You would take one subset of your data for training, and perhaps a second subset for a quick validation, and grow the network from there. Other techniques might include genetic programming or setting up a neural network. This approach is not lightweight, and you’d probably want to weigh your needs carefully before going down this road.
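Whichever model you pick, the training/validation split itself is simple. Here is a minimal Java sketch; the class name, the method name, and the idea of passing a fraction and a seed are assumptions for illustration, and the model is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TrainValidationSplit {
    // Shuffles a copy of the labeled examples and cuts it in two.
    // Element 0 of the result is the training set, element 1 the validation set.
    static <T> List<List<T>> split(List<T> labeled, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(labeled);
        // A fixed seed keeps the split reproducible between runs.
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        return List.of(
            new ArrayList<>(shuffled.subList(0, cut)),
            new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }
}
```

You would fit the model on the first list only and measure how well it matches names on the second, so that the score reflects text the model has not already seen.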