I have a list of strings (company names, in this case), and a Java

Question


Asked: May 15, 2026


I have a list of strings (company names, in this case), and a Java program that extracts things that look like company names from mostly unstructured text. I need to match each piece of extracted text to a string in the list. Caveat: the unstructured text contains typos, and names are often abbreviated, e.g. “Blah, Inc.” referred to as “Blah”. I’ve tried Levenshtein edit distance, but that fails for predictable reasons. Are there known best-practice approaches to this problem? Or am I back to manual data entry?


1 Answer

Editorial Team · Answer 1 · 2026-05-15T04:51:57+00:00

This is not a simple problem, and there are entire companies built around trying to solve it (even for restricted domains such as company names, rather than the general case).

If you can identify a small, finite set of patterns that valid company names fall into, and that noise does not fall into, then you could tackle this with a series of regular expression matches.
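As a rough illustration of the regular-expression route, here is a minimal Java sketch that strips trailing legal suffixes and punctuation before looking a name up, so that “Blah” and “Blah, Inc.” land on the same key. The class name, the method names, and the suffix list are all hypothetical and would need tuning for your data. Note that this only handles the suffix/punctuation variation; it does nothing for typos.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper: normalizes names with regexes, then does an exact lookup.
class CompanyNameNormalizer {
    // Assumed suffix list (not exhaustive). Matches one or more suffixes at the
    // end of the string, e.g. ", Inc." or " Co., Ltd."
    private static final Pattern TRAILING_SUFFIXES = Pattern.compile(
        "(?:[\\s,]+(?:incorporated|inc|corporation|corp|company|co|llc|ltd|limited)\\.?)+\\s*$");
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-z0-9]+");

    // Lowercase, drop trailing legal suffixes, collapse punctuation to spaces.
    static String normalize(String name) {
        String s = name.toLowerCase(Locale.ROOT);
        s = TRAILING_SUFFIXES.matcher(s).replaceFirst("");
        s = NON_ALNUM.matcher(s).replaceAll(" ");
        return s.trim();
    }

    // Map each normalized form back to the canonical name from the list.
    static Map<String, String> buildIndex(List<String> canonicalNames) {
        Map<String, String> index = new HashMap<>();
        for (String name : canonicalNames) {
            index.putIfAbsent(normalize(name), name);
        }
        return index;
    }

    // Returns the canonical name if the normalized forms agree exactly.
    static Optional<String> match(String extracted, Map<String, String> index) {
        return Optional.ofNullable(index.get(normalize(extracted)));
    }

    public static void main(String[] args) {
        Map<String, String> index = buildIndex(List.of("Blah, Inc.", "Acme Widgets LLC"));
        System.out.println(match("Blah", index));
        System.out.println(match("acme widgets", index));
        System.out.println(match("Unknown Co.", index));
    }
}
```

Anything that still fails the exact lookup after normalization could then be passed to a fuzzier comparison, or to manual review.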

If the patterns are hard to pin down or too numerous, then you could try developing a probabilistic model, perhaps something like a Bayesian network. You would take a subset of your data for training, and perhaps a second subset for a quick validation, and grow the network. Techniques might include genetic programming or setting up a neural network. This approach is obviously not lightweight, and you’d probably want to weigh how much you really need it before going down this road.
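To give a feel for the train-then-validate workflow, here is a deliberately small sketch. It uses a naive Bayes classifier over word tokens, which is a much simpler stand-in for a Bayesian network, not the same thing. The class name, the method names, and the toy training strings are all made up for illustration; a real model would need far more labeled data and better features.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: naive Bayes deciding "company name" vs. "noise".
class NameNoiseClassifier {
    private static final int NOISE = 0;
    private static final int NAME = 1;

    // Per-token counts, indexed by class: counts[NOISE], counts[NAME].
    private final Map<String, int[]> tokenCounts = new HashMap<>();
    private final int[] tokenTotals = new int[2];
    private final int[] exampleCounts = new int[2];

    // Lowercase and split on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Add one labeled example from the training subset.
    void train(String text, boolean isName) {
        int cls = isName ? NAME : NOISE;
        exampleCounts[cls]++;
        for (String t : tokens(text)) {
            tokenCounts.computeIfAbsent(t, k -> new int[2])[cls]++;
            tokenTotals[cls]++;
        }
    }

    // Log prior plus log likelihoods, with add-one smoothing. The extra +1 in
    // the denominator reserves a slot for tokens never seen in training.
    private double logScore(String text, int cls) {
        int totalExamples = exampleCounts[NOISE] + exampleCounts[NAME];
        int vocabSize = tokenCounts.size();
        double score = Math.log((exampleCounts[cls] + 1.0) / (totalExamples + 2.0));
        for (String t : tokens(text)) {
            int[] counts = tokenCounts.get(t);
            int seen = (counts == null) ? 0 : counts[cls];
            score += Math.log((seen + 1.0) / (tokenTotals[cls] + vocabSize + 1.0));
        }
        return score;
    }

    boolean looksLikeName(String text) {
        return logScore(text, NAME) > logScore(text, NOISE);
    }

    // Fraction of a held-out validation subset that is classified correctly.
    double accuracy(Map<String, Boolean> labeled) {
        int correct = 0;
        for (Map.Entry<String, Boolean> e : labeled.entrySet()) {
            if (looksLikeName(e.getKey()) == e.getValue()) correct++;
        }
        return labeled.isEmpty() ? 0.0 : (double) correct / labeled.size();
    }

    public static void main(String[] args) {
        NameNoiseClassifier c = new NameNoiseClassifier();
        c.train("Acme Widgets Inc", true);
        c.train("Blah Corp", true);
        c.train("Globex Inc", true);
        c.train("see page two", false);
        c.train("the quarterly report", false);
        c.train("page three of the report", false);
        System.out.println(c.looksLikeName("Initech Inc"));
        System.out.println(c.looksLikeName("the report"));
    }
}
```

The `accuracy` method is where the second subset comes in: you train on one set of labeled strings and check the score on strings the model has not seen.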
