I need a java lib that will compare 2 different texts with some similarities and tell me if they’re related or not.
For example, I would compare one of these
a) “COMP 150.00 MG X 20.00 UN”
b) “COMP 150.00 MG X 60.00 UN”
with this one
c) “150 mg comp.rec.x 20”
and the lib should tell me that the first one corresponds and the second doesn’t, because a) and c) both refer to a medicine that comes as “150 mg capsules in a package of 20 units”, while b) refers to a 60-unit pack.
Another thought I had was about regular expressions, but I’m not quite into them so that’s why I’m asking for your help.
Thanks in advance.
If the text variants are always structured in the same way, regular expressions could be one way to solve this. Basically you’d check each text against a set of expressions and see whether they match or not. Depending on how much the variants differ, the expressions could be simple or might need to be more complex.
For the case above, the first expression could look like this:
    COMP 150.00 MG X 20.00 UN -> (identifier) (capsule weight) X (num units)

From this the following expression could be derived:

    ^COMP (\d+(?:\.\d+)?) MG X (\d+(?:\.\d+)?) UN$

(This assumes that the number of spaces is always the same and that you always use `MG` and `UN`.)

The second expression:

    150 mg comp.rec.x 20 -> (capsule weight) comp.rec.x (num units)

The following expression could be derived:

    ^(\d+(?:\.\d+)?) mg comp\.rec\.x (\d+(?:\.\d+)?)$

You’ll see that both expressions contain the following part twice:

    (\d+(?:\.\d+)?)

Those parts capture numbers into a group and allow you to then parse that text into a `Double`, for example.

Here’s a short breakdown of that sub-expression:
- `( ... )` is a capturing group, i.e. you can access the part that matches that group directly
- `\d+` means one or more digits
- `\.` is the literal dot
- `(?: ... )` is a non-capturing group, i.e. you can apply quantifiers but can’t access the matched parts directly

From the above parts you get the following:

- `(?:\.\d+)?` means at most one dot followed by at least one digit. This would match `.123` but not `.1.2.3` or `1.`
- `(\d+(?:\.\d+)?)` means at least one digit, optionally followed by a dot which is followed by at least one more digit. This would match `1.23`, `12.3` or `123` but not `1.`, `.2` or `1.2.3`

If you have those expressions, apply the correct one to the text (if you know which one applies, otherwise test each in turn) and extract both groups. Then compare the values of those groups.
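The "match, extract both groups, compare" step could be sketched roughly as follows. This is only a minimal illustration under the assumptions stated above (exactly these two formats, single spaces, fixed `MG`/`UN` and `comp.rec.x` wording). The names `PackMatcher`, `parse` and `sameProduct` are made up for this example and don't come from any library.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PackMatcher {
    // Format 1, e.g. "COMP 150.00 MG X 20.00 UN"
    private static final Pattern FORMAT_A =
        Pattern.compile("^COMP (\\d+(?:\\.\\d+)?) MG X (\\d+(?:\\.\\d+)?) UN$");
    // Format 2, e.g. "150 mg comp.rec.x 20"
    private static final Pattern FORMAT_B =
        Pattern.compile("^(\\d+(?:\\.\\d+)?) mg comp\\.rec\\.x (\\d+(?:\\.\\d+)?)$");

    /** Returns {weight, units}, or empty if neither known format matches. */
    static Optional<double[]> parse(String text) {
        for (Pattern p : new Pattern[] {FORMAT_A, FORMAT_B}) {
            Matcher m = p.matcher(text.trim());
            if (m.matches()) {
                return Optional.of(new double[] {
                    Double.parseDouble(m.group(1)),   // capsule weight
                    Double.parseDouble(m.group(2))}); // number of units
            }
        }
        return Optional.empty();
    }

    /** True only if both texts parse and agree on weight and unit count. */
    static boolean sameProduct(String left, String right) {
        Optional<double[]> a = parse(left);
        Optional<double[]> b = parse(right);
        return a.isPresent() && b.isPresent() && Arrays.equals(a.get(), b.get());
    }

    public static void main(String[] args) {
        String c = "150 mg comp.rec.x 20";
        System.out.println(sameProduct("COMP 150.00 MG X 20.00 UN", c)); // true
        System.out.println(sameProduct("COMP 150.00 MG X 60.00 UN", c)); // false
    }
}
```

Parsing both groups into `double` means `150.00` and `150` compare as equal, which a plain string comparison would not give you. Texts that match neither format are treated as unrelated here; you may prefer to report them separately.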
Note: don’t forget that in Java strings you have to escape backslashes, thus `\d` would be written as `"\\d"` etc.
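To illustrate the escaping, here is a small sketch that writes the number sub-expression as a Java string literal and tries it on the sample values from the breakdown. The class and method names (`NumberPatternDemo`, `isNumber`) are invented for this example.

```java
import java.util.regex.Pattern;

class NumberPatternDemo {
    // The regex \d+(?:\.\d+)? as a Java string literal:
    // every backslash is doubled.
    static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");

    static boolean isNumber(String s) {
        // matches() requires the whole string to match, so no ^ and $ needed
        return NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"1.23", "12.3", "123", "1.", ".2", "1.2.3"};
        for (String s : samples) {
            System.out.println(s + " -> " + isNumber(s));
        }
    }
}
```

The first three samples are accepted and the last three are rejected, which is the behaviour described for the sub-expression.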