I am using the Levenshtein edit distance to measure how similar two strings are. The first string is the longer of the two, if they differ in length at all; it is the untruncated, unmodified string that I want to compare the other one to. The second string may be truncated at the end and may be missing characters. There can be many distinct string ones and string twos.
I read in the list of second strings; each one is on its own line in this format:
“[string two] – $0.00”. So it is string two plus a space, a dash, a space, and then a price.
So I have a list of second strings (in that format) and two options: remove the price and the “ – ”, or keep them.
- If I remove it: I read in each string two and tokenize it with the delimiter “$”. I do not know how long any string two is, so I must call stringtwo.replaceAll(“-”, “”) to get rid of the dash and then .trim() to remove the whitespace. But if string two itself contains a dash, that dash is removed as well, involuntarily. So with this I get one of four outcomes: exact strings (Levenshtein = 0); truncated but otherwise exact strings (the strings match up to the length of string one minus the Levenshtein distance); strings that are truncated and missing an integer number of dashes (the strings match in stretches between the dashes and are also missing the end); or strings that are not truncated but are missing an integer number of dashes.
- If I leave it: I still read in each string two and tokenize it with the delimiter “$”. Now string two has the format “[string two] – ”, so every Levenshtein distance will be off by 3. The problem is that if string one is, e.g., “dog food is yummy” and the string two I compare it to is “dog food is yum – ”, then levD = 3, but that is the same levD I get for the string two “dog food is yummy – ”.
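To make that tie concrete, here is a minimal sketch of the textbook dynamic-programming Levenshtein distance applied to the two example lines. The class and method names are made up for illustration, and it assumes the separator is a plain ASCII “-” (a typographic dash gives the same distances).

```java
final class LevenshteinDemo {
    // Textbook edit distance (insert, delete, substitute all cost 1),
    // computed with two rolling rows instead of a full matrix.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String one = "dog food is yummy";
        System.out.println(distance(one, "dog food is yum - "));   // 3
        System.out.println(distance(one, "dog food is yummy - ")); // 3
    }
}
```

Both candidates score 3, so with the “ – ” left in place the distance alone cannot separate the truncated line from the exact one.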
As you can see, both options cause problems, and I cannot seem to overcome them in my program when I try to match the input list of string twos to my list of string ones.
Can anyone see a better way of doing this? Are there any other string comparators that I could use to make this less problematic?
Try this: truncate each string at the last “-” found in it, keeping the rest of the string intact.
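A minimal sketch of that idea, assuming each line looks like “[string two] - $0.00” with a plain ASCII “-” and that the price itself never contains a dash. The class and method names here are hypothetical.

```java
final class PriceSuffixTrimmer {
    // Cuts the line at the LAST "-" and trims the leftover whitespace,
    // so dashes inside string two itself are left alone.
    // A line with no dash at all is returned trimmed but otherwise unchanged.
    static String stripPrice(String line) {
        int cut = line.lastIndexOf('-');
        String head = cut < 0 ? line : line.substring(0, cut);
        return head.trim();
    }

    public static void main(String[] args) {
        System.out.println(stripPrice("dog food is yum - $0.00"));
        System.out.println(stripPrice("dog-food is yummy - $12.50"));
    }
}
```

Because only the last dash is treated as the separator, the result can go straight into the Levenshtein comparison without the constant offset and without losing dashes that belong to string two.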
These String manipulations can be expensive, so if you are working with a lot of strings you might look into other optimizations.
Also, this solution is brittle because it hardcodes into the code the value that determines where to trim. That value can be defined elsewhere and passed in so it can vary.
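One way to do that, sketched under the assumption that the separator is handed in when the trimmer is created (the class name and constructor are hypothetical, not part of any library):

```java
final class DelimiterTrimmer {
    private final String separator;

    // The separator is supplied by the caller (config file, constant, CLI arg)
    // instead of being hardcoded inside the trimming logic.
    DelimiterTrimmer(String separator) {
        if (separator == null || separator.isEmpty()) {
            throw new IllegalArgumentException("separator must be non-empty");
        }
        this.separator = separator;
    }

    // Returns everything before the last occurrence of the separator, trimmed.
    // Lines that do not contain the separator are returned trimmed.
    String strip(String line) {
        int cut = line.lastIndexOf(separator);
        String head = cut < 0 ? line : line.substring(0, cut);
        return head.trim();
    }

    public static void main(String[] args) {
        DelimiterTrimmer trimmer = new DelimiterTrimmer(" - ");
        System.out.println(trimmer.strip("dog-food is yum - $0.00"));
    }
}
```

If the input file turns out to use a typographic dash, only the constructor argument has to change.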
Once you have that working reasonably well and safely, look into StringUtils from Apache Commons Lang, which has more extensive String manipulations (for example, StringUtils.substringBeforeLast).