I have a String that contains formatted decimal values such as 45,890.00, 1.5v, 2,12g, etc. Additionally, it contains special characters, HTML entities (escaped and unescaped), UTF-8 encoded characters, etc., all in one line. While I’ve managed to clean up the entities, I’m still struggling to come up with a way to make sure that splitting on spaces or punctuation doesn’t split a number that is delimited by a comma or period.
Example String:
> String original_str =
> "a,b;c.d+e-f/g\\h*i~j=k?l$m 1.5 1,5 1.5v 1,5v 1255,456.78 & 6<7 & 6>5 ق für; {AGB's;} ([für]); ";
Expected output:
a
b
c
etc
1.5
1,5
1.5v
1,5v
1255,456.78
6<7
6>5
ق
für
AGB's
für
Number formats can be: x.x OR xxx,xxxx.xxxx,xxxx separated by COMMA | DOT | MIXED
After cleaning the entities out of the String, I try to split it on a list of punctuation characters and spaces, but how do I keep decimal-like tokens (1,5, 1.5v, 22,33.66, etc.) intact while still splitting on commas and periods?
Use a regex with the pattern
That will split on any period or comma that doesn’t have a digit on both sides, and on any other punctuation that isn’t a period or comma. The third section between the pipes covers any whitespace. The last part is based on a negative lookahead, which is discussed in this answer; it stops the commas and periods that we protected inside numbers from being matched again as ordinary punctuation.
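The approach described above can be sketched roughly as follows. This is a minimal illustration, not the pattern from the answer: the class name `NumberAwareSplitter`, the method `split`, and the regex itself are assumptions. It also assumes that `<`, `>` and `'` should not act as delimiters, since the expected output keeps `6<7` and `AGB's` intact, and that digits means what `\d` matches by default in Java (ASCII 0-9).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class NumberAwareSplitter {
    // Hypothetical delimiter pattern (an assumption, not the original answer's regex).
    private static final Pattern DELIMITERS = Pattern.compile(
        "(?:"
        + "(?<!\\d)[.,]"            // period or comma with no digit directly before it
        + "|[.,](?!\\d)"            // period or comma with no digit directly after it
        + "|\\s"                    // any whitespace
        + "|(?![.,<>'])\\p{Punct}"  // other ASCII punctuation, except . , < > '
        + ")+");                    // a run of delimiters counts as one split point

    static List<String> split(String input) {
        // Drop empty tokens, e.g. when the input starts with a delimiter.
        return Arrays.stream(DELIMITERS.split(input))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // \u0642 is the Arabic letter from the question, \u00fc is "ü";
        // escapes are used only to keep this source file ASCII-safe.
        String original = "a,b;c.d+e-f/g\\h*i~j=k?l$m 1.5 1,5 1.5v 1,5v 1255,456.78"
                + " & 6<7 & 6>5 \u0642 f\u00fcr; {AGB's;} ([f\u00fcr]); ";
        split(original).forEach(System.out::println);
    }
}
```

A period or comma is left alone only when it sits between two digits, so `1255,456.78` and `1,5v` survive while `a,b` and `c.d` are split. If the set of characters to keep needs to change, the character class inside the negative lookahead is the place to adjust.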