I’m using a framework which returns malformed Strings with “empty” characters from time to time.
“foobar” for example is represented by:
[,f,o,o,b,a,r]
The first character is NOT a whitespace (‘ ‘), so a System.out.println() would print “foobar” and not “ foobar”. Yet the length of the String is 7 instead of 6. Obviously this makes most String methods (equals, split, substring, ...) useless. Is there a way to remove empty characters from a String?
I tried to build a new String like this:
    StringBuilder sb = new StringBuilder();
    for (final char character : malformedString.toCharArray()) {
        if (Character.isDefined(character)) {
            sb.append(character);
        }
    }
    sb.toString();
Unfortunately this doesn’t work. Same with the following code:
    StringBuilder sb = new StringBuilder();
    for (final Character character : malformedString.toCharArray()) {
        if (character != null) {
            sb.append(character);
        }
    }
    sb.toString();
I also can’t check for an empty character like this:
    if (character == '') {
        //
    }
Obviously there is something wrong with the String, but I can’t change the framework I’m using or wait for them to fix it (if it is a bug within their framework). I need to handle this String and sanitize it.
Any ideas?
It’s probably the NULL character, which is represented by \0. You can get rid of it with String#trim(). To nail down the exact code point, print the numeric value of each character in the string.
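A minimal sketch of such a dump follows. The class and method names (`CodePointDump`, `describe`) are hypothetical, and the example input with a leading U+FEFF is only an assumption for illustration; pass in the actual malformed String instead.

```java
final class CodePointDump {

    // Hypothetical helper: lists every code point of the string as U+XXXX,
    // separated by single spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumed example input; replace it with the String from the framework.
        System.out.println(describe("\uFEFFfoobar"));
    }
}
```

The first entry of the output is the invisible character; the remaining six are the letters of "foobar".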
Then you can find the exact character here.
Update: as per the update:
You can do that with the help of a regex. See the answer of @polygenelubricants here and this answer.
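As a rough sketch of the regex approach, assuming the offending character turns out to be U+FEFF (the byte order mark) at the start of the String: the helper below removes it only in that position. The names `BomStripper` and `stripLeadingBom` are made up for this example. Note that String#trim() does not help here, because it only strips characters up to U+0020.

```java
final class BomStripper {

    // Removes U+FEFF only when it is the very first character of the string.
    static String stripLeadingBom(String s) {
        return s.replaceFirst("^\uFEFF", "");
    }

    public static void main(String[] args) {
        String malformed = "\uFEFFfoobar";      // assumed example input
        String clean = stripLeadingBom(malformed);
        System.out.println(malformed.length()); // 7
        System.out.println(clean.length());     // 6
        System.out.println(clean.equals("foobar"));
    }
}
```

If the character can also show up in the middle of a String, `s.replace("\uFEFF", "")` would remove every occurrence without any regex.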
On the other hand, you can also fix the problem at its root instead of working around it. Either update the files to get rid of the BOM, which is a legacy way to distinguish UTF-8 files from others and is nowadays worthless, or use a Reader which recognizes and skips the BOM. Also see this question.
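One possible shape for such a Reader, as a sketch only: it assumes the input is UTF-8 and that you control how the stream is opened, which may not be the case if the framework does the reading. The names `BomSkippingReaderDemo`, `openSkippingBom` and `readUtf8SkippingBom` are hypothetical. It relies on the fact that Java's UTF-8 decoder passes a leading BOM through as U+FEFF, so the wrapper reads one character and pushes it back unless it is the BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BomSkippingReaderDemo {

    // Wraps the stream in a UTF-8 reader and drops a leading U+FEFF if present.
    static Reader openSkippingBom(InputStream in) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        if (first != -1 && first != '\uFEFF') {
            reader.unread(first); // not a BOM: put the character back
        }
        return reader;
    }

    // Convenience wrapper for the demo: reads the whole stream into a String.
    static String readUtf8SkippingBom(InputStream in) {
        try (Reader reader = openSkippingBom(in)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed example: the bytes EF BB BF followed by "foobar".
        byte[] bytes = "\uFEFFfoobar".getBytes(StandardCharsets.UTF_8);
        String text = readUtf8SkippingBom(new ByteArrayInputStream(bytes));
        System.out.println(text.length()); // 6
    }
}
```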