I need to strip a few invalid characters out of a string, and wrote the following code as part of a StringUtil library:
public static String removeBlockedCharacters(String data) {
    if (data == null) {
        return data;
    }
    return data.replaceAll("(?i)[<|>|\u003C|\u003E]", "");
}
I have a test file, illegalCharacters.txt, with one line in it:
hello \u003c here < and > there
I run the following unit test:
@Test
public void testBlockedCharactersRemoval() throws IOException {
    checkEquals(StringUtil.removeBlockedCharacters("a < b > c\u003e\u003E\u003c\u003C"), "a b c");
    log.info("Procesing from string directly: " + StringUtil.removeBlockedCharacters("hello \u003c here < and > there"));
    log.info("Procesing from file to string: " + StringUtil.removeBlockedCharacters(FileUtils.readFileToString(new File("src/test/resources/illegalCharacters.txt"))));
}
I get:
INFO - 2010-09-14 13:37:36,111 - TestStringUtil.testBlockedCharactersRemoval(36) | Procesing from string directly: hello here and there
INFO - 2010-09-14 13:37:36,126 - TestStringUtil.testBlockedCharactersRemoval(37) | Procesing from file to string: hello \u003c here and there
I am VERY confused: as you can see, the code properly strips out '<', '>', and '\u003c' if I pass a string literal containing these values, but it fails to strip out '\u003c' if I read the same string from a file.
My questions, so that I stop losing hair over it, are:
- Why do I get this behavior?
- How can I change my code to properly strip \u003c in all cases?
Thanks
When you compile your source file, the very first thing that happens, before any lexing or parsing, is that the Unicode escapes \u003C and \u003E get converted to the actual characters, < and >. So your code is really:
return data.replaceAll("(?i)[<|>|<|>]", "");
When you compile the test against the string literal, the same thing happens; the test string that you wrote as:
"hello \u003c here < and > there"
…is really:
"hello < here < and > there"
But when you read the test string from a file, no such conversion occurs; you end up trying to match the six-character sequence \u003c with the single character <. If you really want to match \u003C and \u003E as text, your code should look like this:
return data.replaceAll("(?i)(<|>|\\\\u003C|\\\\u003E)", "");
- If you use one backslash, the Java compiler interprets it as a Unicode escape and converts it to < or >.
- If you use two backslashes, the regex compiler interprets it as a Unicode escape and thinks you want to match a < or >.
- If you use three backslashes, the Java compiler turns it into \< or \>, the regex compiler ignores the backslash, and it tries to match < or >.
So, to match a raw Unicode escape sequence, you have to use four backslashes to match the one backslash in the escape sequence.
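If it helps to see the backslash counts side by side, here is a minimal sketch. The class name BackslashLevels, the ESCAPE_TEXT constant, and the matches helper are made up for illustration and are not part of your StringUtil; the only real API used is java.util.regex.Pattern.

```java
import java.util.regex.Pattern;

public class BackslashLevels {
    // Hypothetical constant: the six characters a file would hold,
    // i.e. a backslash, 'u', '0', '0', '3', 'C'.
    public static final String ESCAPE_TEXT = "\\u003C";

    // Hypothetical helper: true if the whole input matches the regex.
    public static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // One backslash: the compiler has already turned the escape into '<'.
        System.out.println("\u003C".equals("<"));

        // Two backslashes: the regex engine reads the escape and matches '<',
        // not the six-character text.
        System.out.println(matches("\\u003C", "<"));
        System.out.println(matches("\\u003C", ESCAPE_TEXT));

        // Three backslashes: the compiler leaves a backslash followed by '<',
        // and the regex engine treats that as a plain '<'.
        System.out.println(matches("\\\u003C", "<"));

        // Four backslashes: the regex engine sees an escaped backslash followed
        // by u003C, which matches the six-character text.
        System.out.println(matches("\\\\u003C", ESCAPE_TEXT));
    }
}
```

Only the four-backslash pattern matches the text as it comes out of the file; the others all end up matching a plain <.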
Notice that I changed your brackets, too. [<|>] is a character class that matches <, | or >; what you want is an alternation.
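Putting the two changes together (four backslashes and an alternation instead of a character class), a self-contained sketch might look like the following. The class name BlockedCharacters and the removeWithOriginalClass method are assumptions added purely for comparison, and the pattern is precompiled only as a convenience; calling replaceAll with the same regex should behave the same way.

```java
import java.util.regex.Pattern;

public class BlockedCharacters {
    // Alternation: '<', '>', or the literal text of either escape
    // (a backslash, then u003C or u003E, case-insensitive).
    private static final Pattern BLOCKED =
            Pattern.compile("(?i)(<|>|\\\\u003C|\\\\u003E)");

    // The character class from the question, kept only for comparison.
    private static final Pattern ORIGINAL_CLASS = Pattern.compile("[<|>]");

    public static String removeBlockedCharacters(String data) {
        if (data == null) {
            return null;
        }
        return BLOCKED.matcher(data).replaceAll("");
    }

    // Hypothetical helper showing what the original class strips.
    public static String removeWithOriginalClass(String data) {
        return ORIGINAL_CLASS.matcher(data).replaceAll("");
    }

    public static void main(String[] args) {
        // In source, a doubled backslash yields one backslash at run time,
        // so this is the same text a line read from the file would contain.
        String fromFile = "hello \\u003c here < and > there";
        System.out.println(removeBlockedCharacters(fromFile));

        // The character class also removes the pipe; the alternation keeps it.
        System.out.println(removeWithOriginalClass("a|b")); // ab
        System.out.println(removeBlockedCharacters("a|b")); // a|b
    }
}
```

Note that the spaces around each removed token stay in place, so the cleaned line has doubled spaces where the escape and the angle brackets used to be.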