I have an XML file encoded in UTF-8. When I open it in Java, some (in theory valid) characters remain encoded. For example, I try to get the 𐌰 character:
String str = new String(line.getBytes("UTF-8"));
System.out.println(str.charAt(pos));
where pos is the position where the character should be.
Instead, I get the & character.
When I open the file in Notepad++ with the encoding set to UTF-8, I see the same problem.
As I see it, there are two options: read only the numeric codes from the start (no characters), or replace all codes with the corresponding characters.
What should I do and how?
Please don’t construct a String from a byte array without specifying a charset; that’s always a sign of a problem.
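As a minimal sketch of what specifying the charset looks like, the snippet below decodes a byte array explicitly as UTF-8. The class name CharsetDemo and the helper decodeUtf8 are made up for illustration and do not come from the question.

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: decodes bytes as UTF-8 regardless of the platform default charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+10330 written as its surrogate pair; it occupies 4 bytes in UTF-8.
        byte[] bytes = "\uD800\uDF30".getBytes(StandardCharsets.UTF_8);
        String s = decodeUtf8(bytes);
        System.out.println(bytes.length);      // number of UTF-8 bytes
        System.out.println(s.codePointAt(0));  // full code point, not half of the pair
    }
}
```

With new String(bytes) alone, the result depends on the default charset of the JVM, which is why the same file can decode correctly on one machine and not on another.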
If charAt returns the ampersand character, then you are either not using an XML parser to load the file, or the character reference is double encoded (for example &amp;#66352; instead of &#66352;). The code point 66352 (U+10330) doesn’t fit into Java’s 16-bit char data type, so it is stored as two surrogate chars in a String. You should use the codePointAt method in this case.
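A rough sketch of both points, assuming the file stores the character as the numeric reference &#66352; inside some element: let an XML parser resolve the reference, then read the result with codePointAt. The class name CodePointDemo, the helper rootText, and the sample element w are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CodePointDemo {
    // Hypothetical helper: parses XML bytes and returns the root element's text,
    // with character references already resolved by the parser.
    static String rootText(byte[] xmlBytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Sample document (an assumption, not the asker's file).
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><w>a&#66352;b</w>";
        String text = rootText(xml.getBytes(StandardCharsets.UTF_8));

        int pos = 1; // index of the first char of the surrogate pair
        System.out.println(Character.isHighSurrogate(text.charAt(pos))); // charAt sees only half
        int cp = text.codePointAt(pos);                                  // codePointAt joins the pair
        System.out.println(cp);
        System.out.println(new String(Character.toChars(cp)));           // back to a printable String
    }
}
```

Note that the character takes up two char positions, so any index after it shifts by one compared with counting characters by eye; text.codePoints() or offsetByCodePoints can help when positions are counted in code points.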