I’m trying to recognize a UTF-8 BOM when reading a file. Of course, Java likes to deal with 16-bit chars, and the BOM is made up of 8-bit bytes.
My test code looks like:
public void testByteOrderMarks() {
    System.out.println("test byte order marks");
    byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a', (byte) 'b', (byte) 'c'};
    String test = new String(bytes, Charset.availableCharsets().get("UTF-8"));
    System.out.printf("test len: %s value %s\n", test.length(), test);
    String three = test.substring(0, 3);
    System.out.printf("len %d >%s<\n", three.length(), three);
    for (int i = 0; i < test.length(); i++) {
        byte b = bytes[i];
        char c = test.charAt(i);
        System.out.printf("b: %s %x c: %s %x\n", (char) b, b, c, (int) c);
    }
}
and the result is:
test byte order marks
test len: 4 value ?abc
len 3 >?ab<
b: ? ef c: ? feff
b: ? bb c: a 61
b: ? bf c: b 62
b: a 61 c: c 63
I can’t figure out why the length of “test” is 4 and not 6.
I can’t figure out why I don’t pick up each 8 bit byte to do the comparison.
Thanks
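A character is a character. The Byte Order Mark is the Unicode character U+FEFF. In Java it is the character '\uFEFF'. In UTF-8 that single character is encoded as the three bytes EF BB BF, so your six bytes decode to four characters, which is why the length of “test” is 4 and not 6. There is no need to delve into bytes. Just read the first character of the file, and if it matches '\uFEFF' it is the BOM. If it doesn’t match, then the file was written without a BOM.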
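As a rough sketch of that check, something like the following should work. The class and method names (BomCheck, startsWithBom, skipBom) are made up for illustration, and an in-memory byte array stands in for the real file; it assumes the input is decoded as UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
public class BomCheck {
    static final char BOM = '\uFEFF';

    // True if an already-decoded string begins with the BOM character.
    static boolean startsWithBom(String text) {
        return !text.isEmpty() && text.charAt(0) == BOM;
    }

    // Reads the first character; consumes it only if it is the BOM.
    // Otherwise the reader is rewound to the start.
    static boolean skipBom(BufferedReader reader) throws IOException {
        reader.mark(1);
        if (reader.read() == BOM) {
            return true;
        }
        reader.reset();
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Same bytes as in the question: EF BB BF followed by "abc".
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                        (byte) 'a', (byte) 'b', (byte) 'c'};

        // Decoding turns the three BOM bytes into one char, so length is 4.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.length());        // 4
        System.out.println(startsWithBom(decoded));  // true

        // The same check on a reader, as you would do for a file.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            System.out.println(skipBom(reader));     // true
            System.out.println(reader.readLine());   // abc
        }
    }
}
```

For a real file you would presumably wrap a FileInputStream (or use Files.newBufferedReader with StandardCharsets.UTF_8) instead of the ByteArrayInputStream; the comparison against '\uFEFF' stays the same.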