I’m trying to read a (Japanese) file that is encoded as a UTF-16 file.

Question

Editorial Team

Asked: June 16, 2026 (2026-06-16T04:38:13+00:00)

I’m trying to read a (Japanese) file that is encoded as a UTF-16 file.

When I read it using an InputStreamReader with a charset of "UTF-16", the file is read correctly:

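// Requires: java.io.BufferedReader, java.io.FileInputStream, java.io.InputStreamReader
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("JapanTest.txt"), "UTF-16"))) {
    String str;
    // readLine() returns null at end of file
    while ((str = in.readLine()) != null) {
        System.out.println(str);
    }
} catch (Exception e) {
    System.out.println(e);
}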

However, when I use File Channels and read from a byte array the Strings aren’t always converted correctly:

    // Requires: java.io.*, java.nio.MappedByteBuffer, java.nio.channels.FileChannel, java.nio.charset.Charset
    File f = new File("JapanTest.txt");
    FileInputStream fis = new FileInputStream(f);
    FileChannel channel = fis.getChannel();
    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size());
    buffer.position(0);
    int get = Math.min(buffer.remaining(), 1024);
    byte[] barray = new byte[1024];
    buffer.get(barray, 0, get);
    Charset charSet = Charset.forName("UTF-16");
    // endOfLinePos is a calculated value and defines the number of bytes to read
    String rowString = new String(barray, 0, endOfLinePos, charSet);
    System.out.println(rowString);

The problem I’ve found is that I can only read characters correctly if the MappedByteBuffer is at position 0. If I increment the position of the MappedByteBuffer and then read a number of bytes into a byte array, which is then converted to a string using the charset UTF-16, then the bytes are not converted correctly. I haven’t faced this issue if a file is encoded in UTF-8, so is this only an issue with UTF-16?

More Details:
I need to be able to read any line from the file channel, so to do this I build a list of line ending byte positions and then use those positions to be able to get the bytes for any given line and then convert them to a string.


1 Answer

Editorial Team · Answer 1 · 2026-06-16T04:38:15+00:00

A UTF-16 code unit is 2 bytes, whereas a UTF-8 code unit is a single byte. The single-byte code units and distinctive bit patterns of UTF-8 make it self-synchronizing: a decoder can start at any byte offset, and if it lands on a continuation byte it can either backtrack to the lead byte or skip forward, losing at most a single character.
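To illustrate the self-synchronizing property, here is a minimal sketch. The class and method names are made up for this example and are not part of any library. It relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx, so a reader dropped at an arbitrary offset can skip forward to the next character start.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf8Resync {

    /** Returns the index of the first byte at or after pos that begins a character. */
    static int nextCharStart(byte[] bytes, int pos) {
        // Continuation bytes look like 10xxxxxx; anything else starts a character.
        while (pos < bytes.length && (bytes[pos] & 0xC0) == 0x80) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Three Japanese characters written as escapes, 3 bytes each in UTF-8.
        byte[] utf8 = "\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_8);
        // Offset 1 is in the middle of the first character.
        int start = nextCharStart(utf8, 1);
        String tail = new String(utf8, start, utf8.length - start, StandardCharsets.UTF_8);
        System.out.println("resynchronized at byte " + start + ", decoded " + tail.length() + " chars");
    }
}
```

Only the character that was cut in half is lost; everything after it decodes normally. UTF-16 bytes carry no such marker, so the same trick does not work there.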

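With UTF-16 you must always work with pairs of bytes: you cannot start or stop reading at an odd byte offset. You also need to know the endianness and use either UTF-16LE or UTF-16BE explicitly when you are not reading from the start of the file, because there is no BOM at that point to indicate the byte order. Java's "UTF-16" charset falls back to big-endian when it finds no BOM, which is why decoding works at position 0 but can fail elsewhere.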
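Here is a rough sketch of how that could look for the line-index approach described in the question. All class and method names are hypothetical. It assumes the file starts with a BOM (falling back to big-endian if not) and that lines end with '\n'. The byte order is detected once, line starts are found by scanning 2-byte code units, and each slice is decoded with the explicit UTF-16LE or UTF-16BE charset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; none of these names come from the question.
final class Utf16Slices {

    /** Length in bytes of a UTF-16 BOM at the start of the buffer: 2 if present, else 0. */
    static int bomLength(ByteBuffer buf) {
        if (buf.limit() < 2) return 0;
        int b0 = buf.get(0) & 0xFF;
        int b1 = buf.get(1) & 0xFF;
        boolean bom = (b0 == 0xFF && b1 == 0xFE) || (b0 == 0xFE && b1 == 0xFF);
        return bom ? 2 : 0;
    }

    /** Little-endian if the buffer starts with FF FE; otherwise big-endian (assumed default). */
    static Charset detectByteOrder(ByteBuffer buf) {
        boolean le = buf.limit() >= 2
                && (buf.get(0) & 0xFF) == 0xFF
                && (buf.get(1) & 0xFF) == 0xFE;
        return le ? StandardCharsets.UTF_16LE : StandardCharsets.UTF_16BE;
    }

    /** Byte offset at which each line starts, found by scanning 2-byte code units for '\n'. */
    static List<Integer> lineStarts(ByteBuffer buf, Charset byteOrder) {
        ByteBuffer view = buf.duplicate();
        // Set the order explicitly so getChar() combines the two bytes correctly.
        view.order(StandardCharsets.UTF_16LE.equals(byteOrder)
                ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        List<Integer> starts = new ArrayList<>();
        int pos = bomLength(buf); // skip the BOM so it is not part of line 0
        starts.add(pos);
        for (; pos + 1 < view.limit(); pos += 2) {
            if (view.getChar(pos) == '\n') {
                starts.add(pos + 2); // the next line begins after the 2-byte newline
            }
        }
        return starts;
    }

    /** Decodes bytes [start, end) without moving the original buffer's position. */
    static String decode(ByteBuffer buf, int start, int end, Charset byteOrder) {
        if (((start | end) & 1) != 0) {
            throw new IllegalArgumentException("UTF-16 offsets must be even");
        }
        ByteBuffer slice = buf.duplicate();
        slice.limit(end);
        slice.position(start);
        return byteOrder.decode(slice).toString();
    }

    public static void main(String[] args) {
        // Stand-in for the mapped file: BOM, two characters, newline, one character, newline.
        byte[] data = "\uFEFF\u65E5\u672C\n\u8A9E\n".getBytes(StandardCharsets.UTF_16LE);
        ByteBuffer buf = ByteBuffer.wrap(data);
        Charset order = detectByteOrder(buf);
        List<Integer> starts = lineStarts(buf, order);
        System.out.println(starts); // [2, 8, 12]
        // Second line: from its start up to, but not including, its 2-byte newline.
        System.out.println(decode(buf, starts.get(1), starts.get(2) - 2, order));
    }
}
```

A MappedByteBuffer is a ByteBuffer, so it can be passed in place of the wrapped array. The last entry in the list is the offset just past the final newline, and files with "\r\n" endings would leave a trailing '\r' on each decoded line unless it is trimmed.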

Alternatively, if you control the file format, you can re-encode the file as UTF-8 and avoid the alignment and byte-order issues.
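If re-encoding is an option, a one-off conversion could look roughly like this. The class name and the paths passed on the command line are placeholders. It uses the standard "UTF-16" decoder, which reads the BOM to pick the byte order, and writes the text back out as UTF-8.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical utility for illustration only.
final class Utf16ToUtf8 {

    /** Copies source (UTF-16 with BOM) to target as UTF-8, overwriting target. */
    static void convert(Path source, Path target) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_16);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] chunk = new char[8192];
            int n;
            // Copy characters in chunks so line endings are preserved as-is.
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Utf16ToUtf8 <utf16-input> <utf8-output>");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

After conversion, byte offsets recorded for line endings can be used directly, since a '\n' byte in UTF-8 never appears inside another character.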
