I started with an InputStreamReader, but it buffers its input, reading more than is required from the input stream (as mentioned in its Javadoc). Delving into the source code (java version “1.7.0_147-icedtea”), I got to the sun.nio.cs.StreamDecoder class, which contains the comment:
// In order to handle surrogates properly we must never try to produce
// fewer than two characters at a time. If we're only asked to return one
// character then the other is saved here to be returned later.
So I guess the question becomes “is this true, and if so why?” From my (very basic!) understanding of the 6 charsets required by the JLS, it is always possible to determine the exact number of bytes required to read a single character, so no read-ahead would be necessary.
For background: I had a binary file containing a bunch of data with different encodings (numbers, strings, single-byte tokens etc.). The basic format was a repeating sequence of a byte marker (indicating the type of data) followed by optional data if required for that type. The two types containing character data were null-terminated strings and strings with a preceding 2-byte length. So for null-terminated strings I thought something like this would do the trick:
    String readStringWithNull(InputStream in) throws IOException {
        StringWriter sw = new StringWriter();
        InputStreamReader isr = new InputStreamReader(in, "UTF-16LE");
        // Stops at the null terminator (0) or at end of stream (-1).
        for (int i; (i = isr.read()) > 0; ) {
            sw.write(i);
        }
        return sw.toString();
    }
But the InputStreamReader read ahead from the underlying stream, so subsequent read operations on the base InputStream missed data. For my particular case I knew that all characters would be UTF-16LE BMP (sort of UCS-2LE), so I just coded around that, but I’m still interested in the general case above.
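For illustration, here is a minimal sketch of how the BMP-only workaround mentioned above might look. It is not the code actually used; the class name NullTerminatedUtf16 is made up for the example. It reads 16-bit little-endian code units directly from the stream, so it never consumes bytes past the terminator. Unlike the InputStreamReader version, it treats end of stream before the terminator as an error.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: assumes BMP-only UTF-16LE text, as described in the question.
class NullTerminatedUtf16 {

    // Reads two bytes per code unit until a 0x0000 terminator is seen.
    // Consumes exactly the string's bytes plus the terminator, nothing more.
    static String readStringWithNull(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        while (true) {
            int lo = in.read();
            int hi = in.read();
            if (lo < 0 || hi < 0) {
                throw new EOFException("stream ended before the null terminator");
            }
            int unit = lo | (hi << 8); // little-endian: low byte comes first
            if (unit == 0) {
                return sb.toString();
            }
            sb.append((char) unit);
        }
    }

    public static void main(String[] args) throws IOException {
        // "Hi" and a terminator in UTF-16LE, followed by a one-byte marker 0x2A.
        byte[] data = {0x48, 0x00, 0x69, 0x00, 0x00, 0x00, 0x2A};
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(readStringWithNull(in)); // Hi
        System.out.println(in.read());              // 42, the marker is still there
    }
}
```

Each in.read() call on an unbuffered stream may go to the operating system, so in practice the base stream would probably be wrapped in a BufferedInputStream that the rest of the parser also reads from.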
Also, I’ve seen the question “InputStreamReader buffering issue”, which is similar but does not appear to answer this specific question.
Cheers,
Yes, the comment is correct, though possibly a bit obscure in its phraseology.
A UTF-8 encoding of a single Unicode code-point consists of between 1 and 4 bytes; see the Wikipedia UTF-8 examples. But in some cases, the Unicode code-point cannot be represented as one Java char. So the decoder potentially has to decode the multi-byte UTF-8 sequence as TWO Java char values … and hold one of them back.
It is a bit more complicated than this for variable-length encodings. The decoder reads ahead just enough bytes to form one Unicode code-point. This will be between 1 and 4 bytes for UTF-8, and by examining the bytes it knows when to stop. Then it decodes the bytes as 1 or 2 UTF-16 code-units (i.e. Java char values), delivers the first one, and saves the second one.
So you are potentially reading ahead in terms of bytes, but not in terms of code-points. And that is fine because the user’s keyboard (for example) is generating code-points.
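To make the surrogate case concrete, here is a small sketch (the class name SurrogatePairDemo is made up for the example). It feeds the 4-byte UTF-8 encoding of U+1F600, a code-point outside the BMP, through an InputStreamReader and reads it one char at a time. The first read() returns the high surrogate and the second returns the low surrogate that was held back.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class SurrogatePairDemo {

    // Decodes UTF-8 bytes by calling Reader.read() once per char.
    static String readCharByChar(byte[] utf8) throws IOException {
        Reader r = new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int c; (c = r.read()) != -1; ) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+1F600 encoded as UTF-8: one code-point, four bytes.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};
        String s = readCharByChar(utf8);
        System.out.println(s.length());                             // 2 char values
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1f600
    }
}
```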
Yes, it should be possible to do this. However, such a reader would need to make up to 4 separate system calls in order to read a single code-point, and that is very inefficient.
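As a rough sketch of what such a reader could look like for UTF-8 only (Utf8CodePointReader is a made-up name, not a library class): it inspects the lead byte to learn the sequence length, then reads exactly that many continuation bytes. It is deliberately simplified and does not reject overlong or surrogate encodings, which a strict decoder would.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical no-read-ahead decoder, UTF-8 only.
class Utf8CodePointReader {

    // Returns the next code-point, or -1 at end of stream.
    // Issues one read() per byte: up to 4 reads for one code-point.
    static int readCodePoint(InputStream in) throws IOException {
        int lead = in.read();
        if (lead < 0) {
            return -1;
        }
        int extra; // number of continuation bytes still to read
        int cp;    // code-point bits gathered so far
        if ((lead & 0x80) == 0x00) {        // 0xxxxxxx
            return lead;
        } else if ((lead & 0xE0) == 0xC0) { // 110xxxxx
            extra = 1;
            cp = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) { // 1110xxxx
            extra = 2;
            cp = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) { // 11110xxx
            extra = 3;
            cp = lead & 0x07;
        } else {
            throw new IOException("malformed UTF-8 lead byte: " + lead);
        }
        for (int i = 0; i < extra; i++) {
            int b = in.read();
            if (b < 0) {
                throw new EOFException("stream ended inside a UTF-8 sequence");
            }
            if ((b & 0xC0) != 0x80) {       // continuation bytes are 10xxxxxx
                throw new IOException("malformed UTF-8 continuation byte: " + b);
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }

    public static void main(String[] args) throws IOException {
        // U+00E9, then U+1F600, then one unrelated byte 0x2A.
        byte[] data = {(byte) 0xC3, (byte) 0xA9,
                       (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80, 0x2A};
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(Integer.toHexString(readCodePoint(in))); // e9
        System.out.println(Integer.toHexString(readCodePoint(in))); // 1f600
        System.out.println(in.read());                              // 42, left untouched
    }
}
```

The caller can turn a returned code-point into one or two char values with Character.toChars.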
No, it is not the preferred implementation. Yes, you could (in theory) buffer the stream yourself below the decoder. However, most programs aren’t written to build the stack with an explicit buffered stream between the raw InputStream and the InputStreamReader; instead they just wrap the raw stream directly in an InputStreamReader, which would make your approach perform really slowly. (And you try explaining to the average Joe programmer why he should put an extra explicit buffering layer into the stack.)
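The two ways of building the stack discussed above might look roughly like this. This is a sketch under assumptions: the class and method names are made up, and UTF-8 is chosen only for the example. Both stacks decode to the same characters; they differ only in where the byte buffering happens.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ReaderStacks {

    // Explicit buffering layer below the decoder.
    static Reader withExplicitBuffer(InputStream raw) {
        return new InputStreamReader(new BufferedInputStream(raw), StandardCharsets.UTF_8);
    }

    // What most programs do: wrap the raw stream directly.
    static Reader direct(InputStream raw) {
        return new InputStreamReader(raw, StandardCharsets.UTF_8);
    }

    // Reads everything from the reader, one char at a time.
    static String drain(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int c; (c = r.read()) != -1; ) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(drain(withExplicitBuffer(new ByteArrayInputStream(data)))); // abc
        System.out.println(drain(direct(new ByteArrayInputStream(data))));             // abc
    }
}
```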
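If they didn’t do something like this, performance would be terrible … see above. Besides, this is documented behavior. The InputStreamReader javadoc says: “To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.”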
The bottom line is that your use-case (where you want absolutely no low-level read-ahead on a Reader stack) is highly unusual, and not supported by the Java SE standard class library. If you really need this, feel free to implement your own version of InputStreamReader that doesn’t read ahead. But it strikes me as a bit odd that you would really need this.