I want to make an HTTP request using Node.js to load some text from a web server. Since the response can contain a lot of text (several megabytes), I want to process each text chunk separately. I can achieve this using the following code:
var req = http.request(reqOptions, function(res) {
...
res.setEncoding('utf8');
res.on('data', function(textChunk) {
// process utf8 text chunk
});
});
This seems to work without problems. However, I want to support HTTP compression, so I use zlib:
var zip = zlib.createUnzip();
// NO res.setEncoding('utf8') here since we need the raw bytes for zlib
res.on('data', function(chunk) {
// do something like checking the number of bytes downloaded
zip.write(chunk); // give the raw bytes to zlib (see the zip 'data' handler below)
});
zip.on('data', function(chunk) {
// convert chunk to utf8 text:
var textChunk = chunk.toString('utf8');
// process utf8 text chunk
});
This can be a problem for multi-byte characters like '\u00c4', which consists of two bytes in UTF-8: 0xC3 and 0x84. If the first byte falls at the end of one chunk (Buffer) and the second byte at the start of the next chunk, then chunk.toString('utf8') will produce incorrect characters at the end/beginning of the text chunk. How can I avoid this?
Hint: I still need the buffer (more specifically the number of bytes in the buffer) to limit the number of downloaded bytes. So using res.setEncoding('utf8') like in the first example code above for non-compressed data does not suit my needs.
Single Buffer
If you have a single Buffer, you can use its toString method, which converts all or part of the binary contents to a string using a specific encoding. It defaults to utf8 if you don’t provide a parameter, but the encoding can also be set explicitly, as in buffer.toString('utf8').
Streamed Buffers
If you have streamed buffers like in the question above, where the first byte of a multi-byte UTF-8 character may be contained in the first Buffer (chunk) and the second byte in the second Buffer, then you should use a StringDecoder. This way, bytes of incomplete characters are buffered by the StringDecoder until all required bytes have been written to the decoder.
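The two cases can be compared in a minimal sketch. The chunk contents here are made up for illustration: they simulate a stream that happens to split '\u00c4' between two Buffers. The variable names (chunk1, chunk2, whole, naive, decoded) are assumptions, not part of the question's code.

```javascript
const { StringDecoder } = require('string_decoder');

// '\u00c4' is encoded in UTF-8 as the two bytes 0xC3 0x84.
// Simulate a stream that splits the character across two chunks.
const chunk1 = Buffer.from([0x41, 0xc3]); // 'A' plus the first byte of '\u00c4'
const chunk2 = Buffer.from([0x84, 0x42]); // second byte of '\u00c4' plus 'B'

// Single Buffer: all bytes are present, so toString decodes correctly.
const whole = Buffer.concat([chunk1, chunk2]).toString('utf8');

// Naive per-chunk conversion: each half of the split character is
// replaced by U+FFFD, so the text is corrupted at the chunk boundary.
const naive = chunk1.toString('utf8') + chunk2.toString('utf8');

// StringDecoder keeps the incomplete trailing byte of chunk1 in its
// internal buffer and emits the character once chunk2 completes it.
const decoder = new StringDecoder('utf8');
const part1 = decoder.write(chunk1); // only the complete characters so far
const part2 = decoder.write(chunk2); // the completed character plus the rest
const decoded = part1 + part2 + decoder.end(); // end() flushes leftover bytes

console.log(JSON.stringify({ whole, naive, decoded }));
```

Applied to the code in the question, this would mean creating one decoder outside the zip 'data' handler, replacing chunk.toString('utf8') with decoder.write(chunk), and calling decoder.end() when the zip stream ends. The res 'data' handler still receives raw Buffers, so the byte counting is unaffected.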