I have a Stream that produces UTF-8 encoded strings. The strings represent XML documents that I need to parse. The stream is obtained from a TcpClient.
Suppose I read the stream into buffers of size 64 (a little small, I know). Passing these 64-byte buffers directly to the string decoding step could fail, because a UTF-8 encoded character may be split across the 64-byte boundary: one buffer may end with the first two bytes of a character, and the next buffer starts with its last byte.
What I do now is concatenate buffers until a read returns fewer than the full 64 bytes, which indicates that I have read to the end of something (in my case, an XML document). However, once in a while an XML document ends exactly at the 64-byte boundary. In that case I cannot tell that the byte array is complete and can be passed to the decoding step, so I end up waiting for the next document.
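To make the problem concrete, here is a minimal sketch (not my real code) that decodes each buffer on its own. The class name, the sample text and the buffer size of 4 are made up for illustration. With the default UTF8 encoding, a split character does not throw; it shows up as U+FFFD replacement characters in the output.

```csharp
using System;
using System.IO;
using System.Text;

static class NaiveDecodeDemo
{
    // Decodes every buffer independently, which is the approach that breaks.
    public static string DecodePerBuffer(Stream stream, int bufferSize)
    {
        var bytes = new byte[bufferSize];
        var result = new StringBuilder();
        int read;
        while ((read = stream.Read(bytes, 0, bytes.Length)) > 0)
        {
            // No state is carried over, so a partial character at the end
            // of one buffer (and its tail in the next) cannot be decoded.
            result.Append(Encoding.UTF8.GetString(bytes, 0, read));
        }
        return result.ToString();
    }

    public static void Main()
    {
        // The euro sign takes three bytes in UTF-8 and straddles a 4-byte boundary here.
        byte[] data = Encoding.UTF8.GetBytes("abc\u20ac");
        string decoded = DecodePerBuffer(new MemoryStream(data), 4);
        Console.WriteLine(decoded.Contains("\uFFFD")); // replacement characters appear
    }
}
```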
I realize I can lower the chances by increasing the buffer size. However, a small chance always remains that it happens. I could also increase the buffer size such that any XML document I encounter will fit, but I just wonder whether there is another solution, somehow detecting from the byte stream where the character boundaries are.
You are right about the problems and pitfalls.
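The solution already exists: wrap a StreamReader around your stream and use Read() and ReadLine().
If you do want a DIY solution, you’ll have to look at the state kept by a Decoder (obtained from Encoding.GetDecoder()), which holds on to an incomplete character between calls. Beyond my capabilities.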
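A minimal sketch of both routes, assuming the stream is plain UTF-8. The class and method names are hypothetical, and a MemoryStream stands in for the stream you would get from TcpClient.GetStream(). The DIY variant relies on the Decoder keeping the trailing bytes of an incomplete character and completing it on the next GetChars call.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8BoundaryDemo
{
    // Route 1: StreamReader buffers and decodes, handling split characters itself.
    public static string ReadAllWithStreamReader(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Route 2: DIY with a stateful Decoder and a small byte buffer.
    public static string ReadAllWithDecoder(Stream stream, int bufferSize)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var bytes = new byte[bufferSize];
        // GetMaxCharCount leaves room for a character carried over from the previous call.
        var chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize)];
        var result = new StringBuilder();
        int read;
        while ((read = stream.Read(bytes, 0, bytes.Length)) > 0)
        {
            // Incomplete trailing bytes stay inside the decoder until the rest arrives.
            int charCount = decoder.GetChars(bytes, 0, read, chars, 0);
            result.Append(chars, 0, charCount);
        }
        return result.ToString();
    }

    public static void Main()
    {
        // Sample document with multi-byte characters; a 4-byte buffer forces splits.
        byte[] data = Encoding.UTF8.GetBytes("<a>h\u00e9llo \u20ac</a>");
        string viaDecoder = ReadAllWithDecoder(new MemoryStream(data), 4);
        string viaReader = ReadAllWithStreamReader(new MemoryStream(data));
        Console.WriteLine(viaDecoder == viaReader);
    }
}
```

Note that this only takes care of character boundaries. ReadToEnd() on a network stream returns when the other side closes the connection, so if several XML documents arrive on one connection you still need some way to tell where each document ends.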