I am trying to parse XML messages which are send to my C# application over TCP. Unfortunately, the protocol can not be changed and the XML messages are not delimited and no length prefix is used. Moreover the character encoding is not fixed but each message starts with an XML declaration <?xml>. The question is, how can i read one XML message at a time, using C#.
Up to now, I tried to read the data from the TCP stream into a byte array and use it through a MemoryStream. The problem is, the buffer might contain more than one XML messages or the first message may be incomplete. In these cases, I get an exception when trying to parse it with XmlReader.Read or XmlDocument.Load, but unfortunately the XmlException does not really allow me to distinguish the problem (except parsing the localized error string).
I tried using XmlReader.Read and count the number of Element and EndElement nodes. That way I know when I am finished reading the first, entire XML message.
However, there are several problems. If the buffer does not yet contain the entire message, how can I distinguish the XmlException from an actually invalid, non-well-formed message? In other words, if an exception is thrown before reading the first root EndElement, how can I decide whether to abort the connection with error, or to collect more bytes from the TCP stream?
If no exception occurs, the XmlReader is positioned at the start of the root EndElement. Casting the XmlReader to IXmlLineInfo gives me the current LineNumber and LinePosition, however it is not straight forward to get the byte position where the EndElement really ends. In order to do that, I would have to convert the byte array into a string (with the encoding specified in the XML declaration), seek to LineNumber,LinePosition and convert that back to the byte offset. I try to do that with StreamReader.ReadLine, but the stream reader gives no public access to the current byte position.
All this seams very inelegant and non robust. I wonder if you have ideas for a better solution. Thank you.
After locking around for some time I think I can answer my own question as following (I might be wrong, corrections are welcome):
I found no method so that the
XmlReadercan continue parsing a second XML message (at least not, if the second message has anXmlDeclaration).XmlTextReader.ResetStatecould do something similar, but for that I would have to assume the same encoding for all messages. Therefor I could not connect theXmlReaderdirectly to the TcpStream.After closing the
XmlReader, the buffer is not positioned at the readers last position. So it is not possible to close the reader and use a new one to continue with the next message. I guess the reason for this is, that the reader could not successfully seek on every possible input stream.When
XmlReaderthrows an exception it can not be determined whether it happened because of an premature EOF or because of a non-wellformed XML.XmlReader.EOFis not set in case of an exception. As workaround I derived my own MemoryBuffer, which returns the very last byte as a single byte. This way I know that theXmlReaderwas really interested in the last byte and the following exception is likely due to a truncated message (this is kinda sloppy, in that it might not detect every non-wellformed message. However, after appending more bytes to the buffer, sooner or later the error will be detected.I could cast my
XmlReaderto theIXmlLineInfointerface, which gives access to theLineNumberand theLinePositionof the current node. So after reading the first message I remember these positions and use it to truncate the buffer. Here comes the really sloppy part, because I have to use the character encoding to get the byte position. I am sure you could find test cases for the code below where it breaks (e.g. internal elements with mixed encoding). But up to now it worked for all my tests.Here is the parser class I came up with — may it be useful (I know, its very far from perfect…)