I need to debug an XML parser, and I am wondering whether I can construct “malicious” input that will cause it to fail to recognize opening and closing tags correctly.
Additionally, where can I find this sort of information in general? After this I will also want to be sure that the parser I am working with won’t have trouble with other special characters such as &, =, ", etc.
UTF-8 makes it very easy to figure out what the role of a code unit (i.e. a byte) is:
- If the highest bit is not set, i.e. the code unit is 0xxxxxxx, then this byte expresses an entire code point, whose value is xxxxxxx (i.e. 7 bits of information).
- If the highest bit is set and the code unit is 10xxxxxx, then it is a continuation byte of a multibyte sequence, carrying six bits of information.
- Otherwise, the code unit is the initial byte of a multibyte sequence, as follows:
  - 110xxxxx: Two bytes (one continuation byte), for 5 + 6 = 11 bits.
  - 1110xxxx: Three bytes (two continuation bytes), for 4 + 6 + 6 = 16 bits.
  - 11110xxx: Four bytes (three continuation bytes), for 3 + 6 + 6 + 6 = 21 bits.

As you can see, the value 60 (0x3C, the < character), which is 00111100, is a single-byte code point of value 60, and the same byte cannot occur as part of any multibyte sequence.

The scheme can actually be extended up to seven bytes, encoding up to 36 bits, but since Unicode only requires 21 bits, four bytes suffice. The standard mandates that a code point must be represented with the minimal number of code units.
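The byte roles described above can be sketched as a small Python helper. This is only an illustration: the function name `classify_utf8_byte` and the role labels it returns are made up for this example and are not part of any library.

```python
def classify_utf8_byte(b: int) -> str:
    """Return the role of a single UTF-8 code unit (a byte value 0..255)."""
    if (b & 0x80) == 0x00:   # 0xxxxxxx
        return "single"
    if (b & 0xC0) == 0x80:   # 10xxxxxx
        return "continuation"
    if (b & 0xE0) == 0xC0:   # 110xxxxx
        return "lead2"
    if (b & 0xF0) == 0xE0:   # 1110xxxx
        return "lead3"
    if (b & 0xF8) == 0xF0:   # 11110xxx
        return "lead4"
    return "invalid"         # 11111xxx is not used for Unicode


# The byte for '<' (0x3C) always stands for itself; every byte of a
# multibyte sequence has its highest bit set, so none can equal 0x3C.
for ch in ["<", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, [classify_utf8_byte(b) for b in encoded])
```

Because lead bytes and continuation bytes all have the highest bit set, a byte-wise scan for 0x3C cannot produce a false hit inside a well-formed multibyte sequence.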
Update: As @Mark Tolonen rightly points out, you should check carefully whether each encoded code point is actually encoded with the minimal number of code units. If a browser inadvertently accepted an overlong encoding, a user could sneak something past you that you would not spot in a byte-for-byte analysis. As a starting point you could look for bytes like 10111100 (the continuation byte that carries the low six bits of an overlong <, as in 11000000 10111100), but you’d have to check the entire multibyte sequence of which it is a part (since it can of course occur legitimately as part of other code points). Ultimately, if you can’t trust the browser, there is no way around decoding everything and checking the resulting code point sequence for occurrences of U+003C etc., rather than looking at the byte stream at all.
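To make the minimal-length check concrete, here is a rough Python sketch of a decoder that rejects overlong sequences. The names `decode_strict` and `MIN_CODE_POINT` are hypothetical, chosen for this example; the code is meant to illustrate the check, not to replace a real decoder.

```python
from typing import List

# Smallest code point that legitimately needs a sequence of each length.
MIN_CODE_POINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_strict(data: bytes) -> List[int]:
    """Decode UTF-8 into code points, rejecting overlong (non-minimal) forms."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if (lead & 0x80) == 0x00:
            length, value = 1, lead
        elif (lead & 0xE0) == 0xC0:
            length, value = 2, lead & 0x1F
        elif (lead & 0xF0) == 0xE0:
            length, value = 3, lead & 0x0F
        elif (lead & 0xF8) == 0xF0:
            length, value = 4, lead & 0x07
        else:
            raise ValueError(f"unexpected byte 0x{lead:02X} at offset {i}")

        tail = data[i + 1 : i + length]
        if len(tail) != length - 1:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in tail:
            if (b & 0xC0) != 0x80:
                raise ValueError(f"bad continuation byte at offset {i}")
            value = (value << 6) | (b & 0x3F)

        # The check this update is about: was the shortest form used?
        if value < MIN_CODE_POINT[length]:
            raise ValueError(f"overlong encoding of U+{value:04X} at offset {i}")
        # Surrogates and values beyond U+10FFFF are not valid in UTF-8 either.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"invalid code point U+{value:04X} at offset {i}")

        code_points.append(value)
        i += length
    return code_points


print(decode_strict(b"<a>"))           # well-formed input decodes normally
try:
    decode_strict(b"\xc0\xbc")         # two-byte overlong form of '<'
except ValueError as err:
    print("rejected:", err)
```

Python's built-in decoder already enforces this rule: `b"\xc0\xbc".decode("utf-8")` raises `UnicodeDecodeError`. If your parser is fed by a conforming decoder, scanning the decoded code points for U+003C is therefore enough.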