I need to debug an XML parser and I am wondering if I can

Question

Asked: June 17, 2026

I need to debug an XML parser, and I am wondering if I can construct “malicious” input that will cause it to misrecognize opening and closing tags.

Additionally, where can I find this sort of information in general? After this I will also want to be sure that the parser I am working with won’t have trouble with other special characters such as &, =, ", etc.

1 Answer

Editorial Team · Answer 1 · 2026-06-17T23:00:02+00:00

UTF-8 makes it very easy to figure out what the role of a code unit (i.e. a byte) is:

If the highest bit is not set, i.e. the code unit is 0xxxxxxx, then this byte expresses an entire code point, whose value is xxxxxxx (i.e. 7 bits of information).
If the highest bit is set and the code unit is 10xxxxxx, then it is a continuation byte of a multibyte sequence, carrying six bits of information.
Otherwise, the code unit is the initial byte of a multibyte sequence, as follows:
- 110xxxxx: Two bytes (one continuation byte), for 5 + 6 = 11 bits.
- 1110xxxx: Three bytes (two continuation bytes), for 4 + 6 + 6 = 16 bits.
- 11110xxx: Four bytes (three continuation bytes), for 3 + 6 + 6 + 6 = 21 bits.
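
The rules above can be sketched as a small classifier. This is only an illustration: the function name classify_utf8_byte and the labels it returns are made up for this example, and it looks at one byte in isolation without validating whole sequences.

```python
def classify_utf8_byte(b: int) -> str:
    """Return the role of a single UTF-8 code unit, based only on its high bits.

    The labels are hypothetical names chosen for this sketch.
    """
    if not 0 <= b <= 0xFF:
        raise ValueError("expected a byte value in the range 0..255")
    if b & 0x80 == 0x00:   # 0xxxxxxx
        return "single"
    if b & 0xC0 == 0x80:   # 10xxxxxx
        return "continuation"
    if b & 0xE0 == 0xC0:   # 110xxxxx
        return "lead2"
    if b & 0xF0 == 0xE0:   # 1110xxxx
        return "lead3"
    if b & 0xF8 == 0xF0:   # 11110xxx
        return "lead4"
    return "invalid"       # 11111xxx is not used by UTF-8


# Show the role of every byte in the encoding of a short string.
for byte in "<é€".encode("utf-8"):
    print(f"{byte:08b} {classify_utf8_byte(byte)}")
```

Note that this bit-pattern check alone still labels bytes such as 0xC0 as "lead2", even though no valid sequence can start with them; catching that requires looking at the whole sequence.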

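As you can see, the value 60 (0x3C, the character <), which is 00111100 in binary, is a single-byte code point of value 60. The same byte cannot occur as part of any multibyte sequence, because every byte of a multibyte sequence has its highest bit set.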
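
A quick way to convince yourself of this property is to encode a few non-ASCII characters and confirm that none of their bytes falls in the ASCII range. The helper name multibyte_bytes_are_high is hypothetical, and this is a spot check on sample input, not a proof.

```python
def multibyte_bytes_are_high(text: str) -> bool:
    """Return True if every byte of every multibyte character in `text` is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


sample = "\u00e9\u20ac\U0001F600<"   # e-acute, euro sign, an emoji, and '<'
print(multibyte_bytes_are_high(sample))
print("<".encode("utf-8"))           # '<' itself is always the single byte 0x3C
```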

The scheme can actually be extended up to seven bytes, encoding up to 36 bits, but since Unicode only requires 21 bits, four bytes suffice. The standard mandates that a code point must be represented with the minimal number of code units.
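
To make the "minimal number of code units" rule concrete, here is a sketch that computes the shortest permitted length for a code point, using the bit capacities listed earlier (7, 11, 16, and 21 bits). The function name min_utf8_length is an assumption for this example; surrogate code points are not treated specially here.

```python
def min_utf8_length(cp: int) -> int:
    """Return the minimal number of UTF-8 bytes needed to encode code point `cp`."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if cp < 0x80:       # fits in 7 bits
        return 1
    if cp < 0x800:      # fits in 11 bits
        return 2
    if cp < 0x10000:    # fits in 16 bits
        return 3
    return 4            # up to 21 bits


print(min_utf8_length(0x3C))    # '<' needs a single byte
print(min_utf8_length(0x20AC))  # the euro sign needs three
```

Any encoding of a code point that uses more bytes than this is an overlong form and must be rejected by a conforming decoder.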

Update: As @Mark Tolonen rightly points out, you should check carefully whether each encoded code point is actually encoded with the minimal number of code units. If a browser inadvertently accepted overlong input, a user could sneak something past you that you would not spot in a byte-for-byte analysis. As a starting point you could look for bytes like 10111100, but you would have to check the entire multibyte sequence of which it is a part, since that byte can of course occur legitimately as part of other code points. Ultimately, if you can’t trust the browser, you cannot really avoid decoding everything and checking the resulting code point sequence for occurrences of U+003C etc., without even bothering to look at the byte stream.
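
As a rough illustration of both points, the sketch below first shows how a careless two-byte decoder turns the overlong sequence C0 BC into U+003C, and then takes the safer route of decoding strictly and inspecting code points. The function names are hypothetical. The strict path relies on Python's built-in UTF-8 codec, which rejects overlong forms with UnicodeDecodeError.

```python
from typing import List


def naive_decode_two_byte(lead: int, cont: int) -> int:
    """Combine a 110xxxxx lead byte and a 10xxxxxx continuation byte
    WITHOUT checking for overlong forms (deliberately unsafe, for illustration)."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def find_markup_chars(data: bytes) -> List[int]:
    """Strictly decode `data`, then return the indices of XML-special characters.

    Raises UnicodeDecodeError if the input is not valid, minimal UTF-8.
    """
    text = data.decode("utf-8")  # strict error handling is the default
    return [i for i, ch in enumerate(text) if ch in '<>&"=']


# The overlong pair C0 BC decodes to 0x3C ('<') in the careless decoder ...
print(hex(naive_decode_two_byte(0xC0, 0xBC)))

# ... but a strict decoder refuses it outright.
try:
    find_markup_chars(b"\xc0\xbc")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

print(find_markup_chars(b"<a>"))
```

Feeding inputs like C0 BC to the parser under test is one way to check whether it rejects overlong sequences or silently treats them as a tag delimiter.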
