I have an XML document (assume it is valid) that I must parse and store in a tree.
What is the best approach to parse it without using any libraries, just basic string manipulation?
Keep in mind that I don’t have to validate it, just parse it and store it in a tree.
The basic structure of XML is quite simple:
    <tag attributes>content</tag>
where the content may contain both normal text and more XML structures, or the special form
    <tag attributes/>
which is equivalent to
    <tag attributes></tag>
that is, empty content.
So if you don’t need to interpret a DTD or do other fancy things, you can do the following:
1. Check that the first non-whitespace character is <. If not, you don’t have XML and can just give an error and exit.
2. Now follows the tag name, up to the first whitespace, / or > character. Store that.
3. If the next non-whitespace character is /, check that it is followed by >. If so, you’ve finished parsing and can return your result. Otherwise, you’ve got malformed XML and can exit with an error.
4. If the character is >, then you’ve found the end of the begin tag. Now follows the content. Continue at step 6.
5. Otherwise, what follows is an attribute. Parse that, store the result, and continue at step 3.
6. Read the content until you find a < character.
7. If that character is followed by /, it’s the end tag. Check that it is followed by the tag name and >, and if so, return the result. Otherwise, throw an error.
8. If you get here, you’ve found the beginning of a nested XML element. Parse that with this same algorithm, and then continue at step 6.
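The steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than a complete parser: the names Node, skip_ws and parse_element are made up for this example, attributes are assumed to have the form name="value" with quoted values, and comments, CDATA sections, processing instructions, the XML declaration and entities are not handled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One XML element (hypothetical tree node for this sketch)."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # str (text) or Node


def skip_ws(s, i):
    """Return the index of the first non-whitespace character at or after i."""
    while i < len(s) and s[i].isspace():
        i += 1
    return i


def parse_element(s, i=0):
    """Parse one element starting at s[i]; return (node, index after it)."""
    i = skip_ws(s, i)
    # Step 1: the element must start with '<'.
    if i >= len(s) or s[i] != "<":
        raise ValueError(f"expected '<' at position {i}")
    i += 1
    # Step 2: the tag name runs until whitespace, '/' or '>'.
    start = i
    while i < len(s) and not s[i].isspace() and s[i] not in "/>":
        i += 1
    node = Node(s[start:i])
    while True:
        i = skip_ws(s, i)
        if i >= len(s):
            raise ValueError("unexpected end of input inside a tag")
        if s[i] == "/":
            # Step 3: '/' must be followed by '>' (empty element).
            if s[i + 1:i + 2] != ">":
                raise ValueError(f"expected '>' at position {i + 1}")
            return node, i + 2
        if s[i] == ">":
            # Step 4: end of the begin tag; the content follows.
            i += 1
            break
        # Step 5: otherwise an attribute, assumed to look like name="value".
        start = i
        while i < len(s) and s[i] not in "=/>" and not s[i].isspace():
            i += 1
        name = s[start:i]
        i = skip_ws(s, i)
        if i >= len(s) or s[i] != "=":
            raise ValueError(f"expected '=' at position {i}")
        i = skip_ws(s, i + 1)
        if i >= len(s) or s[i] not in "\"'":
            raise ValueError(f"expected a quote at position {i}")
        end = s.find(s[i], i + 1)  # matching closing quote
        if end == -1:
            raise ValueError("unterminated attribute value")
        node.attrs[name] = s[i + 1:end]
        i = end + 1
    while True:
        # Step 6: read text content up to the next '<'.
        lt = s.find("<", i)
        if lt == -1:
            raise ValueError(f"missing end tag for <{node.tag}>")
        if lt > i:
            node.children.append(s[i:lt])
        if s.startswith("</", lt):
            # Step 7: the end tag must repeat the tag name, then '>'.
            i = lt + 2
            if not s.startswith(node.tag, i):
                raise ValueError(f"mismatched end tag at position {lt}")
            i = skip_ws(s, i + len(node.tag))
            if i >= len(s) or s[i] != ">":
                raise ValueError(f"expected '>' at position {i}")
            return node, i + 1
        # Step 8: a nested element; recurse, then go back to step 6.
        child, i = parse_element(s, lt)
        node.children.append(child)
```

Calling parse_element('<a x="1"><b/>hi</a>') returns the root node together with the index just past its end tag. Text and child elements are kept in one ordered list so that mixed content keeps its order. Because the function recurses once per nesting level, very deeply nested input could hit Python's recursion limit; an explicit stack would avoid that.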