I have written a simple C++ shell program to parse large XML files and fix syntax errors.
So far I have covered everything I can think of except strings within strings, for example:
<ROOT>
<NODE attribute="This is a "string within" a string" />
</ROOT>
My program loops through the entire XML file character by character (keeping only a few characters in memory at a time for efficiency). It looks for characters such as &, < and > and escapes them with &amp;, &lt; and &gt;. A basic example of what I am doing can be found in the accepted answer to the question "Escaping characters in large XML files".
The question is: what conditions or logic can I use to detect “string within” so that I can escape its quotes, like this:
<ROOT>
<NODE attribute="This is a &quot;string within&quot; a string" />
</ROOT>
Is it even possible at all?
I think it’s difficult to decide where one attribute ends and another begins. I think you need to restrict the possible input you can parse, otherwise you will have ambiguous cases such as this one:
These are either two attributes or one attribute.
One assumption you could make is that a new attribute begins after an even number of double quotes followed by an equals sign. Then you simply replace all the inner double quotes with your escape string. Alternatively, any equals sign after two or more double quotes means a new attribute. The same could be assumed for the end of the node.
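The assumption above could be sketched roughly like this. It is only one possible reading of the heuristic, not a complete solution: a double quote is treated as the end of a value only if it is followed by the end of the tag or by whitespace, a name, an equals sign and another double quote. The names escapeInnerQuotes, looksLikeClosingQuote, isNameChar and skipSpaces are hypothetical. The sketch assumes the text of a single tag is already in a string, that all attribute values use double quotes, and that attributes are separated by whitespace. A value that itself contains something like name="..." is still ambiguous and will be split into two attributes.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: characters accepted in a (simplified) attribute name.
static bool isNameChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 ||
           c == '_' || c == ':' || c == '-' || c == '.';
}

// Hypothetical helper: index of the first non-whitespace character at or after i.
static std::size_t skipSpaces(const std::string& s, std::size_t i) {
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i])) != 0) ++i;
    return i;
}

// Guesses whether the double quote at `pos` ends an attribute value.
// It does if it is followed by the end of the tag, or by
// whitespace + name + '=' + '"' (the start of the next attribute).
static bool looksLikeClosingQuote(const std::string& tag, std::size_t pos) {
    const std::size_t n = tag.size();
    std::size_t i = skipSpaces(tag, pos + 1);
    if (i >= n || tag[i] == '>') return true;
    if ((tag[i] == '/' || tag[i] == '?') && i + 1 < n && tag[i + 1] == '>') return true;
    if (i == pos + 1) return false;            // no whitespace, so no new attribute
    const std::size_t nameStart = i;
    while (i < n && isNameChar(tag[i])) ++i;
    if (i == nameStart) return false;          // no attribute name found
    i = skipSpaces(tag, i);
    if (i >= n || tag[i] != '=') return false; // name not followed by '='
    i = skipSpaces(tag, i + 1);
    return i < n && tag[i] == '"';
}

// Replaces double quotes inside attribute values with &quot;,
// leaving the opening and (guessed) closing quotes untouched.
std::string escapeInnerQuotes(const std::string& tag) {
    std::string out;
    out.reserve(tag.size());
    bool inValue = false;
    for (std::size_t i = 0; i < tag.size(); ++i) {
        const char c = tag[i];
        if (c != '"') {
            out += c;
        } else if (!inValue) {
            inValue = true;                    // opening quote of a value
            out += c;
        } else if (looksLikeClosingQuote(tag, i)) {
            inValue = false;                   // closing quote of a value
            out += c;
        } else {
            out += "&quot;";                   // quote inside the value
        }
    }
    return out;
}
```

Because the check looks ahead past the quote, the whole tag has to be buffered rather than just a few characters. Tags are usually short, so this should still work for large files as long as they are processed one tag at a time.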