I’m looking for an efficient way to compare and get differences between two XML-based parse trees.
What would you suggest to be the best way to store those differences? I would have done this:
XML A:
<w:p> <w:pPr> <w:spacing w:after='1'/> </w:pPr> <w:r> <w:t>World</w:t> </w:r> </w:p>
XML B:
<w:p> <w:pPr> <w:spacing w:after='1'/> </w:pPr> <w:r> <w:t>ASDF</w:t> </w:r> </w:p>
The algorithm determines that ‘World’ has changed to ‘ASDF’ and then stores:
div: <w:p><w:r><w:t>World</w:t> -> <w:p><w:r><w:t>ASDF</w:t>
Is this enough to cover all cases that might occur?
Does anybody know of a good way to do that? Any help would really be appreciated!
It might get harder. Look at this example:
To be able to recognize both cases, you’ll have to store one as
and the other as
or something similar (you might also want to add ‘w:p’ closing tags to both of them to make them valid XML sub-trees).
In general, such programs can get very complicated, so I wouldn’t recommend you to create something completely new but to either use some existing diff algorithm (most will be good enough even without parsing the XML structure) or to modify one of them to suit your needs.