I am trying to programmatically clean up invalid XML with duplicate root elements in C# .NET 4.0. What I want to do is consolidate all of the inner elements into one root element and remove the duplicate roots, so that
<a>
<b></b>
</a>
<a>
<c></c>
</a>
becomes
<a>
<b></b>
<c></c>
</a>
However, the duplicated root element could also appear in the inner XML. In that case, we would not want to replace it, so that
<a>
<a></a>
<b></b>
</a>
<a>
<c></c>
<a></a>
</a>
becomes
<a>
<a></a>
<b></b>
<c></c>
<a></a>
</a>
Also, the duplicated root element isn’t guaranteed to always be <a>; it could have any name.
Thus far I’ve been trying to come up with an elegant Regex to accomplish this, such as /<((.|\n|\r)*?)>(.|\n|\r)*<\/\1>/, but the problem is that a greedy match on the inner XML matches too much, and a non-greedy match on the inner XML matches too little.
I was hoping I wouldn’t have to resort to creating a stack to count open and close tags to identify when I was back to the root of the document. I’m looking for a simple and elegant way of solving this problem.
Open source, third-party libraries are potentially acceptable solutions if one of them handles this kind of situation, but I’d rather avoid them.
Does anyone have any ideas?
It may be better to actually read the XML as XML. You should be able to create an XmlReader whose XmlReaderSettings.ConformanceLevel is set to ConformanceLevel.Fragment and read each top-level fragment as normal XML. Then use normal XML processing to select and copy the nodes into a single root.
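Here is a rough sketch of that idea using LINQ to XML, not a definitive implementation. The class name FragmentMerger and the method name MergeRoots are made up for illustration. It assumes every top-level element has the same name, so it simply keeps the first one and copies the child nodes of the others into it. Because only top-level elements are treated as roots, a nested <a> is left alone.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class FragmentMerger
{
    // Hypothetical helper: folds the children of every top-level element
    // into the first top-level element. Assumes all roots share one name.
    public static XElement MergeRoots(string xml)
    {
        var settings = new XmlReaderSettings
        {
            // Allow more than one top-level element in the input.
            ConformanceLevel = ConformanceLevel.Fragment,
            IgnoreWhitespace = true
        };

        XElement merged = null;
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // ReadFrom consumes the whole element, nested content included,
                    // and leaves the reader on the node that follows it.
                    var element = (XElement)XNode.ReadFrom(reader);
                    if (merged == null)
                        merged = element;
                    else
                        merged.Add(element.Nodes()); // nodes with a parent are cloned on Add
                }
                else
                {
                    // Skip anything that is not an element at the top level.
                    reader.Read();
                }
            }
        }
        return merged; // null if the input contained no elements
    }
}
```

Calling FragmentMerger.MergeRoots(text).ToString() should then give you a single well-formed root. If the top-level elements might have different names, you would need to compare element.Name against merged.Name and decide what to do with the mismatches.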