In OOXML, formatting such as bold, italic, etc. can be (and often annoyingly is) split up between multiple elements, like so:
<w:p>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t xml:space="preserve">This is a </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t xml:space="preserve">bold </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
      <w:i/>
    </w:rPr>
    <w:t>with a bit of italic</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t xml:space="preserve"> </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t>paragr</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t>a</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t>ph</w:t>
  </w:r>
  <w:r>
    <w:t xml:space="preserve"> with some non-bold in it too.</w:t>
  </w:r>
</w:p>
I need to combine these formatting elements to produce this:
<p><b>This is a mostly bold <i>with a bit of italic</i> paragraph</b> with some non-bold in it too.</p>
My initial plan was to write out the start formatting tag when it is first encountered, using:
<xsl:text disable-output-escaping="yes">&lt;b&gt;</xsl:text>
Then, after processing each <w:r>, I would check the next one to see whether the formatting is still present. If it is not, I would add the end tag in the same way I add the start tag.
I keep thinking there must be a better way to do this, and I’d be grateful for any suggestions.
I should also mention that I am tied to XSLT 1.0.
The reason I need this is that we compare an XML file before it is transformed into OOXML with the result after it is transformed back out of OOXML. The extra formatting tags make it appear as though changes were made when they were not.
Here is a complete XSLT 1.0 solution:
When this transformation is applied to the following XML document (based on the one provided, but made more complicated to show how more edge cases are covered):
the wanted, correct result is produced:
Explanation:
pass1 result (indented for readability):
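The intermediate form is one wrapper per run, so the last three bold runs of the question's input become <b>paragr</b><b>a</b><b>ph</b>. As a rough illustration only (not necessarily the code used in this answer), a first pass producing that shape could look like the sketch below. The template name "wrap" and its parameters are made up for this example. It assumes that the w prefix is bound to the WordprocessingML namespace, that every child of w:rPr should become a wrapper element, and that the properties appear in the same order in every run. If the order can vary, they would have to be sorted first, which in XSLT 1.0 typically means a node-set extension function.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="w">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Each paragraph becomes a <p> that holds the converted runs. -->
  <xsl:template match="w:p">
    <p><xsl:apply-templates select="w:r"/></p>
  </xsl:template>

  <!-- Each run is wrapped in one element per formatting property. -->
  <xsl:template match="w:r">
    <xsl:call-template name="wrap">
      <xsl:with-param name="props" select="w:rPr/*"/>
      <xsl:with-param name="text" select="w:t"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Hypothetical helper: nests one element per property, innermost last.
       Assumption: every property (w:b, w:i, ...) maps to an element with
       the same local name; attributes such as w:val are ignored. -->
  <xsl:template name="wrap">
    <xsl:param name="props"/>
    <xsl:param name="text"/>
    <xsl:choose>
      <xsl:when test="$props">
        <xsl:element name="{local-name($props[1])}">
          <xsl:call-template name="wrap">
            <xsl:with-param name="props" select="$props[position() &gt; 1]"/>
            <xsl:with-param name="text" select="$text"/>
          </xsl:call-template>
        </xsl:element>
      </xsl:when>
      <xsl:otherwise>
        <!-- No properties left: emit the text of the run's w:t. -->
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

A run with both w:b and w:i comes out as <b><i>...</i></b>, and a run without w:rPr comes out as plain text. Nothing is merged at this stage.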
2. The second pass (executed in mode "pass2") merges any batch of consecutive, identically named elements into a single element with that name. It calls itself recursively on the children of the merged elements, so batches at any depth are merged.
3. Do note: we do not (and cannot) use the following-sibling:: or preceding-sibling:: axes, because only the nodes to be merged at the top level are really siblings. For this reason we process all nodes simply as a node-set.
4. This solution is completely generic: it merges any batch of consecutive, identically named elements at any depth, and no specific names are hardcoded.
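To make the merging idea concrete, here is a minimal standalone sketch of such a merge step. It is an illustration, not the answer's own listing. The template names "merge" and "batch-length" are made up, and the stylesheet runs directly on a document that is already in the one-wrapper-per-run form, so no node-set extension is needed. It assumes unprefixed element names, no attributes worth keeping on the merged elements, and no whitespace-only text nodes between the elements (indentation between two <b> elements would split the batch).

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Copy the document element, then merge among its children. -->
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:call-template name="merge">
        <xsl:with-param name="nodes" select="node()"/>
      </xsl:call-template>
    </xsl:copy>
  </xsl:template>

  <!-- Walks a node-set in document order. The nodes need not be siblings. -->
  <xsl:template name="merge">
    <xsl:param name="nodes"/>
    <xsl:variable name="first" select="$nodes[1]"/>
    <xsl:choose>
      <xsl:when test="not($first)"/>
      <!-- Non-element nodes (text, comments) are copied unchanged. -->
      <xsl:when test="not($first/self::*)">
        <xsl:copy-of select="$first"/>
        <xsl:call-template name="merge">
          <xsl:with-param name="nodes" select="$nodes[position() &gt; 1]"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <!-- Length of the leading run of elements named like $first. -->
        <xsl:variable name="len">
          <xsl:call-template name="batch-length">
            <xsl:with-param name="nodes" select="$nodes"/>
            <xsl:with-param name="name" select="name($first)"/>
          </xsl:call-template>
        </xsl:variable>
        <!-- One element for the whole batch; recurse on all its children. -->
        <xsl:element name="{name($first)}">
          <xsl:call-template name="merge">
            <xsl:with-param name="nodes"
                select="$nodes[position() &lt;= number($len)]/node()"/>
          </xsl:call-template>
        </xsl:element>
        <!-- Continue with whatever follows the batch. -->
        <xsl:call-template name="merge">
          <xsl:with-param name="nodes"
              select="$nodes[position() &gt; number($len)]"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Counts how many leading nodes are elements named $name. -->
  <xsl:template name="batch-length">
    <xsl:param name="nodes"/>
    <xsl:param name="name"/>
    <xsl:param name="count" select="0"/>
    <xsl:choose>
      <xsl:when test="$nodes[1]/self::* and name($nodes[1]) = $name">
        <xsl:call-template name="batch-length">
          <xsl:with-param name="nodes" select="$nodes[position() &gt; 1]"/>
          <xsl:with-param name="name" select="$name"/>
          <xsl:with-param name="count" select="$count + 1"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$count"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Given <p><b>x </b><b><i>y</i></b><b> z</b> tail</p> as input, the three b elements form one batch, their children (text, i, text) are processed together as a node-set, and the result should be <p><b>x <i>y</i> z</b> tail</p>. Because the children of different b elements are handled as one node-set rather than through sibling axes, the same template merges batches at any depth.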