I am attempting to perform some text canonicalization to replace some contractions. Here is some example input:
<?xml version="1.0"?>
<transcript>
<p id="p1">
<s id="s1"><w>Here</w><w>'s</w> <w>an</w> <w>example</w>, <w>let</w><w>'s</w> <w>consider</w> <w>it</w></s>
<s id="s2"><w>Here</w> <w>'s</w> <w>an</w> <w>example</w>, <w>let</w><w>'s</w> <w>consider</w> <w>it</w></s>
<s id="s3"><foo><w>Here</w></foo><bar><w>'s</w></bar> <w>an</w> <w>example</w>, <foo><w>let</w></foo><w>'s</w> <w>consider</w> <w>it</w></s>
<s id="s4"><w>Here</w><bar><baz><w>'s</w></baz></bar> <w>an</w> <w>example</w>, <baz><bar><w>let</w></bar><w>'s</w></baz> <w>consider</w> <w>it</w></s>
<s id="s5"><w>Look</w> <w>here</w></s>
<s id="s6"><w>'s</w> <w>another</w> <w>example</w></s>
</p>
</transcript>
In this example, I want to replace “here’s” with “here is” and “let’s” with “let us”. Thus, my desired output is:
<?xml version="1.0"?>
<transcript>
<p id="p1">
<s id="s1"><w>Here</w> <w>is</w> <w>an</w> <w>example</w>, <w>let</w> <w>us</w> <w>consider</w> <w>it</w></s>
<s id="s2"><w>Here</w> <w>is</w> <w>an</w> <w>example</w>, <w>let</w> <w>us</w> <w>consider</w> <w>it</w></s>
<s id="s3"><foo><w>Here</w></foo> <bar><w>is</w></bar> <w>an</w> <w>example</w>, <foo><w>let</w></foo> <w>us</w> <w>consider</w> <w>it</w></s>
<s id="s4"><w>Here</w> <bar><baz><w>is</w></baz></bar> <w>an</w> <w>example</w>, <baz><bar><w>let</w></bar> <w>us</w></baz> <w>consider</w> <w>it</w></s>
<s id="s5"><w>Look</w> <w>here</w></s>
<s id="s6"><w>'s</w> <w>another</w> <w>example</w></s>
</p>
</transcript>
I was able to put together some code (probably neither elegant nor optimal) that handles s1 and s2, but I do not see how to generalize it into something useful:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="w[translate(text(),'S','s')=&quot;'s&quot;][preceding-sibling::*[1]/self::w[translate(text(),'HERE','here')='here']]">
<xsl:text> </xsl:text>
<xsl:copy><xsl:copy-of select="@*"/>is</xsl:copy>
</xsl:template>
<xsl:template match="w[translate(text(),'S','s')=&quot;'s&quot;][preceding-sibling::*[1]/self::w[translate(text(),'LET','let')='let']]">
<xsl:text> </xsl:text>
<xsl:copy><xsl:copy-of select="@*"/>us</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Some details:
- Assume words are all wrapped in <w> tags and that the “words” of interest are consecutive (though not necessarily siblings).
- Arbitrary tags may wrap one or the other or both of the word and the 's.
- The substitution should not cross sentence <s> boundaries (as shown in s5 and s6), though if this is impossible, I will not cry.
- If a space already exists between the word and the 's, I still want to replace the 's. The exact spacing of the result (one space or two) does not matter.
- Ideally, the space will be added to the nearest common ancestor of the two <w> tags containing the word and the 's.
Thanks for any guidance you can give!
This transformation fulfills all the requirements:
when applied on the provided XML document:
the wanted result is produced:
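For illustration, here is a minimal XSLT 1.0 sketch of one way to meet the requirements listed in the question. It is an assumption-laden example, not necessarily the stylesheet referred to above: the variable names `upper`, `lower`, and `apos-s` and the mode name `expansion` are made up for this sketch, and it assumes `<w>` elements never nest. The idea is to look up the nearest preceding `<w>` with the `preceding` axis (so wrappers such as `<foo>` or `<bar>` do not matter), to require that both words share the same `<s>` ancestor, and to emit the space in front of whichever element is a child of the nearest common ancestor of the two words.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml"/>

  <!-- Helper strings for case-insensitive comparison (names are illustrative) -->
  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <xsl:variable name="apos-s" select="&quot;'s&quot;"/>

  <!-- Identity rule: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Yields "is", "us", or nothing for a given <w>. The previous word is the
       nearest preceding <w>, and it must share the same <s> ancestor. -->
  <xsl:template match="w" mode="expansion">
    <xsl:variable name="s" select="ancestor::s[1]"/>
    <xsl:variable name="prev"
        select="preceding::w[1][generate-id(ancestor::s[1]) = generate-id($s)]"/>
    <xsl:variable name="prev-word" select="translate($prev, $upper, $lower)"/>
    <xsl:if test="translate(., $upper, $lower) = $apos-s">
      <xsl:choose>
        <xsl:when test="$prev-word = 'here'">is</xsl:when>
        <xsl:when test="$prev-word = 'let'">us</xsl:when>
      </xsl:choose>
    </xsl:if>
  </xsl:template>

  <!-- Every element: explicit priority avoids a conflict with the identity rule -->
  <xsl:template match="*" priority="1">
    <!-- First <w> in document order inside (or equal to) this element -->
    <xsl:variable name="first" select="descendant-or-self::w[1]"/>
    <xsl:variable name="expansion">
      <xsl:apply-templates select="$first" mode="expansion"/>
    </xsl:variable>
    <!-- Emit the space only where this element's parent also contains the
         previous word, i.e. the parent is the nearest common ancestor -->
    <xsl:if test="string($expansion) and
        $first/preceding::w[1][ancestor::*[generate-id() = generate-id(current()/..)]]">
      <xsl:text> </xsl:text>
    </xsl:if>
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:choose>
        <!-- A matching 's word: replace its content with the expansion -->
        <xsl:when test="self::w and string($expansion)">
          <xsl:value-of select="$expansion"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:apply-templates select="node()"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

With this sketch, s2 ends up with two spaces between "Here" and "is", since the existing space is kept and another is added, which the question explicitly allows. The `'s` in s6 is left alone because its nearest preceding `<w>` sits in a different `<s>`.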