I have this XML
<data>
  <peptides>
    <peptide>
      <accession>111</accession>
      <sequence>AAA</sequence>
      <score>4000</score>
    </peptide>
    <peptide>
      <accession>111</accession>
      <sequence>AAA</sequence>
      <score>6000</score>
    </peptide>
    <peptide>
      <accession>111</accession>
      <sequence>AAA</sequence>
      <score>5000</score>
    </peptide>
    <peptide>
      <accession>111</accession>
      <sequence>BBB</sequence>
      <score>5000</score>
    </peptide>
    <peptide>
      <accession>111</accession>
      <sequence>BBB</sequence>
      <score>1000</score>
    </peptide>
    <peptide>
      <accession>111</accession>
      <sequence>BBB</sequence>
      <score>8000</score>
    </peptide>
    <peptide>
      <accession>111</accession>
      <sequence>BBB</sequence>
      <score>5000</score>
    </peptide>
    <peptide>
      <accession>222</accession>
      <sequence>CCC</sequence>
      <score>5000</score>
    </peptide>
    <peptide>
      <accession>222</accession>
      <sequence>CCC</sequence>
      <score>9000</score>
    </peptide>
    <peptide>
      <accession>222</accession>
      <sequence>CCC</sequence>
      <score>2000</score>
    </peptide>
  </peptides>
</data>
With the following XSLT, I can get the peptides with `accession` "111" while eliminating the redundant sequences, so that I get this XML:
<root>
  <peptide>
    <accession>111</accession>
    <sequence>AAA</sequence>
    <score>4000</score>
  </peptide>
  <peptide>
    <accession>111</accession>
    <sequence>BBB</sequence>
    <score>5000</score>
  </peptide>
</root>
Here is the XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

  <!-- Index peptides by accession, and by the accession/sequence pair -->
  <xsl:key name="byAcc" match="/data/peptides/peptide" use="accession"/>
  <xsl:key name="byAccSeq" match="/data/peptides/peptide" use="concat(accession, '|', sequence)"/>

  <xsl:template match="/">
    <root>
      <!-- Muenchian grouping: keep only the first peptide of each accession/sequence group -->
      <xsl:apply-templates select="key('byAcc','111')
          [generate-id()
           = generate-id(key('byAccSeq', concat(accession, '|', sequence))[1])]">
        <xsl:sort select="sequence" data-type="text"/>
        <xsl:sort select="score" data-type="number"/>
      </xsl:apply-templates>
    </root>
  </xsl:template>

  <!-- Copy each selected peptide unchanged -->
  <xsl:template match="/data/peptides/peptide">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
And the live example here
The problem is that, within each group of redundant peptides, the "selected" node is simply the first one that appears in the original XML.
I need to select, among all the redundant peptides (i.e., those with the same accession and sequence), the one with the maximum score.
The desired XML would then be this:
<root>
  <peptide>
    <accession>111</accession>
    <sequence>AAA</sequence>
    <score>6000</score>
  </peptide>
  <peptide>
    <accession>111</accession>
    <sequence>BBB</sequence>
    <score>8000</score>
  </peptide>
</root>
If it is not clear, please let me know and I will re-edit the question. Thanks a lot.
Gerard
This transformation:
when applied to the provided XML document:
produces the wanted, correct result:
Explanation:
The template that processes the first element of each group gets all the elements in the current group (using the
key() function). It then uses a code snippet to find those elements of the group that have the maximum
score. Only the first such element is output.
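
The steps described above can be sketched in XSLT 1.0 as follows. This is a minimal reconstruction that reuses the `byAcc` and `byAccSeq` keys from the question, not necessarily the exact stylesheet of this answer. The variable name `vGroup` and the shortened `match="peptide"` patterns are assumptions made for illustration.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

  <!-- Same keys as in the question -->
  <xsl:key name="byAcc" match="peptide" use="accession"/>
  <xsl:key name="byAccSeq" match="peptide" use="concat(accession, '|', sequence)"/>

  <xsl:template match="/">
    <root>
      <!-- Visit only the first peptide of each accession/sequence group -->
      <xsl:apply-templates select="key('byAcc','111')
          [generate-id()
           = generate-id(key('byAccSeq', concat(accession, '|', sequence))[1])]">
        <xsl:sort select="sequence" data-type="text"/>
      </xsl:apply-templates>
    </root>
  </xsl:template>

  <xsl:template match="peptide">
    <!-- vGroup (hypothetical name): all peptides sharing this accession and sequence -->
    <xsl:variable name="vGroup"
        select="key('byAccSeq', concat(accession, '|', sequence))"/>
    <!-- Keep the peptides for which no other group member has a greater score,
         then output only the first of them -->
    <xsl:copy-of select="$vGroup[not($vGroup/score &gt; score)][1]"/>
  </xsl:template>
</xsl:stylesheet>
```

The predicate `not($vGroup/score &gt; score)` relies on XPath 1.0 node-set comparison: it is true only for a peptide whose score is not exceeded by any score in its group. For accession 111 in the question's document, this selects the peptide with score 6000 for AAA and the one with score 8000 for BBB.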