I want to introduce deterministic sorting to my [OWL](http://www.w3.org/TR/owl-ref/) file so that I can compare a modified file to the original and more easily see where it has changed. The file is produced by a tool (Protege), and the ordering of elements varies semi-randomly.
The problem is that the sorting can’t be based on simple things like a given element’s name and attributes. Often the differences appear only in child nodes a few levels below.
Example:
    <owl:Class rdf:about="#SomeFooClass">
      <rdfs:subClassOf><!-- subclass definition 1 -->
        <owl:Restriction>
          <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
          >1</owl:maxCardinality>
          <owl:onProperty>
            <owl:DatatypeProperty rdf:ID="negate"/>
          </owl:onProperty>
        </owl:Restriction>
      </rdfs:subClassOf>
      <rdfs:subClassOf><!-- subclass definition 2 -->
        <owl:Restriction>
          <owl:onProperty>
            <owl:DatatypeProperty rdf:about="#name"/>
          </owl:onProperty>
          <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
          >1</owl:maxCardinality>
        </owl:Restriction>
      </rdfs:subClassOf>
    </owl:Class>
Here subclass definitions 1 and 2 (and the child elements inside them) vary in order: sometimes 1 comes first, sometimes 2.
I implemented a sort based on a few common direct attributes such as about and ID. While this fixes many ambiguous orderings, it can’t fix this one. XSLT:
    <xsl:stylesheet version="2.0"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output omit-xml-declaration="yes" indent="yes"/>
      <xsl:strip-space elements="*"/>
      <xsl:template match="@* | node()">
        <xsl:copy>
          <xsl:apply-templates select="@* | node()">
            <xsl:sort select="@rdf:about" data-type="text"/>
            <xsl:sort select="@rdf:ID" data-type="text"/>
          </xsl:apply-templates>
        </xsl:copy>
      </xsl:template>
    </xsl:stylesheet>
I’m thinking that maybe the solution needs to calculate some kind of “hash-code” for each element that takes into account the contents of all its child elements. This way subclass definition 1 could have hash-code 3487631 and subclass definition 2 could have 45612, and sorting between them would be deterministic (as long as their child elements are unmodified).
EDIT: Just realized that the hash-code calculation must not depend on the ordering of the child nodes, otherwise it cannot achieve what it is meant to do.
I could sort primarily by the known direct attribute values and fall back to the hash-code when those are equal. I would probably end up with something like:
    <xsl:stylesheet version="2.0"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:my="urn:my-functions"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- The "my" namespace URI above is an arbitrary placeholder; the "my" and "xs" prefixes must be declared for this to parse. -->
      <xsl:output omit-xml-declaration="yes" indent="yes"/>
      <xsl:strip-space elements="*"/>
      <xsl:template match="@* | node()">
        <xsl:copy>
          <xsl:apply-templates select="@* | node()">
            <xsl:sort select="@rdf:about" data-type="text"/>
            <xsl:sort select="@rdf:ID" data-type="text"/>
            <xsl:sort select="my:hashCode(.)" />
          </xsl:apply-templates>
        </xsl:copy>
      </xsl:template>
      <xsl:function name="my:hashCode" as="xs:string">
        ...
      </xsl:function>
    </xsl:stylesheet>
but I have no clue how to implement my:hashCode.
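To make the order-independent hash-code idea concrete, here is a minimal sketch in Java rather than XSLT. It builds a canonical string per element by sorting the attribute strings and the keys of the children before concatenating them, so two elements that differ only in child order get the same key. The class name `CanonicalKey` and its methods are hypothetical and not part of Protege or any OWL library. The sketch also assumes that comments and whitespace-only text can be ignored and that the input has no mixed content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, written only to illustrate the idea.
class CanonicalKey {

    /** Returns a string that is the same for any ordering of attributes and children. */
    static String of(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            return node.getNodeValue().trim();
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return ""; // assumption: comments and other node types do not matter
        }
        List<String> attributes = new ArrayList<>();
        NamedNodeMap map = node.getAttributes();
        for (int i = 0; i < map.getLength(); i++) {
            Node a = map.item(i);
            attributes.add(a.getNodeName() + "=" + a.getNodeValue());
        }
        Collections.sort(attributes);

        List<String> children = new ArrayList<>();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            String key = of(c);
            if (!key.isEmpty()) {
                children.add(key); // skips ignored nodes and blank text
            }
        }
        Collections.sort(children); // this removes the dependence on child order

        return "<" + node.getNodeName() + attributes + children + ">";
    }

    /** Parses an XML string; checked exceptions are wrapped to keep the sketch short. */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Using the full string as the sort key avoids the collisions a numeric hash could have. If a number is really needed, `CanonicalKey.of(node).hashCode()` would work, with the usual caveat that two different elements may then share a value.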
EDIT: as requested, a few examples. The tool may, more or less randomly, produce for example the following kinds of results (1-3) when saving the same data:
1.
    <owl:Class rdf:about="#SomeFooClass">
      <rdfs:subClassOf><!-- subclass definition 1 -->
        <owl:Restriction>
          <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
          >1</owl:maxCardinality>
          <owl:onProperty>
            <owl:DatatypeProperty rdf:ID="negate"/>
          </owl:onProperty>
        </owl:Restriction>
      </rdfs:subClassOf>
      <rdfs:subClassOf><!-- subclass definition 2 -->
        <owl:Restriction>
          <owl:onProperty>
            <owl:DatatypeProperty rdf:about="#name"/>
          </owl:onProperty>
          <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
          >1</owl:maxCardinality>
        </owl:Restriction>
      </rdfs:subClassOf>
    </owl:Class>
2.
    <owl:Class rdf:about="#SomeFooClass">
      <rdfs:subClassOf><!-- subclass definition 2 -->
        <owl:Restriction>
          <owl:onProperty>
            <owl:DatatypeProperty rdf:about="#name"/>
          </owl:onProperty>
          <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
          >1</owl:maxCardinality>
        </owl:Restriction>
      </rdfs:subClassOf>
      <rdfs:subClassOf><!-- subclass definition 1 -->
        <owl:Restriction>
          <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
          >1</owl:maxCardinality>
          <owl:onProperty>
            <owl:DatatypeProperty rdf:ID="negate"/>
          </owl:onProperty>
        </owl:Restriction>
      </rdfs:subClassOf>
    </owl:Class>
3.
    <owl:Class rdf:about="#SomeFooClass">
      <rdfs:subClassOf><!-- subclass definition 2 -->
        <owl:Restriction>
          <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
          >1</owl:maxCardinality>
          <owl:onProperty>
            <owl:DatatypeProperty rdf:about="#name"/>
          </owl:onProperty>
        </owl:Restriction>
      </rdfs:subClassOf>
      <rdfs:subClassOf><!-- subclass definition 1 -->
        <owl:Restriction>
          <owl:onProperty>
            <owl:DatatypeProperty rdf:ID="negate"/>
          </owl:onProperty>
          <owl:maxCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int"
          >1</owl:maxCardinality>
        </owl:Restriction>
      </rdfs:subClassOf>
    </owl:Class>
These examples are a simplified version of the structure but should show the principle. I want to implement an XSLT sort that produces identical output for all 3 examples. Whether the transformed result looks like version 1, 2, or 3 (or some other ordering) is not that important.
I ended up implementing the sorting in Java after all.
Basically I sort the DOM recursively, starting from the children.
The actual sorting first compares the element name and a few important attributes and, if those are all equal, compares a normalized string conversion of the node and its children (the children’s contents are guaranteed to be already sorted at this point).
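The original Java code is not reproduced here, so the following is only a rough sketch of how such a recursive sort could look. The class `DomSorter`, its method names, and the choice of `rdf:about` and `rdf:ID` as the "important" attributes are assumptions for illustration. It also assumes whitespace-only text nodes can be dropped (as `xsl:strip-space` does) and that the document has no mixed content where reordering text would change the meaning.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch; not the code from the original answer.
class DomSorter {

    /** Sorts the children of every node in place, deepest nodes first. */
    static void sort(Node node) {
        List<Node> kept = new ArrayList<>();
        for (Node c = node.getFirstChild(); c != null; ) {
            Node next = c.getNextSibling();
            if (c.getNodeType() == Node.TEXT_NODE && c.getNodeValue().trim().isEmpty()) {
                node.removeChild(c); // drop indentation-only text
            } else {
                sort(c);             // children are sorted before their parent
                kept.add(c);
            }
            c = next;
        }
        kept.sort(DomSorter::compare);
        for (Node c : kept) {
            node.appendChild(c);     // appendChild moves an existing child to the end
        }
    }

    /** Name first, then rdf:about and rdf:ID, then the normalized subtree string. */
    static int compare(Node a, Node b) {
        int c = a.getNodeName().compareTo(b.getNodeName());
        if (c == 0) c = attr(a, "rdf:about").compareTo(attr(b, "rdf:about"));
        if (c == 0) c = attr(a, "rdf:ID").compareTo(attr(b, "rdf:ID"));
        if (c == 0) c = text(a).compareTo(text(b));
        return c;
    }

    static String attr(Node n, String name) {
        return n instanceof Element ? ((Element) n).getAttribute(name) : "";
    }

    /** String form of a subtree with attributes sorted and children in their current order. */
    static String text(Node n) {
        if (!(n instanceof Element)) {
            return n.getNodeName() + ":" + String.valueOf(n.getNodeValue()).trim();
        }
        List<String> attrs = new ArrayList<>();
        NamedNodeMap map = n.getAttributes();
        for (int i = 0; i < map.getLength(); i++) {
            attrs.add(map.item(i).getNodeName() + "=" + map.item(i).getNodeValue());
        }
        Collections.sort(attrs);
        StringBuilder sb = new StringBuilder("<" + n.getNodeName() + attrs + ">");
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            sb.append(text(c));
        }
        return sb.append("</").append(n.getNodeName()).append(">").toString();
    }

    /** Parses, sorts, and returns the normalized string of the root element. */
    static String normalize(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            sort(root);
            return text(root);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because `sort` works bottom-up, `text` can read the children in document order when two siblings are compared: their subtrees have already been put into a stable order. After `DomSorter.sort(document.getDocumentElement())`, the DOM could be written back out with a `javax.xml.transform.Transformer` and diffed against the sorted original.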