I have been trying, without success, to take an XML document exported by a proprietary database and transform it into a well-formed XML document that will eventually be indexed by Apache Solr.
I would like to take the XML file below and transform it into the Apache Solr format shown after it.
<?xml version="1.0" encoding="UTF-8" ?>
<ecatalogue>
  <tuple>
    <table name="CatObjectName_tab">
      <tuple>
        <atom name="CatObjectName">Clog</atom>
      </tuple>
    </table>
    <atom name="CatObjectNumber">2003-39-27A</atom>
    <atom name="CatObjectTitle"></atom>
    <table name="CatOtherNumbers_tab">
      <tuple>
        <atom name="CatOtherNumbers">1895.1.117a</atom>
      </tuple>
    </table>
    <table name="ProPlaceName_tab">
      <tuple>
        <atom name="ProPlaceName">China</atom>
      </tuple>
    </table>
    <table name="CatOtherNumberType_tab">
      <tuple>
        <atom name="CatOtherNumberType">Other Number</atom>
      </tuple>
    </table>
    <atom name="DatDateMade"></atom>
    <atom name="DatEarliestDateMadeOrig"></atom>
    <atom name="DatLatestDateMadeOrig"></atom>
  </tuple>
  <tuple>
    <table name="CatObjectName_tab">
      <tuple>
        <atom name="CatObjectName">Boot</atom>
      </tuple>
    </table>
    <atom name="CatObjectNumber">2003-39-20B</atom>
    <atom name="CatObjectTitle"></atom>
    <table name="CatOtherNumbers_tab">
      <tuple>
        <atom name="CatOtherNumbers">1895.1.91b</atom>
      </tuple>
    </table>
    <table name="ProPlaceName_tab">
      <tuple>
        <atom name="ProPlaceName">China</atom>
      </tuple>
    </table>
    <table name="CatOtherNumberType_tab">
      <tuple>
        <atom name="CatOtherNumberType">Other Number</atom>
      </tuple>
    </table>
    <atom name="DatDateMade"></atom>
    <atom name="DatEarliestDateMadeOrig"></atom>
    <atom name="DatLatestDateMadeOrig"></atom>
  </tuple>
</ecatalogue>
I would like to transform the above into this:
<add>
  <doc>
    <field name="ProPlaceName">China</field>
    <field name="CatObjectTitle"></field>
    <field name="CatObjectNumber">2003-39-27A</field>
    <field name="CatOtherNumberType">Other Number</field>
    <field name="CatOtherNumbers">1895.1.117a</field>
    <field name="CatObjectName_tab">Clog</field>
    <field name="DatDateMade"></field>
    <field name="DatEarliestDateMadeOrig"></field>
    <field name="DatLatestDateMadeOrig"></field>
  </doc>
  <!-- Row 2 -->
  <doc>
    <field name="ProPlaceName">China</field>
    <field name="CatObjectTitle"></field>
    <field name="CatObjectNumber">2003-39-20B</field>
    <field name="CatOtherNumberType">Other Number</field>
    <field name="CatOtherNumbers">1895.1.91b</field>
    <field name="CatObjectName_tab">Boot</field>
    <field name="DatDateMade"></field>
    <field name="DatEarliestDateMadeOrig"></field>
    <field name="DatLatestDateMadeOrig"></field>
  </doc>
</add>
Is it best to use XSL/XSLT for this, or to write the transformation in Java or another programming language? How would you approach this problem, and can you point me in the right direction?
I believe it can be done using XSL. Any help is appreciated.
Here's something that should help. It's fairly simple and assumes that you want to skip the nested table structure itself and grab only the atoms inside each table. It does not sort the fields into any particular order.
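A minimal sketch of that approach as an XSLT 1.0 stylesheet follows. It is an illustration under a few assumptions, not necessarily the exact stylesheet intended here: every top-level tuple under ecatalogue becomes one doc, every atom at any depth inside it becomes one field, and each field is named after the atom's own name attribute. That last point means the first sample record would produce a field called CatObjectName rather than CatObjectName_tab as in the desired output. Fields come out in document order, and empty atoms may be serialized as self-closing field elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <!-- Drop whitespace-only text nodes from the source document -->
  <xsl:strip-space elements="*"/>

  <!-- The root element becomes the Solr <add> wrapper -->
  <xsl:template match="/ecatalogue">
    <add>
      <xsl:apply-templates select="tuple"/>
    </add>
  </xsl:template>

  <!-- Each top-level tuple is one record, so it becomes one <doc>.
       Selecting .//atom flattens the record: atoms nested inside
       table/tuple are picked up, the wrappers themselves are ignored. -->
  <xsl:template match="ecatalogue/tuple">
    <doc>
      <xsl:apply-templates select=".//atom"/>
    </doc>
  </xsl:template>

  <!-- Each atom becomes a <field> carrying the same name and text -->
  <xsl:template match="atom">
    <field name="{@name}">
      <xsl:value-of select="."/>
    </field>
  </xsl:template>

</xsl:stylesheet>
```

With an XSLT 1.0 processor such as xsltproc, this could be run as `xsltproc to-solr.xsl ecatalogue.xml > solr-add.xml`, where all three file names are placeholders. If the fields need a fixed order, an xsl:sort inside the apply-templates that selects the atoms would be one way to get it.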