I have an XML structure that looks like this:
<root>
<index>
<item>item 1</item>
<item>item 2</item>
<!-- many more items -->
</index>
<data>
<row>
<!-- relates to item 1 -->
<cell>1</cell>
<cell>2</cell>
<!-- many more cells -->
</row>
<row>
<!-- relates to item 2 -->
<cell>3</cell>
<cell>4</cell>
<!-- many more cells -->
</row>
<!-- as many rows as there are items in the index -->
</data>
</root>
I’m trying to create a parser that outputs (to a database) a structure like this:
item 1 : [1, 2, ...]
item 2 : [3, 4, ...]
...
Normally, I’d use a SAX parser, construct a HashMap, fill in the keys when the parser passes the index element, and add the cell data afterwards.
However, the document may contain a lot of data so I’m afraid I will run into memory issues.
My question is: how do I parse the file with as little memory usage as possible?
One thing I thought about was to construct two SAX parsers, one that runs over the index and another that parses the data. The problem is I have no idea how I can suspend one parser, start the other, suspend the other, restart the first one and so on.
Is this possible or are there better ways to deal with this?
BTW: sadly, I have absolutely no control over the format of the XML.
The SAX parser isn’t going to need to keep much in memory other than the hash map. I would SAX-parse the index element to generate a List<Item>, and then for each row element (one per item) remove the corresponding item from the list (assert that it is in there, then remove it) and add it to a Map<Item, List<Cell>>. The memory you are going to need is one entry per item plus one entry per cell. I don’t think you need to maintain much more context than that when parsing with SAX.
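Here is a minimal sketch of that idea, assuming the rows appear in the same order as the items in the index (as the comments in the sample XML suggest). The class name IndexedRowHandler, the sink callback and the parse helper are hypothetical names chosen for illustration, and plain String stands in for the Item and Cell types. Only the pending item names and the cells of the current row are held by the handler; what happens to a finished row is up to the sink.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: queues the index items, then pairs each row with the next item.
public class IndexedRowHandler extends DefaultHandler {
    private final Deque<String> pendingItems = new ArrayDeque<>();
    private final BiConsumer<String, List<String>> sink; // receives (item, cells) per row
    private final StringBuilder text = new StringBuilder();
    private List<String> cells; // cells of the row currently being parsed

    public IndexedRowHandler(BiConsumer<String, List<String>> sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for this element
        if ("row".equals(qName)) {
            cells = new ArrayList<>();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several chunks, so accumulate it
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (qName) {
            case "item":
                pendingItems.addLast(text.toString().trim());
                break;
            case "cell":
                cells.add(text.toString().trim());
                break;
            case "row":
                // Assumption: the n-th row belongs to the n-th item of the index
                if (pendingItems.isEmpty()) {
                    throw new IllegalStateException("more rows than index items");
                }
                sink.accept(pendingItems.removeFirst(), cells);
                cells = null;
                break;
            default:
                break;
        }
    }

    // Convenience for small inputs: collects every row into a map.
    public static Map<String, List<String>> parse(String xml) throws Exception {
        Map<String, List<String>> result = new LinkedHashMap<>();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new IndexedRowHandler(result::put));
        return result;
    }
}
```

For a large document you would probably not use the parse helper at all. Passing a sink that writes each (item, cells) pair to the database as soon as the row ends means the map never has to exist, and memory stays at roughly the size of the index plus one row.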