I’m writing an application which processes a lot of XML files (>1000) with deep node structures. It takes about six seconds with Woodstox (Event API) to parse a file with 22,000 nodes.
The algorithm runs inside a process with user interaction, where only a few seconds of response time are acceptable. So I need to improve my strategy for handling the XML files.
- My process analyses the XML files (extracts only a few nodes).
- Extracted nodes are processed and the new result is written into a new data stream (resulting in a copy of the document with modified nodes).
Now I’m thinking about a multithreaded solution (which scales better on hardware with 16 or more cores). I thought about the following strategies:
- Creating multiple parsers and running them in parallel on the XML sources.
- Rewriting my parsing algorithm to be thread-safe so that it uses only one instance of the parser (factories, …)
- Splitting the XML source into chunks and assigning the chunks to multiple processing threads (map-reduce XML – serial)
- Optimizing my algorithm (a better StAX parser than Woodstox?) / using a parser with built-in concurrency
I want to improve both the overall performance and the “per file” performance.
Do you have experience with such problems? What is the best way to go?
Running multiple parsers in parallel is the obvious one: just create several parsers and run each of them in its own thread.
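A minimal sketch of that approach, using the standard StAX API (javax.xml.stream) rather than anything Woodstox-specific. The class name ParallelStaxSketch and the helpers countElements and countAll are made up for illustration, and counting start elements merely stands in for the real per-file work. Each worker thread gets its own XMLInputFactory through a ThreadLocal, on the assumption that the factory in use may not be safe to share; every document gets its own reader.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: one StAX reader per document, documents parsed in parallel.
class ParallelStaxSketch {

    // One factory per worker thread (assumption: the factory may not be thread-safe).
    private static final ThreadLocal<XMLInputFactory> FACTORY =
            ThreadLocal.withInitial(() -> XMLInputFactory.newInstance());

    // Placeholder for the real per-file work: counts START_ELEMENT events.
    static int countElements(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                FACTORY.get().createXMLStreamReader(new StringReader(xml));
        try {
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            return count;
        } finally {
            reader.close();
        }
    }

    // Submits one task per document and returns the results in input order.
    static List<Integer> countAll(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String xml : documents) {
                Callable<Integer> task = () -> countElements(xml);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> future : futures) {
                counts.add(future.get()); // blocks until that document is done
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parsing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling ParallelStaxSketch.countAll(documents, Runtime.getRuntime().availableProcessors()) would use one thread per core. This only helps the overall throughput across many files; the time for a single file stays the same.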
Take a look at Woodstox Performance (down at the moment, try google cache).
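Splitting the source into chunks can be done if the structure of your XML is predictable, i.e. if it consists of many top-level elements of the same kind, for instance a root element that contains a long run of sibling <element> entries.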
In this case you could create a simple splitter that searches for <element> and feeds each such part to its own parser instance. That’s a simplified approach: in real life I’d go with RandomAccessFile to find the start and stop points (<element>) and then create a custom FileInputStream that operates on just that part of the file.
Take a look at Aalto. It comes from the same people who created Woodstox. They are experts in this area, so don’t reinvent the wheel.
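The splitter idea described above could look roughly like the following in-memory sketch. It is not the RandomAccessFile variant: it works on a String to stay short. The names ChunkSplitSketch, splitTopLevel, textOf and processInParallel are invented for illustration, and the code assumes that the repeated elements are never nested inside each other, are not self-closing, and that the tag text does not also occur inside comments or CDATA sections. A plain text search is fragile, so treat this as a starting point only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: cut a document into <name>...</name> fragments
// by text search, then parse every fragment with its own StAX reader.
class ChunkSplitSketch {

    // Returns each "<name ...>...</name>" fragment, in document order.
    static List<String> splitTopLevel(String xml, String name) {
        String open = "<" + name;
        String close = "</" + name + ">";
        List<String> chunks = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(open, from);
            if (start < 0) break;
            int after = start + open.length();
            // Reject longer names that merely start with the same text.
            boolean exactName = after < xml.length()
                    && (xml.charAt(after) == '>'
                        || Character.isWhitespace(xml.charAt(after)));
            if (!exactName) {
                from = start + 1;
                continue;
            }
            int end = xml.indexOf(close, after);
            if (end < 0) break;
            end += close.length();
            chunks.add(xml.substring(start, end));
            from = end;
        }
        return chunks;
    }

    // Placeholder for the real per-chunk work: collects the character data.
    static String textOf(String chunk) {
        try {
            // A fresh factory per chunk keeps the sketch simple; reuse one per thread in practice.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(chunk));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("chunk is not well-formed XML", e);
        }
    }

    // Parses all fragments in parallel; the result keeps document order.
    static List<String> processInParallel(String xml, String name) {
        return splitTopLevel(xml, name).parallelStream()
                .map(ChunkSplitSketch::textOf)
                .collect(Collectors.toList());
    }
}
```

Because every fragment is parsed as a standalone document, namespace prefixes or entities declared outside the fragment would not resolve. Whether the split pays off also depends on how expensive the search is compared to the parsing itself.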