I’m writing an application which processes a lot of xml files (>1000) with deep

Question


Asked: May 18, 2026 (2026-05-18T03:58:15+00:00)


I’m writing an application which processes a lot of XML files (>1000) with deep node structures. It takes about six seconds with Woodstox (Event API) to parse a file with 22,000 nodes.

The algorithm runs inside a process with user interaction, where only a few seconds of response time are acceptable. So I need a better strategy for handling the XML files.

My process analyses the XML files and extracts only a few nodes.
The extracted nodes are processed and the result is written to a new data stream, producing a copy of the document with the modified nodes.
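
For reference, a copy-with-modification pass of this kind can be sketched with the standard StAX event API roughly as follows. This is only a minimal illustration, not the actual application code: the class name `CopyWithEdit`, the element name passed in, and the upper-casing used as the "modification" are all assumptions made for the example, and it assumes the target element is never nested inside itself.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class CopyWithEdit {

    // Copies every event to the output; text inside <elementName> is upper-cased
    // as a stand-in for the real node processing (hypothetical modification).
    static String rewrite(String xml, String elementName) throws XMLStreamException {
        XMLEventFactory events = XMLEventFactory.newFactory();
        StringWriter out = new StringWriter();
        XMLEventReader reader =
                XMLInputFactory.newFactory().createXMLEventReader(new StringReader(xml));
        XMLEventWriter writer =
                XMLOutputFactory.newFactory().createXMLEventWriter(out);

        boolean inside = false; // true while we are between <elementName> and </elementName>
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()
                    && e.asStartElement().getName().getLocalPart().equals(elementName)) {
                inside = true;
            } else if (e.isEndElement()
                    && e.asEndElement().getName().getLocalPart().equals(elementName)) {
                inside = false;
            }
            if (inside && e.isCharacters()) {
                // Replace the text event with a modified one; everything else is copied as-is.
                e = events.createCharacters(e.asCharacters().getData().toUpperCase());
            }
            writer.add(e);
        }
        writer.close();
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(rewrite("<doc><name>alice</name><id>x1</id></doc>", "name"));
    }
}
```

`XMLInputFactory.newFactory()` picks up whichever StAX implementation is on the classpath, so the same sketch should run against Woodstox or the JDK's built-in parser.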

Now I’m thinking about a multithreaded solution (which scales better on 16-core+ hardware). I thought about the following strategies:

1. Creating multiple parsers and running them in parallel on the XML sources.
2. Rewriting my parsing algorithm to be thread-safe so that it uses only one instance of the parser (factories, …).
3. Splitting the XML source into chunks and assigning the chunks to multiple processing threads (map-reduce XML – serial).
4. Optimizing my algorithm (a better StAX parser than Woodstox?) or using a parser with built-in concurrency.

I want to improve both the overall performance and the “per file” performance.

Do you have experience with such problems? What is the best way to go?


1 Answer

Editorial Team · Answer 1 · 2026-05-18T03:58:16+00:00

The first option is the obvious one: just create several parsers and run them in parallel in multiple threads.
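
A minimal sketch of that idea, using only the JDK's StAX API and an `ExecutorService`. The class and method names are made up for the example, the per-document "work" is just counting start elements, and the documents are in-memory strings rather than files. Keeping one factory per thread is a cautious assumption, so the sketch does not depend on any particular implementation's factory being thread-safe.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ParallelStaxParse {

    // One factory per worker thread, created lazily on first use.
    private static final ThreadLocal<XMLInputFactory> FACTORY =
            ThreadLocal.withInitial(XMLInputFactory::newFactory);

    // Stand-in for the real per-document work: count the start elements.
    static int countStartElements(String xml) throws Exception {
        XMLStreamReader reader =
                FACTORY.get().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    // Each document gets its own parser instance; results come back in input order.
    static List<Integer> parseAll(List<String> documents, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String doc : documents) {
                tasks.add(() -> countStartElements(doc));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = List.of("<a><b/><c/></a>", "<x><y><z/></y></x>", "<root/>");
        System.out.println(parseAll(docs, 4)); // prints [3, 3, 1]
    }
}
```

This helps overall throughput across many files; it does nothing for the time a single file takes.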
For the second, take a look at the Woodstox Performance page (down at the moment; try the Google cache).
The third can be done if the structure of your XML is predictable, that is, if it consists of many top-level elements of the same kind. For instance:
```
<element>
    <more>more elements</more>
</element> 
<element>
    <other>other elements</other>
</element>
```
In this case you could write a simple splitter that searches for <element> and feeds each such part to its own parser instance. That is the simplified approach; in real life I’d use RandomAccessFile to find the start and stop points (<element>) and then create a custom FileInputStream that operates on just that part of the file.
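
A rough in-memory sketch of the simplified splitter, to make the idea concrete. It is not the RandomAccessFile variant: it works on a `String` and uses plain `indexOf`, and it assumes the tag has no attributes, is never nested inside itself, and never appears inside comments or CDATA. The class name `ChunkSplitter` and the per-chunk work (reading the name of the first child element) are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ChunkSplitter {

    // One factory per worker thread, so no assumption about factory thread safety.
    private static final ThreadLocal<XMLInputFactory> FACTORY =
            ThreadLocal.withInitial(XMLInputFactory::newFactory);

    // Cuts the input into "<tag>...</tag>" pieces by plain text search.
    static List<String> split(String xml, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> chunks = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(open, from);
            if (start < 0) break;
            int end = xml.indexOf(close, start);
            if (end < 0) break;
            end += close.length();
            chunks.add(xml.substring(start, end));
            from = end;
        }
        return chunks;
    }

    // Stand-in for the real per-chunk work: name of the first child element.
    static String firstChildName(String chunk) {
        try {
            XMLStreamReader reader =
                    FACTORY.get().createXMLStreamReader(new StringReader(chunk));
            try {
                reader.nextTag(); // the chunk's root element
                reader.nextTag(); // its first child (whitespace is skipped)
                return reader.getLocalName();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse chunk: " + chunk, e);
        }
    }

    // Every chunk is parsed by its own parser instance, in parallel.
    static List<String> parseChunks(List<String> chunks) {
        return chunks.parallelStream()
                .map(ChunkSplitter::firstChildName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String xml = "<element>\n    <more>more elements</more>\n</element>\n"
                + "<element>\n    <other>other elements</other>\n</element>";
        System.out.println(parseChunks(split(xml, "element"))); // prints [more, other]
    }
}
```

Whether this pays off depends on the chunks being large enough that parsing dominates the cost of the text search and thread hand-off.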
For the fourth, take a look at Aalto, from the same people who created Woodstox. They are experts in this area; don’t reinvent the wheel.
