I need to parse several large XML files (one is ~8 GB, the others are ~4 MB each) and merge them. Since neither SAX nor Tie::File is suitable because of memory and time constraints, I decided to try XML::Twig.
Suppose each XML file is composed of several elements as follows:
<class name="math">
<student>luke1</student>
... (a very long list of students)
<student>luke8000000</student>
</class>
<class name="english">
<student>mary1</student>
...
<student>mary1000000</student>
</class>
As you can see, even if I use TwigRoots => {"class[\@name='english']" => \&counter}, I still have to wait a long time before Twig starts parsing the english class, because it has to read through every line of the math class first (correct me if that is not the case).
Is there any way to make Twig start parsing at a given line number rather than at the beginning of the file? I can get the line number of the opening tag of the english class with grep, which is much faster.
Thanks in advance.
Perhaps this example will give you some ideas for an alternative strategy. In particular, you might be able to combine the idea in index_file with Zoul’s suggestion about seeking to a location before passing off the file handle to XML::Twig.
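To make that combination concrete, here is a minimal sketch. It is not the original index_file: the version below is my own guess at the idea, and count_students is a made-up name for illustration. It assumes each <class ...> start tag sits on its own line, that attribute values are quoted (expat rejects unquoted ones), and that your XML::Twig version provides finish_now. It records byte offsets rather than line numbers, because seek works on bytes (GNU grep -b should give you the same kind of number).

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: one sequential pass that records the byte offset
# at which each <class name="..."> start tag begins.
sub index_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my %offset;
    while (1) {
        my $line_start = tell $fh;            # byte position before this line
        defined(my $line = <$fh>) or last;
        if ($line =~ /<class\s+name\s*=\s*["']?([^"'\s>]+)/) {
            # $-[0] is where the match starts inside the line
            $offset{$1} = $line_start + $-[0];
        }
    }
    close $fh;
    return \%offset;
}

# Hypothetical helper: seek straight to one <class> element and let
# XML::Twig parse only that element.
sub count_students {
    my ($path, $offset) = @_;
    require XML::Twig;                        # loaded on demand
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    seek $fh, $offset, SEEK_SET or die "Cannot seek in $path: $!";

    my $count = 0;
    my $twig  = XML::Twig->new(
        twig_handlers => {
            # Count each student, then free what has been parsed so far
            student => sub { my ($t) = @_; $count++; $t->purge; },
            # Stop at </class>; whatever follows in the file would not be
            # well-formed from the parser's point of view
            class   => sub { $_[0]->finish_now; },
        },
    );
    $twig->parse($fh);                        # starts at the current position
    close $fh;
    return $count;
}
```

You would call index_file once, keep the offsets (or save them to disk), and then call something like count_students($path, $offsets->{english}). Building the index still reads the whole 8 GB file, but it is a plain line scan with no XML parsing, and it only has to happen once.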