I need to rearrange a large XML document (size > 50 GB) into a given order of tags.
For example:
order[] = {o3,o2,o1};
Input file:
<objects>
<o1>
// Some Data
</o1>
<o2>
// Some Data
</o2>
<o3>
// Some Data
</o3>
</objects>
Output file:
<objects>
<o3>
// Some Data
</o3>
<o2>
// Some Data
</o2>
<o1>
// Some Data
</o1>
</objects>
My approach:
I read the file from the start until I encounter the objects tag, then I write each o1, o2, o3 element to a temporary file for its tag, and continue until I reach the end of the file. Then I create a new file by concatenating the temporary files in the given order. I used C++ ifstream and ofstream for this.
This approach took 6 hours to complete the task.
The function prototype is: void Rearrange(string tag, string inputfile);
The object count in the 50GB file is greater than 12000000.
Can anyone suggest another approach to improve the performance?
Thanks in advance.
That’s fairly easy. Get a 64-bit machine and memory-map the entire input and output files. Record a pointer to the start of each element in the input file, sort the pointers by tag, and copy the elements in sorted order to the output file. Your disk performance will become the main bottleneck.
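A minimal sketch of the index, sort, and copy steps, written against an in-memory buffer so it can be tried on a small sample. With a memory-mapped file, the std::string_view would presumably just wrap the mapped region, and the output would be written into the mapped output file instead of a std::string. The names RearrangeBuffer and Slice are made up for illustration, and the sketch assumes the children of the root are flat elements without attributes, with no comments between them and no element nested inside another element of the same name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// One direct child of the root: where it sits in the input and how it ranks.
struct Slice {
    std::size_t begin;  // offset of the '<' of the opening tag
    std::size_t end;    // offset of the next sibling (or of the closing root tag)
    std::size_t rank;   // position of the tag name in the requested order
};

// Hypothetical helper: reorder the direct children of <root> by tag name.
// Returns the input unchanged if it does not match the assumptions above.
std::string RearrangeBuffer(std::string_view doc, std::string_view root,
                            const std::vector<std::string>& order) {
    constexpr std::size_t npos = std::string_view::npos;

    // Map each tag name to its rank; unknown tags sort after all known ones.
    std::unordered_map<std::string_view, std::size_t> rank;
    for (std::size_t i = 0; i < order.size(); ++i)
        rank.emplace(std::string_view(order[i]), i);

    const std::string open_root = "<" + std::string(root) + ">";
    const std::string close_root = "</" + std::string(root) + ">";
    const std::size_t body_begin = doc.find(open_root);
    const std::size_t body_end = doc.rfind(close_root);
    if (body_begin == npos || body_end == npos ||
        body_end < body_begin + open_root.size())
        return std::string(doc);

    // Pass 1: record the byte range of every child element.
    std::vector<Slice> slices;
    const std::size_t first_child = doc.find('<', body_begin + open_root.size());
    std::size_t pos = first_child;
    while (pos < body_end) {
        const std::size_t name_end = doc.find('>', pos);
        const std::string_view name = doc.substr(pos + 1, name_end - pos - 1);
        const std::string close = "</" + std::string(name) + ">";
        const std::size_t close_pos = doc.find(close, name_end);
        if (close_pos == npos || close_pos >= body_end) return std::string(doc);
        const std::size_t next = doc.find('<', close_pos + close.size());
        if (next == npos || next > body_end) return std::string(doc);
        const auto it = rank.find(name);
        slices.push_back({pos, next, it == rank.end() ? order.size() : it->second});
        pos = next;
    }

    // Pass 2: sort the small index, not the data. stable_sort keeps elements
    // with the same tag in their original relative order.
    std::stable_sort(slices.begin(), slices.end(),
                     [](const Slice& a, const Slice& b) { return a.rank < b.rank; });

    // Pass 3: copy each byte range to the output in sorted order.
    std::string out;
    out.reserve(doc.size());
    out.append(doc.substr(0, first_child));
    for (const Slice& s : slices) out.append(doc.substr(s.begin, s.end - s.begin));
    out.append(doc.substr(body_end));
    assert(out.size() == doc.size());  // bytes are only moved, never added or dropped
    return out;
}
```

With this layout each index entry is 24 bytes on a typical 64-bit platform, so about 12,000,000 objects need roughly 288 MB for the index, and only the index is sorted. The copy pass reads the input in a scattered order but writes the output sequentially.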