I have a large xml file that looks like this:
20120124 07:30:15.301, saving to queue
<logmessage>
<logline1>some data</logline1>
<logline2>some data too</logline2>
</logmessage>
20120124 07:30:15.302, processing message
<logmessage>
<logline1>some data</logline1>
<logline2>some data too</logline2>
</logmessage>
I want to split it into multiple files, each containing one logmessage, and I don’t want to keep any data outside the root node. How can I do this?
Be careful what you wish for, and consider the consequences. If this is a very large XML file, as you have stated, splitting it will create a very large number of small files in your directory. That can be bad in several ways. Each file takes up at least one filesystem block, and block sizes can be large on today's massive filesystems. On Linux, each file also consumes an inode, which is a finite resource; use df -i to check that you have enough available. Finally, some file systems have a limit on the number of files per directory, or begin to perform poorly when too many files are created in the same directory.
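If you would rather check inode headroom from a script than read the df -i output, something like the following should work on Unix-like systems. It is only a sketch: the function name `inode_usage` is made up for illustration, and some filesystems allocate inodes dynamically and may report 0 here.

```python
import os

def inode_usage(path="."):
    """Return (total_inodes, free_inodes) for the filesystem holding `path`.

    Unix only: os.statvfs is not available on Windows.
    """
    stats = os.statvfs(path)
    # f_files  = total number of inodes on the filesystem
    # f_favail = inodes available to unprivileged users
    return stats.f_files, stats.f_favail
```

Compare the second value with the number of files you expect to create before you start the split.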
The following will tell you how many files will be created:
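A minimal Python sketch of such a count, assuming every `<logmessage>` opening tag sits on its own line and is preceded by a timestamp line in the format shown in the question. The name `count_outputs` and the regular expression are illustrative assumptions. It reports both the number of messages and the number of distinct timestamps, since the file count depends on whether messages sharing a timestamp end up in the same file.

```python
import re

# Assumed timestamp line, e.g. "20120124 07:30:15.301, saving to queue"
TIMESTAMP_RE = re.compile(r"^(\d{8} \d{2}:\d{2}:\d{2}\.\d{3}),")

def count_outputs(lines):
    """Return (message_count, distinct_timestamp_count) for an iterable of lines."""
    messages = 0
    timestamps = set()
    current = "unknown"  # label for a message seen before any timestamp line
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match:
            current = match.group(1)
        elif line.strip() == "<logmessage>":
            messages += 1
            timestamps.add(current)
    return messages, len(timestamps)
```

Pass it an open file object, for example `count_outputs(open("big.log", encoding="utf-8"))`, so the input is read one line at a time; the file name here is a placeholder.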
The following will create a new file using the date and time for the file name with a .xml extension. If multiple messages have the same timestamp they will be appended.
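One way to do this in Python, as a sketch under the same assumptions about the input (timestamp line first, `<logmessage>` and `</logmessage>` each on their own line). The function name `split_log`, the output file-name pattern, and the `unknown.xml` fallback are my own choices for illustration. Colons are dropped from the time part because they are awkward in file names on some systems.

```python
import os
import re

# Assumed timestamp line, e.g. "20120124 07:30:15.301, saving to queue"
TIMESTAMP_RE = re.compile(r"^(\d{8}) (\d{2}):(\d{2}):(\d{2})\.(\d{3}),")

def split_log(lines, out_dir):
    """Stream over `lines`, appending each <logmessage> block to <timestamp>.xml."""
    os.makedirs(out_dir, exist_ok=True)
    name = "unknown"  # used if a message appears before any timestamp line
    out = None        # open file handle while inside a <logmessage> block
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if out is None and match:
            # "20120124 07:30:15.301" becomes "20120124_073015.301"
            name = "{}_{}{}{}.{}".format(*match.groups())
        elif out is None and line.strip() == "<logmessage>":
            # Append mode: messages with the same timestamp share one file
            out = open(os.path.join(out_dir, name + ".xml"), "a", encoding="utf-8")
            out.write(line)
        elif out is not None:
            out.write(line)
            if line.strip() == "</logmessage>":
                out.close()
                out = None
        # Any other line outside a block is dropped, as the question asks
    if out is not None:  # input ended inside an unterminated block
        out.close()
```

Only the current line is held in memory, so the size of the input does not matter. Note that a file that receives more than one appended message will contain several root elements and is no longer a single well-formed XML document.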
Also be aware that many XML libraries will try to load the full file into memory, which can be a problem for a very large XML file. This procedure does not attempt to load the whole file into memory. If your file is too large to fit in memory, do not accept any solution that uses an XML parser unless it is SAX-based or otherwise streaming. A DOM parser requires memory equal to your document size multiplied by an overhead factor.