I have the IMDB collection from INEX, which consists of a few million XML files in a few thousand directories, structured like this:
- actors
-- 000
--- person_1000.xml
--- ...
-- 001
--- person_1001.xml
--- ...
...
- movies
-- 000
--- 10000.xml
--- ...
...
I need to convert these files to TRECTEXT format, which is:
<DOC>
<DOCNO> document_number </DOCNO>
<TEXT> XML file goes here. </TEXT>
</DOC>
Here document_number should be the file name without its extension, e.g. person_1000.xml -> person_1000, and the contents of the XML file should be wrapped in <TEXT> tags.
I assume I need a script that wraps every XML file in the collection in <DOC>, <DOCNO> and <TEXT> tags as shown above and overwrites the original file. Could you help me, please?
I’m not familiar with TRECTEXT format, but here’s a one-liner using Perl that should do what you want:
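For comparison, the same idea can be sketched in Python. This is not the Perl one-liner, only a rough equivalent under a few assumptions: every file to convert ends in .xml, each file fits in memory, and a file that already starts with <DOC> has been converted and should be skipped. The names wrap_file, wrap_collection and backup_ext are made up for this illustration.

```python
import shutil
from pathlib import Path


def wrap_file(path, backup_ext=".bak"):
    """Wrap one XML file in TRECTEXT tags, overwriting it in place.

    Returns True if the file was rewritten, False if it was skipped.
    """
    path = Path(path)
    content = path.read_bytes()  # bytes, so the file's encoding is left untouched

    # Assumption: a file that already starts with <DOC> was wrapped earlier.
    if content.lstrip().startswith(b"<DOC>"):
        return False

    # Keep a copy of the original next to it, e.g. person_1000.xml.bak.
    if backup_ext:
        shutil.copy2(path, path.with_name(path.name + backup_ext))

    docno = path.stem  # person_1000.xml -> person_1000
    header = f"<DOC>\n<DOCNO> {docno} </DOCNO>\n<TEXT>\n".encode("utf-8")
    footer = b"\n</TEXT>\n</DOC>\n"
    path.write_bytes(header + content.rstrip() + footer)
    return True


def wrap_collection(root, backup_ext=".bak"):
    """Wrap every *.xml file below root; return how many files were rewritten."""
    rewritten = 0
    for xml_path in Path(root).rglob("*.xml"):
        if wrap_file(xml_path, backup_ext):
            rewritten += 1
    return rewritten
```

Calling wrap_collection("path/to/collection") walks both the actors and movies trees. Passing backup_ext=None skips the backup copies. It is worth trying this on a copy of a single directory first, since the files are overwritten.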
Obviously, remove the .bak extension if you don’t wish to keep any backup files. Please let me know if you have any problems. Cheers.
Update, as per comments: