I need to parse few XML’s to TSV, the Size of the XML Files is of the order of 50 GB, I am basically doubtful about the implemetation i should choose to parse this i have two oprions
- using SAXParser
- use Hadoop
i have a fair bit of idea about SAXParser implementaion but i think having access to Hadoop cluster, i should use Hadoop as this is what hadoop is for i.e. Big Data
it would be great someone could provide a hint/doc as how to do this in Hadoop or efficient SAXParser implementaion for such a big file or rather what should i go for Hadoop or SAXparser?
I process large XML files in Hadoop quite regularly. I found it to be the best way (not the only way… the other is to write SAX code) since you can still operate on the records in a dom-like fashion.
With these large files, one thing to keep in mind is that you’ll most definitely want to enable compression on the mapper output: Hadoop, how to compress mapper output but not the reducer output… this will speed things up quite a bit.
I’ve written a quick outline of how I’ve handled all this, maybe it’ll help: http://davidvhill.com/article/processing-xml-with-hadoop-streaming. I use Python and Etrees which makes things really simple….