Parsing a few XML files (on the order of 50 GB) to TSV: SAXParser or Hadoop?


Asked: June 11, 2026 (2026-06-11T02:21:28+00:00)


I need to parse a few XML files to TSV. The files are on the order of 50 GB, and I am unsure which implementation to choose. I have two options:

Use SAXParser
Use Hadoop

I have a fair idea of how to implement this with SAXParser, but since I have access to a Hadoop cluster, I think I should use Hadoop, as big data is what Hadoop is for.

It would be great if someone could provide a hint or documentation on how to do this in Hadoop, or an efficient SAXParser implementation for such a big file. Which should I go for, Hadoop or SAXParser?


1 Answer

Editorial Team · Answer 1 · 2026-06-11T02:21:30+00:00

I process large XML files in Hadoop quite regularly. I have found it to be the best way (though not the only one; the alternative is to write SAX code), since you can still operate on the records in a DOM-like fashion.
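To illustrate what "DOM-like" access to one record at a time can look like without loading a whole file, here is a minimal single-machine sketch using Python's `xml.etree.ElementTree.iterparse`. It is not the code from this answer. The record tag `record` and the field names `id` and `name` are hypothetical, and the sketch assumes records are direct children of the root element.

```python
import csv
import xml.etree.ElementTree as ET


def xml_to_tsv(source, out, record_tag="record", fields=("id", "name")):
    """Stream `source` and write one TSV row per <record_tag> element.

    `record_tag` and `fields` are placeholder names for illustration only.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            # The record is fully built here, so it can be queried like a small tree.
            writer.writerow([(elem.findtext(f) or "").strip() for f in fields])
            count += 1
            # Assumes records sit directly under the root: dropping them
            # keeps memory use roughly constant regardless of file size.
            root.clear()
    return count
```

`source` may be a file path or a binary file object, and `out` any text file object opened for writing. Each record is handled as a small tree and then discarded, which is the property that makes this style workable on very large inputs.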

With files this large, you will almost certainly want to enable compression on the mapper output (see the question "Hadoop, how to compress mapper output but not the reducer output"); this speeds things up quite a bit.
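As a rough sketch of what enabling mapper-output compression can look like for a Hadoop Streaming job, the helper below only assembles the command line. The property names are the Hadoop 2+ ones (older releases used `mapred.compress.map.output`). The jar path, script names, and the choice of Snappy are assumptions, and the codec has to be available on the cluster.

```python
def build_streaming_command(streaming_jar, input_path, output_path,
                            mapper="mapper.py", reducer="reducer.py",
                            codec="org.apache.hadoop.io.compress.SnappyCodec"):
    """Return an argv list for a streaming job with compressed map output.

    All paths and script names are hypothetical placeholders.
    """
    return [
        "hadoop", "jar", streaming_jar,
        # Generic -D options must come before the streaming-specific options.
        "-D", "mapreduce.map.output.compress=true",
        "-D", f"mapreduce.map.output.compress.codec={codec}",
        # Leave the final (reducer) output uncompressed.
        "-D", "mapreduce.output.fileoutputformat.compress=false",
        "-files", f"{mapper},{reducer}",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]
```

The returned list can be passed to `subprocess.run(...)` or printed with `shlex.join(...)` for inspection before submitting the job.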

I've written a quick outline of how I've handled all this; maybe it will help: http://davidvhill.com/article/processing-xml-with-hadoop-streaming. I use Python and ElementTree, which makes things really simple.
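For readers who cannot open the link, the following is a hedged sketch of what a Hadoop Streaming mapper built on Python and ElementTree might look like. It is not taken from the linked article. The tag `record` and the fields `id` and `name` are hypothetical, and the sketch assumes records are not nested and that at most one record ends on any input line. It also does not handle a record that straddles an input split boundary; a record-aware input reader (Hadoop Streaming documents a `StreamXmlRecord` reader) would be needed for that.

```python
import re
import sys
import xml.etree.ElementTree as ET


def iter_records(lines, tag="record"):
    """Yield the text of each <tag>...</tag> block from an iterable of lines."""
    open_re = re.compile(r"<%s[\s>]" % re.escape(tag))
    close_tag = "</%s>" % tag
    buffer = []
    for line in lines:
        if not buffer:
            match = open_re.search(line)
            if match is None:
                continue  # still outside any record
            line = line[match.start():]
        buffer.append(line)
        if close_tag in line:
            text = "".join(buffer)
            yield text[:text.find(close_tag) + len(close_tag)]
            buffer = []


def record_to_row(xml_text, fields=("id", "name")):
    """Parse one record and return it as a tab-separated line (no newline)."""
    elem = ET.fromstring(xml_text)
    # Collapse internal whitespace so tabs or newlines cannot break the TSV.
    return "\t".join(" ".join((elem.findtext(f) or "").split()) for f in fields)


def run_mapper(stdin, stdout):
    """Read XML lines from stdin and write one TSV row per record."""
    for xml_text in iter_records(stdin):
        try:
            stdout.write(record_to_row(xml_text) + "\n")
        except ET.ParseError:
            continue  # skip malformed records instead of failing the task
```

Deployed as a streaming mapper script, the file would end with a call such as `run_mapper(sys.stdin, sys.stdout)`.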
