I’m new to Hadoop MapReduce (4 days to be precise) and I’ve been asked to perform distributed XML parsing on a cluster. As per my (re)search on the Internet, it should be fairly easy using Mahout’s XmlInputFormat, but my task is to make sure that the system works for huge (~5TB) XML files.
As far as I know, the file splits sent to the mappers cannot be larger than the HDFS block size (or the per-job block size). [Correct me if I’m mistaken.]
The issue I’m facing is that some XML elements are large (~200MB) and some are small (~1MB).
So my question is: what happens when the XML element chunk created by XmlInputFormat is bigger than the block size? Will it send the entire large element (say 200MB) to a single mapper, or will it send the element out in pieces (64+64+64+8)?
I currently don’t have access to the company’s Hadoop cluster (and won’t for some time), so I cannot run a test and find out. Kindly help me out.
So to clear some things up:
Mahout’s XmlInputFormat will process XML files and extract the XML between two configured start / end tags. So if your XML looks like the following:
and you’ve configured the start / end tags to be <person> and </person>, then your mapper will be passed the following <LongWritable, Text> pair to its map method:
What you do with this data in your mapper is then up to you.
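As a rough illustration of the extraction described above, here is a minimal Python sketch (not Hadoop code). The sample XML, the function name `extract_records`, and the choice of the element's starting byte offset as the key are all assumptions made for the example; the real record reader may report a different position as the key.

```python
def extract_records(data: bytes, start_tag: bytes, end_tag: bytes):
    """Yield (byte_offset, text) for each start_tag ... end_tag span in data."""
    pos = 0
    while True:
        start = data.find(start_tag, pos)
        if start == -1:
            return  # no further start tag
        end = data.find(end_tag, start + len(start_tag))
        if end == -1:
            return  # unterminated element: no record is emitted
        end += len(end_tag)
        yield start, data[start:end].decode("utf-8")
        pos = end


# Hypothetical input, invented purely for illustration.
sample = (
    b"<people>"
    b"<person><name>Alice</name></person>"
    b"<person><name>Bob</name></person>"
    b"</people>"
)

records = list(extract_records(sample, b"<person>", b"</person>"))
for offset, text in records:
    print(offset, text)
```

Each yielded pair plays the role of one <LongWritable, Text> call to the map method: one complete element per call, with the surrounding <people> wrapper discarded.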
With regard to splits, XmlInputFormat extends TextInputFormat, so if your input file is splittable (i.e. uncompressed or compressed with a splittable codec such as bzip2), then the file will be processed by one or more mappers as follows:
mapred.max.split.size=10485760), then you’ll get 5 map tasks to process the file
When the file is split up into these block or split sized chunks, the XmlInputFormat will seek to the byte address/offset of the block / split boundary and then scan forwards until it either finds the configured XML start tag or reaches the byte address of the end of the block/split. If it finds the start tag, it will then consume data until it finds the end tag (or the end of the file). If it finds the end tag, a record will be passed to your mapper; otherwise your mapper will not receive any input. To emphasize, the record reader may scan past the end of the block / split when trying to find the end tag, but will only do this if it has found a start tag; otherwise scanning stops at the end of the block/split.
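The boundary behaviour described above can be simulated in a few lines of plain Python. This is only a sketch under stated assumptions, not the Mahout implementation: the names `compute_splits` and `read_split` are hypothetical, sizes are scaled down from megabytes to bytes, and the 1.1 slop factor reflects my understanding of how Hadoop's FileInputFormat sizes the last split, so treat it as an assumption.

```python
def compute_splits(file_size, split_size, slop=1.1):
    """Cut [0, file_size) into (start, end) ranges of split_size bytes.

    The slop factor (an assumption here) lets the final split absorb a
    remainder slightly larger than split_size.
    """
    splits = []
    offset = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits.append((offset, offset + split_size))
        offset += split_size
        remaining -= split_size
    if remaining > 0:
        splits.append((offset, offset + remaining))
    return splits


def read_split(data, start, end, start_tag, end_tag):
    """Return the records whose start tag begins inside [start, end)."""
    records = []
    pos = start
    while pos < end:
        tag_at = data.find(start_tag, pos)
        if tag_at == -1 or tag_at >= end:
            break  # no start tag before the split boundary: stop scanning
        close_at = data.find(end_tag, tag_at + len(start_tag))
        if close_at == -1:
            break  # end of file reached without an end tag: no record
        close_at += len(end_tag)  # may lie well past `end`
        records.append(data[tag_at:close_at])
        pos = close_at
    return records


# Scaled-down stand-ins for a ~1MB and a ~200MB element.
small = b"<person>" + b"s" * 4 + b"</person>"
large = b"<person>" + b"L" * 200 + b"</person>"
data = b"<people>" + small + large + small + b"</people>"

splits = compute_splits(len(data), 64)
per_split = [read_split(data, s, e, b"<person>", b"</person>") for s, e in splits]
for (s, e), recs in zip(splits, per_split):
    print(s, e, [len(r) for r in recs])
```

In this run the large element starts inside the first split, so the first reader consumes it whole even though it extends across several later splits; the readers for those later splits find no start tag in their own range and emit nothing. No element is cut into pieces and none is delivered twice.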
So to (eventually) answer your question: if you haven’t configured a mapper (and are using the default, or identity, mapper), then yes, it doesn’t matter how big the XML chunk is (MBs, GBs, TBs!), it will be sent to the reducer.
I hope this makes sense.
EDIT
To follow up on your comments: