I’m new to Hadoop MapReduce (4 days to be precise) and I’ve been asked

Asked: June 12, 2026

I’m new to Hadoop MapReduce (4 days to be precise) and I’ve been asked to perform distributed XML parsing on a cluster. As per my (re)search on the Internet, it should be fairly easy using Mahout’s XmlInputFormat, but my task is to make sure that the system works for huge (~5TB) XML files.

As far as I know, the file splits sent to the mappers cannot be larger than the HDFS block size (or the per-job block size). (Correct me if I’m mistaken.)

The issue I’m facing is that some XML elements are large (~200MB) and some are small (~1MB).

So my question is: what happens when the XML element chunk created by XmlInputFormat is bigger than the block size? Will it send the entire large element (say 200MB) to one mapper, or will it send out the element in four splits (64+64+64+8)?

I currently don’t have access to the company’s Hadoop cluster (and won’t for some time), so I cannot run a test and find out. Kindly help me out.

1 Answer

Editorial Team · Answer 1 · 2026-06-12T03:44:31+00:00

To clear a few things up:

Mahout’s XmlInputFormat will process XML files and extract the XML between two configured start / end tags. So if your XML looks like the following:

<main>
  <person>
    <name>Bob</name>
    <dob>1970/01/01</dob>
  </person>
</main>

and you’ve configured the start / end tags to be <person> and </person>, then your mapper will be passed the following <LongWritable, Text> pair to its map method:

LongWritable: 10
Text: "<person>\n    <name>Bob</name>\n    <dob>1970/01/01</dob>\n  </person>"

What you do with this data in your mapper is then up to you.

With regard to splits, XmlInputFormat extends TextInputFormat, so if your input file is splittable (i.e. uncompressed, or compressed with a splittable codec such as bzip2), then the file will be processed by one or more mappers as follows:

If the input file size (let’s say 48MB) is less than a single block in HDFS (let’s say 64MB), and you don’t configure the min / max split size properties, then you’ll get a single mapper to process the file
If, as above, the file is 48MB but you configure the max split size to be 10MB (mapred.max.split.size=10485760), then you’ll get 5 map tasks to process the file
If the file is bigger than the block size, then you’ll get a map task for each block or, if the max split size is configured, a map task for each split-sized chunk of the file
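The three cases above can be checked with a little arithmetic. What follows is a minimal sketch in Python, not Hadoop's actual code: it assumes the split size is max(minSize, min(maxSize, blockSize)) and that the last split is allowed to run up to 10% over, which is how FileInputFormat behaves as far as I know. The function name compute_splits and its parameters are made up for illustration.

```python
# Simplified model of how a file is divided into input splits.
# Illustration only; names and defaults here are assumptions.

MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the final split may be up to 10% larger than split_size


def compute_splits(file_size, block_size, min_size=1, max_size=None):
    """Return a list of (offset, length) byte ranges for one file."""
    if max_size is None:
        max_size = block_size  # no max configured: the block size wins
    split_size = max(min_size, min(max_size, block_size))

    splits = []
    offset = 0
    remaining = file_size
    # Cut full-sized splits while more than ~1.1 splits' worth remains.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    # Whatever is left becomes the last (possibly short) split.
    if remaining > 0:
        splits.append((offset, remaining))
    return splits


if __name__ == "__main__":
    print(len(compute_splits(48 * MB, 64 * MB)))                    # 48MB file, 64MB block
    print(len(compute_splits(48 * MB, 64 * MB, max_size=10 * MB)))  # same file, 10MB max split
    print([length // MB for _, length in compute_splits(200 * MB, 64 * MB)])
```

Note that these are byte ranges only. Nothing at this stage knows where an XML element starts or ends; that is the record reader's job.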

When the file is divided into these block- or split-sized chunks, XmlInputFormat seeks to the byte offset at which each split starts and then scans forward until it either finds the configured XML start tag or reaches the byte offset at which the split ends. If it finds the start tag, it consumes data until it finds the end tag (or reaches the end of the file). If it finds the end tag, a record is passed to your mapper; otherwise your mapper receives no input. To emphasize: the reader may scan past the end of the block / split while looking for the end tag, but only if it has already found a start tag; otherwise scanning stops at the end of the block / split.
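To make the scanning rule concrete, here is a rough simulation in Python. It is not Mahout's implementation, just a sketch of the behaviour described above working on an in-memory byte string. The names read_records and records_for_all_splits are hypothetical, the key it yields is simply the offset of the start tag (a choice made for this sketch), and it assumes the configured element is not nested inside itself.

```python
def read_records(data, start, end, start_tag, end_tag):
    """Yield (offset, record) pairs for the split covering bytes [start, end)."""
    pos = start
    while True:
        # Only a start tag that begins inside this split belongs to it.
        tag_at = data.find(start_tag, pos)
        if tag_at == -1 or tag_at >= end:
            return
        # Once a start tag is found, keep reading even past the split end.
        close_at = data.find(end_tag, tag_at + len(start_tag))
        if close_at == -1:
            return  # end of file reached before the end tag: no record
        record_end = close_at + len(end_tag)
        yield tag_at, data[tag_at:record_end]
        pos = record_end


def records_for_all_splits(data, split_size, start_tag, end_tag):
    """Run one simulated reader per split and collect every record emitted."""
    out = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        for _, record in read_records(data, start, end, start_tag, end_tag):
            out.append(record)
    return out
```

With a split size much smaller than an element, the split in which the start tag begins emits the whole element, and the splits that fall entirely inside that element emit nothing for it. Every element therefore reaches exactly one mapper, in one piece.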

So to (eventually) answer your question: it doesn’t matter how big the XML chunk is (MBs, GBs, TBs!), the whole element is sent to a single mapper as one record. If you haven’t configured a mapper (and are using the default, or identity mapper as it’s also known), that record is then passed through unchanged to the reducer.

I hope this makes sense.

EDIT

To follow up on your comments:

Yes, each mapper will attempt to process its split (range of bytes) of the file
Yes, regardless of what you set the max split size to, your mapper will receive records which represent the data between (and including) the start / end tags. The person element will not be split up no matter what its size is (obviously if there are GBs of data between the start and end tags, you’ll most probably run out of memory trying to buffer it into a Text object)
Continuing from the above, your data will never be split up between the start and end tags; a person element will be sent in its entirety to a mapper, so you should always be OK using something like a SAX parser to further process it without fear that you’re only seeing a portion of the person element.
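As an illustration of that last point, here is a small sketch of SAX-parsing a single record, written in Python with the standard xml.sax module rather than in a real Hadoop mapper. PersonHandler and parse_person are hypothetical names, and the record is assumed to be one complete person element like the one in the example above.

```python
import xml.sax


class PersonHandler(xml.sax.ContentHandler):
    """Collects the text of each leaf element inside one <person> record."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # name of the element currently being read
        self._chunks = []     # text pieces seen inside that element

    def startElement(self, name, attrs):
        self._current = name
        self._chunks = []

    def characters(self, content):
        # SAX may deliver text in several pieces, so accumulate them.
        if self._current is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == self._current:
            self.fields[name] = "".join(self._chunks).strip()
        self._current = None


def parse_person(record):
    """Parse one complete <person>...</person> record given as bytes."""
    handler = PersonHandler()
    xml.sax.parseString(record, handler)
    return handler.fields
```

Because the record always contains the matching end tag, the parser never sees a truncated element. The memory caveat still applies: the whole record has to fit in memory before it can be handed to the parser.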
