Editorial Team
Asked: June 12, 2026 (2026-06-12T03:44:30+00:00)

I’m new to Hadoop MapReduce (4 days to be precise) and I’ve been asked to perform distributed XML parsing on a cluster. As per my (re)search on the Internet, it should be fairly easy using Mahout’s XmlInputFormat, but my task is to make sure that the system works for huge (~5TB) XML files.

As far as I know, the file splits sent to the mappers cannot be larger than the HDFS block size (or the per-job block size). [Correct me if I’m mistaken.]

The issue I’m facing is that some XML elements are large (~200MB) and some are small (~1MB)

So my question is: What happens when the XML element chunk created by XmlInputFormat is bigger than the block size? Will it send the entire large element (say 200MB) to one mapper, or will it send the element out in four splits (64+64+64+8)?

I currently don’t have access to the company’s Hadoop cluster (and won’t for some time), so I cannot run a test and find out. Kindly help me out.

1 Answer

    Editorial Team
    Added an answer on June 12, 2026 at 3:44 am (2026-06-12T03:44:31+00:00)

    So, to clear some things up:

    Mahout’s XmlInputFormat will process XML files and extract the XML between two configured start / end tags. So if your XML looks like the following:

    <main>
      <person>
        <name>Bob</name>
        <dob>1970/01/01</dob>
      </person>
    </main>
    

    and you’ve configured the start / end tags to be <person> and </person>, then your mapper will be passed the following <LongWritable, Text> pair to its map method:

    LongWritable: 10
    Text: "<person>\n    <name>Bob</name>\n    <dob>1970/01/01</dob>\n  </person>"
    

    What you do with this data in your mapper is then up to you.

    With regard to splits, XmlInputFormat extends TextInputFormat, so if your input file is splittable (i.e. uncompressed, or compressed with a splittable codec such as bzip2; a raw Snappy file is not splittable), then the file will be processed by one or more mappers as follows:

    1. If the input file size (let’s say 48MB) is less than a single block in HDFS (let’s say 64MB), and you don’t configure the min / max split size properties, then you’ll get a single mapper to process the file
    2. As above, but if you configure the max split size to be 10MB (mapred.max.split.size=10485760), then you’ll get 5 map tasks to process the file
    3. If the file is bigger than the block size, then you’ll get a map task for each block or, if the max split size is configured, a map task for each split-sized part of the file
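The three cases above can be checked with a little arithmetic. The following is a minimal sketch, assuming the usual FileInputFormat sizing rule (split size = max(minSize, min(maxSize, blockSize)), with the last split allowed to run about 10% over). The class and method names are hypothetical, and it ignores everything else the real input format does (multiple files, block locations).

```java
// Hypothetical helper, not part of Hadoop: estimates how many splits
// (and therefore map tasks) a single splittable file would produce.
class SplitSketch {
    // Assumed slop factor: the final split may be up to 10% larger than the
    // target size rather than leaving a tiny trailing split.
    static final double SPLIT_SLOP = 1.1;

    // Target split size given the block size and the min / max split settings.
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static int countSplits(long fileSize, long blockSize, long minSize, long maxSize) {
        long size = splitSize(blockSize, minSize, maxSize);
        long remaining = fileSize;
        int splits = 0;
        // Carve off full-sized splits while more than ~1.1 splits' worth remains.
        while ((double) remaining / size > SPLIT_SLOP) {
            splits++;
            remaining -= size;
        }
        // Whatever is left becomes the final split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Case 1: 48MB file, 64MB block, no max split size configured.
        System.out.println(countSplits(48 * mb, 64 * mb, 1L, Long.MAX_VALUE)); // 1
        // Case 2: same file, max split size 10MB (10485760 bytes).
        System.out.println(countSplits(48 * mb, 64 * mb, 1L, 10 * mb));        // 5
        // Case 3: 200MB file, 64MB block (64+64+64+8).
        System.out.println(countSplits(200 * mb, 64 * mb, 1L, Long.MAX_VALUE)); // 4
    }
}
```

Note that this only counts byte ranges; it says nothing yet about where XML records begin and end inside those ranges.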

    When the file is split up into these block- or split-sized chunks, the XmlInputFormat will seek to the byte offset at which its split starts and then scan forwards until it either finds the configured XML start tag or reaches the byte offset of the end of the split. If it finds the start tag, it will then consume data until it finds the end tag (or the end of the file). If it finds the end tag, a record will be passed to your mapper; otherwise your mapper will not receive any input. To emphasize, the map task may scan past the end of its block / split when trying to find the end tag, but it will only do this if it has already found a start tag; otherwise scanning stops at the end of the block / split.
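The scanning behaviour described above can be imitated over an in-memory byte array. This is only a sketch under assumptions: the class and method names are hypothetical, it is not Mahout's code, and it simplifies the rule to "a record belongs to the split in which its start tag begins".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of a start/end-tag record reader over one split.
class XmlSplitSketch {
    // Returns the records a reader assigned the byte range [start, end) would emit.
    static List<String> recordsForSplit(byte[] data, int start, int end,
                                        String startTag, String endTag) {
        byte[] open = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] close = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = start;
        while (pos < end) {
            // Look for a start tag; give up if it begins at or after the split end.
            int from = indexOf(data, open, pos);
            if (from < 0 || from >= end) {
                break;
            }
            // A start tag was found, so keep reading until the end tag,
            // even if that means reading past the split boundary.
            int closeAt = indexOf(data, close, from + open.length);
            if (closeAt < 0) {
                break; // end of file without an end tag: no record
            }
            int recordEnd = closeAt + close.length;
            records.add(new String(data, from, recordEnd - from, StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }

    // Naive byte-array search; returns -1 when the pattern is not found.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        for (int i = Math.max(from, 0); i + pattern.length <= data.length; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] doc = "<main><person>A</person><person>BBBBBBBB</person></main>"
                .getBytes(StandardCharsets.UTF_8);
        // Two splits whose boundary (byte 28) falls inside the second <person> element.
        System.out.println(recordsForSplit(doc, 0, 28, "<person>", "</person>"));
        System.out.println(recordsForSplit(doc, 28, doc.length, "<person>", "</person>"));
    }
}
```

In this example the second element starts before byte 28, so the first split reads past its boundary and emits both elements whole, while the second split finds no start tag and emits nothing.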

    So to (eventually) answer your question: no, the element is not cut up at block boundaries. It doesn’t matter how big the XML chunk is (MBs, GBs, TBs!), it will be passed to a single mapper as one record, and if you haven’t configured a mapper (and are using the default, or identity, mapper as it’s also known), it will be sent on unchanged to the reducer.

    I hope this makes sense.

    EDIT

    To follow up on your comments:

    1. Yes, each mapper will attempt to process its split (range of bytes) of the file
    2. Yes, regardless of what you set the max split size to, your mapper will receive records which represent the data between (and including) the start / end tags. The person element will not be split up no matter what its size is (obviously, if there are GBs of data between the start and end tags, you’ll most probably run out of memory trying to buffer it into a Text object)
    3. Continuing from the above, your data will never be split up between the start and end tags; a person element will be sent in its entirety to a mapper, so you should always be OK using something like a SAX parser to further process it, without fear that you’re only seeing a portion of the person element.
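As a rough illustration of point 3, a complete person record can be handed to a SAX parser from the JDK. This is a sketch only: the class and method names are hypothetical, the record is passed in as a plain String (in a real mapper it would presumably come from the Text value), and it pulls out just the name element from the example above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: extracts the text of <name> from one complete <person> record.
class PersonRecordSketch {
    static String extractName(String recordXml) {
        final StringBuilder name = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inName = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("name".equals(qName)) {
                    inName = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("name".equals(qName)) {
                    inName = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // characters() may be called several times for one text node, so append.
                if (inName) {
                    name.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(recordXml)), handler);
        } catch (Exception e) {
            // A truncated record would fail here; a whole element should parse cleanly.
            throw new IllegalStateException("Record is not well-formed XML", e);
        }
        return name.toString();
    }

    public static void main(String[] args) {
        String record = "<person>\n    <name>Bob</name>\n    <dob>1970/01/01</dob>\n  </person>";
        System.out.println(extractName(record));
    }
}
```

Because each record is a whole element, it is well-formed on its own and the parser never sees a half-open tag. The memory caveat from point 2 still applies to very large elements.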