The Archive Base


Editorial Team
Asked: June 15, 2026

By default, Hadoop splits the files to be processed by a Mapper on the


By default, Hadoop splits the files to be processed by a Mapper on the file’s block boundaries; that is what the FileInputFormat implementation of getSplits() does. Hadoop then tries to run each Mapper on a DataNode that holds a replica of the blocks it will process.
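To make the block-boundary splitting concrete, here is a minimal sketch of the idea in plain Java. It is a simplification and not Hadoop's code: the real FileInputFormat also takes configured minimum and maximum split sizes into account. The class name, the method name, and the 128 MB block size are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockAlignedSplits {
    // Returns {start, length} pairs, one per block-sized chunk of the file.
    // Simplified: ignores min/max split size settings.
    static List<long[]> splits(long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<long[]> result = new ArrayList<>();
        for (long start = 0; start < fileLength; start += blockSize) {
            // The last chunk may be shorter than a full block.
            long length = Math.min(blockSize, fileLength - start);
            result.add(new long[] {start, length});
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed example: a 300 MB file stored in 128 MB blocks.
        for (long[] s : splits(300 * mb, 128 * mb)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Note that the split boundaries are pure byte offsets, so nothing guarantees that a record starts exactly at the beginning of a split.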

Now I’m wondering: if I need to read outside of this InputSplit (in a RecordReader, but that’s irrelevant), what does this cost me compared to reading inside the InputSplit, assuming that the data outside of it is not present on the DataNode doing the reading?

EDIT:

In other words:
I am a RecordReader and have been assigned an InputSplit that spans one file block. I have a local copy of this file block (or rather, the DataNode I’m running on does), but not the rest of the file. I do need to read outside of this InputSplit, because I need to read the file header, which is at the very beginning. Then I need to skip across the records in the file, by reading just each record’s header, which tells me how long the record is, and then skipping that number of bytes. I need to do this until I encounter the first record that lies inside the InputSplit. Then I can start reading the actual records within my InputSplit. That is the only way to make sure that I start at a valid record boundary.
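The skipping procedure described above can be sketched in plain Java as follows. The file layout is hypothetical, since the question does not specify one: an 8-byte file header, then records that each consist of a 4-byte big-endian payload length followed by the payload. The class and method names are also made up for illustration. Against HDFS, the stream would be the one returned by FileSystem.open(), and the skip would typically be done with FSDataInputStream.seek().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

final class RecordBoundarySketch {
    static final int FILE_HEADER_BYTES = 8;   // hypothetical file header size
    static final int RECORD_HEADER_BYTES = 4; // hypothetical: one int holding the payload length

    // Returns the offset of the first record starting at or after splitStart, or -1 if none.
    static long firstRecordAtOrAfter(DataInputStream in, long fileLength, long splitStart)
            throws IOException {
        skipFully(in, FILE_HEADER_BYTES);
        long pos = FILE_HEADER_BYTES;
        while (pos < fileLength) {
            if (pos >= splitStart) {
                return pos; // valid record boundary inside (or at the start of) the split
            }
            int payloadLength = in.readInt(); // read only the record header
            if (payloadLength < 0) {
                throw new IOException("corrupt record header at offset " + pos);
            }
            skipFully(in, payloadLength);     // jump over the payload without decoding it
            pos += RECORD_HEADER_BYTES + payloadLength;
        }
        return -1;
    }

    // skipBytes may skip fewer bytes than requested, so loop; no progress is treated as end of data.
    static void skipFully(DataInputStream in, long n) throws IOException {
        while (n > 0) {
            int skipped = in.skipBytes((int) Math.min(n, Integer.MAX_VALUE));
            if (skipped <= 0) {
                throw new EOFException("unexpected end of data while skipping");
            }
            n -= skipped;
        }
    }

    // Convenience overload for in-memory data, used by the demo below.
    static long firstRecordAtOrAfter(byte[] file, long splitStart) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            return firstRecordAtOrAfter(in, file.length, splitStart);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a file in the hypothetical layout with zero-filled payloads of the given lengths.
    static byte[] sampleFile(int... payloadLengths) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.write(new byte[FILE_HEADER_BYTES]);
            for (int len : payloadLengths) {
                out.writeInt(len);
                out.write(new byte[len]);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = sampleFile(10, 20, 30); // records start at offsets 8, 22 and 46
        System.out.println(firstRecordAtOrAfter(file, 30));
    }
}
```

In this scheme the work before the split grows with the number of records preceding it: one header read and one skip per record, regardless of how large the payloads are.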

Question: When I read outside of the InputSplit, when is the data from the non-local file blocks copied? Is this done one byte at a time (i.e. once per call to InputStream.read()), or is the entire file block at the current InputStream position copied to my local DataNode as soon as I call InputStream.read(), and then again when I reach the next non-local file block, and so on? I need to know this so that I can estimate how much overhead skipping through the file will produce.

Thanks 🙂


1 Answer

  1. Editorial Team
    Added an answer on June 15, 2026 at 10:54 pm

    To the best of my understanding, if the data does not reside on the local DataNode, the local DataNode is not involved in reading it at all. The HDFS client asks the NameNode where the blocks are located and then talks directly to the relevant DataNodes to fetch the data.
    So the cost is: on the remote DataNode, read from disk, calculate the CRC, and send the data over the network; in the code reading the data, receive it from the network.
    I think that, cluster-wide, the price is only the network bandwidth plus some CPU spent on sending and receiving.
