The Archive Base Latest Questions

Editorial Team
Asked: May 27, 2026

Hadoop writes a SequenceFile as key-value pairs (records). Suppose we have a large, unbounded log file. Hadoop will split the file based on block size and store the blocks on multiple data nodes. Is each key-value pair guaranteed to reside in a single block, or can the key end up in one block on node 1 and the value (or part of it) in a second block on node 2? If such meaningless splits can happen, what is the solution? Sync markers?

A second question: does Hadoop write sync markers automatically, or do we have to write them manually?


1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 10:12 am

    I asked this question on the Hadoop mailing list. They answered:

    Sync markers are written into sequence files already, they are part of
    the format. This is nothing to worry about – and is simple enough to
    test and be confident about. The mechanism is same as reading a text
    file with newlines – the reader will ensure reading off the boundary
    data in order to complete a record if it has to.
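
To make the newline analogy concrete, here is a minimal sketch in plain Python. It is a toy model and not Hadoop code: the function name `read_lines_in_split` and the sample data are made up for illustration, and I am assuming the convention that every split except the first throws away its first (possibly partial) line, while every split keeps reading until it has finished the line that crosses its end offset.

```python
def read_lines_in_split(data: bytes, start: int, end: int) -> list:
    """Return the lines owned by the byte range [start, end) of `data`.

    Toy model (assumption, not Hadoop's implementation):
    - a split that does not start at offset 0 discards everything up to and
      including the first newline at or after `start`;
    - a split keeps reading whole lines as long as the line begins at an
      offset <= `end`, even if the line itself runs past `end`.
    """
    pos = start
    if start != 0:
        newline = data.find(b"\n", start)
        if newline == -1:
            return []          # no line begins inside this split
        pos = newline + 1      # first line boundary owned by this split
    lines = []
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)  # last line without a trailing newline
        lines.append(data[pos:newline])
        pos = newline + 1
    return lines


def read_all_splits(data: bytes, split_size: int) -> list:
    """Read `data` split by split and concatenate what each split returns."""
    result = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        result.extend(read_lines_in_split(data, start, end))
    return result
```

Calling `read_all_splits(b"alpha\nbravo\ncharlie\ndelta\n", 7)` cuts the data in the middle of lines, yet each line should come back whole and only once, because the reader of one split finishes the line that the next split skips.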

    Then I asked:

    So if a map job analyses only the second block of the log file, it
    should not need to transfer any other parts of the file from other
    nodes, because that block is a standalone, meaningful split? Am I right?

    They answered:

    Yes. Simply put, your records shall never break. We do not read just
    at the split boundaries, we may extend beyond boundaries until a sync
    marker is encountered in order to complete a record or series of
    records. The subsequent mappers will always skip until their first
    sync marker, and then begin reading – to avoid duplication. This is
    exactly how text file reading works as well — only here, it is
    newlines.
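
The same idea with sync markers can be sketched as below. Again this is a toy model in plain Python, not the real SequenceFile layout: the marker value `SYNC`, the length-prefixed record framing, and the names `write_file` and `read_split` are all assumptions made for illustration, and the payloads are assumed never to contain the marker bytes. The writer inserts the markers by itself, and each reader skips to the first marker at or after its start offset, then keeps reading until it meets a marker at or beyond its end offset.

```python
import struct

# Hypothetical marker: a 4-byte escape followed by 16 fixed bytes.
SYNC = b"\xff\xff\xff\xff" + b"toy-sync-marker!"


def write_file(records, records_per_sync=2) -> bytes:
    """Serialize records as [4-byte length][payload], adding a sync marker
    before every `records_per_sync` records (the caller never writes one)."""
    out = bytearray()
    for i, rec in enumerate(records):
        if i % records_per_sync == 0:
            out += SYNC
        out += struct.pack(">I", len(rec)) + rec
    return bytes(out)


def read_split(data: bytes, start: int, end: int) -> list:
    """Return the records owned by the byte range [start, end).

    A group of records belongs to the split that contains the first byte
    of the sync marker preceding the group.
    """
    pos = data.find(SYNC, start)   # skip ahead to the first sync marker
    records = []
    while pos != -1 and pos < len(data):
        if data.startswith(SYNC, pos):
            if pos >= end:
                break              # this marker belongs to a later split
            pos += len(SYNC)
            continue
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        records.append(data[pos:pos + length])  # may extend past `end`
        pos += length
    return records


def read_all_splits(data: bytes, split_size: int) -> list:
    """Read every split in order and concatenate the results."""
    result = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        result.extend(read_split(data, start, end))
    return result
```

In this model a reader may pull bytes that lie beyond its own split to finish a record, which matches the "extend beyond boundaries" remark in the reply; in a real cluster that tail would have to come from whichever node holds the next block.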
