Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6808651
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 26, 20262026-05-26T19:58:56+00:00 2026-05-26T19:58:56+00:00

I have a file in which a set of every four lines represents a

  • 0

I have a file in which a set of every four lines represents a record.

eg, first four lines represent record1, next four represent record 2 and so on..

How can I ensure Mapper input these four lines at a time?

Also, I want the file splitting in Hadoop to happen at the record boundary (line number should be a multiple of four), so records don’t get span across multiple split files..

How can this be done?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-26T19:58:57+00:00Added an answer on May 26, 2026 at 7:58 pm

    A few approaches, some dirtier than others:


    The right way

    You may have to define your own RecordReader, InputSplit, and InputFormat. Depending on exactly what you are trying to do, you will be able to reuse some of the already existing ones of the three above. You will likely have to write your own RecordReader to define the key/value pair and you will likely have to write your own InputSplit to help define the boundary.


    Another right way, which may not be possible

    The above task is quite daunting. Do you have any control over your data set? Can you preprocess it in someway (either while it is coming in or at rest)? If so, you should strongly consider trying to transform your dataset int something that is easier to read out of the box in Hadoop.

    Something like:

    ALine1
    ALine2            ALine1;Aline2;Aline3;Aline4
    ALine3
    ALine4        ->
    BLine1
    BLine2            BLine1;Bline2;Bline3;Bline4;
    BLine3
    BLine4
    

    Down and Dirty

    Do you have any control over the file sizes of your data? If you manually split your data on the block boundary, you can force Hadoop to not care about records spanning splits. For example, if your block size is 64MB, write your files out in 60MB chunks.

    Without worrying about input splits, you could do something dirty: In your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clean out the list. Otherwise, don’t emit anything and move on without doing anything.

    The reason why you have to manually split the data is that you are not going to be guaranteed that an entire 4-row record will be given to the same map task.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a file which contains lines of data in the following format: a11
I have this file which I need to read the first bytes to check
I have a file which is combination of PHP and HTML. How can i
I have a file which may be in ASCII or UTF-8 format. I can
I have a file which i'm parsing out myself. Every time i spot a
I have a java class file which I need to run on every file
Suppose I have file which goes like this : a void measure() { a
I have a file which I'm trying to extract information from, the file has
I have a file which has multiple columns, whitespace separated. e.g: data1 data2 data3
I have .zip file which contain csv data. I am reading .zip file using

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.