The Archive Base

Editorial Team
Asked: May 28, 2026

I’ve been starting to learn hadoop, and currently I’m trying to process log files

I’ve started learning Hadoop, and I’m currently trying to process log files that are not very well structured: the value I would normally use as the M/R key typically appears only once, at the top of the file. So my mapping function takes that value as the key and then scans the rest of the file to aggregate the values that need to be reduced. A [fake] log might look like this:

## log.1
SOME-KEY
2012-01-01 10:00:01 100
2012-01-02 08:48:56 250
2012-01-03 11:01:56 212
.... many more rows

## log.2
A-DIFFERENT-KEY
2012-01-01 10:05:01 111
2012-01-02 16:46:20 241
2012-01-03 11:01:56 287
.... many more rows

## log.3
SOME-KEY
2012-02-01 09:54:01 16
2012-02-02 05:53:56 333
2012-02-03 16:53:40 208
.... many more rows

I want to accumulate the 3rd column for each key. I have a cluster of several nodes running this job, and so I was bothered by several issues:

1. File Distribution

Given that Hadoop’s HDFS works in 64 MB blocks (by default), and every file is distributed over the cluster, can I be sure that the correct key will be matched against the proper numbers? That is, if the block containing the key is on one node, and a block containing data for that same key (a different part of the same log) is on a different machine, how does the M/R framework match the two (if at all)?

2. Block Assignment

For text logs such as the ones described, how is each block’s cutoff point decided? Is it after a row ends, or at exactly 64 MB (binary)? Does it even matter? This relates to my #1, where my concern is that the proper values are matched with the correct keys over the entire cluster.

3. File structure

What is the optimal file structure (if any) for M/R processing? I’d probably be far less worried if a typical log looked like this:

A-DIFFERENT-KEY 2012-01-01 10:05:01 111
SOME-KEY        2012-01-02 16:46:20 241
SOME-KEY        2012-01-03 11:01:56 287
A-DIFFERENT-KEY 2012-02-01 09:54:01 16
A-DIFFERENT-KEY 2012-02-02 05:53:56 333
A-DIFFERENT-KEY 2012-02-03 16:53:40 208
...

However, the logs are huge and it would be very costly (time) to convert them to the above format. Should I be concerned?

4. Job Distribution

Are the jobs assigned such that only a single JobClient handles an entire file? Or rather, how are the keys/values coordinated between all the JobClients? Again, I’m trying to guarantee that my shady log structure still yields correct results.

1 Answer

  1. Editorial Team
    Added an answer on May 28, 2026 at 1:52 pm

    Given that Hadoop’s HDFS works in 64 MB blocks (by default), and every file is distributed over the cluster, can I be sure that the correct key will be matched against the proper numbers? That is, if the block containing the key is on one node, and a block containing data for that same key (a different part of the same log) is on a different machine, how does the M/R framework match the two (if at all)?

    How keys and values are handed to the mapper depends on the InputFormat class. Hadoop ships with several InputFormat classes, and custom InputFormat classes can also be defined.

    If TextInputFormat (the default, a subclass of FileInputFormat) is used, the key passed to the mapper is the byte offset of the line within the file and the value is the line itself. In most cases the offset is ignored and only the line is processed by the mapper. So, by default, each line in the log file becomes one value passed to the mapper.
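To make the shape of those records concrete, here is a minimal pure-Python sketch (not Hadoop code) that produces (byte offset, line) pairs from a small log. The function name `line_records` and the sample bytes are illustrative assumptions.

```python
def line_records(data: bytes):
    """Yield (byte_offset, line) pairs, the record shape a line-oriented reader hands to a mapper."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts in the file; the value is the line without its newline.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


# Hypothetical sample taken from the log.1 layout in the question.
log = b"SOME-KEY\n2012-01-01 10:00:01 100\n2012-01-02 08:48:56 250\n"
records = list(line_records(log))

for offset, line in records:
    print(offset, line)
```

Note that "SOME-KEY" arrives as just another value (at offset 0); nothing in the record itself tells the mapper that it is the key for the rows that follow.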

    Related data in a log file, as in the OP's case, may be split across blocks. Each split is then processed by a different mapper, and Hadoop cannot relate the rows in a later split to the key at the top of the file. One way around this is to let a single mapper process the complete file by overriding the FileInputFormat#isSplitable method to return false. This is not an efficient approach if the file is very large, because a single mapper must then read the whole file and parallelism within the file is lost.
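The whole-file approach can be sketched as follows, assuming each mapper sees one complete, unsplit file. This is a pure-Python simulation rather than Hadoop code; `map_whole_file`, `reduce_sums`, and the shortened sample logs are hypothetical.

```python
from collections import defaultdict


def map_whole_file(lines):
    """Mapper for one unsplit file: the first line is the key, later rows carry the values."""
    it = iter(lines)
    key = next(it).strip()
    total = 0
    for row in it:
        fields = row.split()
        if len(fields) == 3:          # date, time, value
            total += int(fields[2])
    # Emit a single partial sum per file to keep the shuffle small.
    yield key, total


def reduce_sums(pairs):
    """Reducer: add up the partial sums for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Shortened versions of the three sample logs from the question.
logs = {
    "log.1": ["SOME-KEY", "2012-01-01 10:00:01 100", "2012-01-02 08:48:56 250", "2012-01-03 11:01:56 212"],
    "log.2": ["A-DIFFERENT-KEY", "2012-01-01 10:05:01 111", "2012-01-02 16:46:20 241", "2012-01-03 11:01:56 287"],
    "log.3": ["SOME-KEY", "2012-02-01 09:54:01 16", "2012-02-02 05:53:56 333", "2012-02-03 16:53:40 208"],
}

mapped = [pair for lines in logs.values() for pair in map_whole_file(lines)]
result = reduce_sums(mapped)
print(result)
```

Because the key and all of its rows pass through the same mapper, the sums for SOME-KEY from log.1 and log.3 are combined correctly in the reduce step.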

    For text logs such as the ones described, how is each block’s cutoff point decided? Is it after a row ends, or at exactly 64 MB (binary)? Does it even matter? This relates to my #1, where my concern is that the proper values are matched with the correct keys over the entire cluster.

    By default each HDFS block is exactly 64 MB, unless the file is smaller than 64 MB or the default block size has been modified; only the last block of a file may be smaller. Record boundaries are not considered, so part of a line can be in one block and the rest in the next. Hadoop's record reader understands record boundaries, however, so even if a record (line) is split across blocks it is still processed by exactly one mapper. This may require transferring some data from the next block.
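One common way to get this behaviour, sketched here in pure Python as an assumption about the general technique rather than a copy of Hadoop's reader, is for every split except the first to discard its leading (possibly partial) line, and for every split to finish the last line it starts even if that means reading past its own end. The names `make_splits` and `read_split` and the tiny 30-byte split size are hypothetical.

```python
def make_splits(size, split_size):
    """Cut [0, size) into fixed-size byte ranges, ignoring line boundaries."""
    return [(s, min(s + split_size, size)) for s in range(0, size, split_size)]


def read_split(data: bytes, start: int, end: int):
    """Return the lines owned by the byte range [start, end)."""
    pos = start
    if start != 0:
        # Discard everything up to the next newline: the previous split reads that line.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    # A line that begins at or before `end` belongs to this split, even if it runs past `end`.
    while pos < len(data) and pos <= end:
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        lines.append(data[pos:stop].decode("utf-8"))
        pos = stop + 1
    return lines


data = (b"SOME-KEY\n"
        b"2012-01-01 10:00:01 100\n"
        b"2012-01-02 08:48:56 250\n"
        b"2012-01-03 11:01:56 212\n")

per_split = [read_split(data, s, e) for s, e in make_splits(len(data), 30)]
all_lines = [line for lines in per_split for line in lines]
print(per_split)
```

Every line is read exactly once, but only the first split ever sees "SOME-KEY". That is the matching problem raised in the question: mappers working on later splits get the rows without the key.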

    Are the jobs assigned such that only a single JobClient handles an entire file? Or rather, how are the keys/values coordinated between all the JobClients? Again, I’m trying to guarantee that my shady log structure still yields correct results.

    It is not exactly clear what is being asked here. I would suggest going through some tutorials and coming back with more specific questions.
