The Archive Base


Editorial Team
Asked: May 15, 2026 (2026-05-15T14:48:56+00:00)

I have a massive amount of input data (that’s why I use Hadoop) and

I have a massive amount of input data (that’s why I use Hadoop) and there are multiple tasks that can be solved with various MapReduce steps of which the first mapper needs all the data as input.

My goal: Compute these different tasks as fast as possible.

I currently run them sequentially, each reading in all the data. I assume it would be faster to combine the tasks and execute their shared parts (such as feeding all the data to the mapper) only once.

I was wondering whether and how I can combine these tasks. For every input key/value pair, the mapper could emit a “super key” that includes a task id and the task-specific key data, along with a value. Reducers would then receive key/value pairs for a given task and task-specific key, and could decide from the super key which task to perform on the included key and values.

In pseudo code:

map(key, value):
    emit(SuperKey("Task 1", IncludedKey), value)
    emit(SuperKey("Task 2", AnotherIncludedKey), value)

reduce(key, values):
    if key.taskid == "Task 1":
        for value in values:
            // do stuff with key.includedkey and value
    else:
        // do something else

The key could be a WritableComparable which can include all the necessary information.

Note: the pseudo code suggests a terrible architecture and it can definitely be done in a smarter way.

My questions are:

  • Is this a sensible approach?
  • Are there better alternatives?
  • Does it have some terrible drawback?
  • Would I need a custom Partitioner class for this approach?

Context: The data consists of some millions of RDF quadruples and the tasks are to calculate clusters, statistics and similarities. Some tasks can be solved easily with just Hadoop Counters in a reducer, but some need multiple MapReduce steps.

The computation will eventually take place on Amazon’s Elastic MapReduce. All tasks are to be computed on the whole dataset and as fast as possible.


1 Answer

  1. Editorial Team
    Added an answer on May 15, 2026 at 2:48 pm (2026-05-15T14:48:56+00:00)
    • Is this a sensible approach?

    There’s nothing inherently wrong with it, other than that it couples the maintenance of the different jobs’ logic. I believe it will save you some disk I/O, which could be a win if disk is a bottleneck for your process (on small clusters this can be the case).
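To make the single-pass idea concrete, here is a minimal in-memory sketch in Python. It does not use the Hadoop API. The two tasks (`predicate_count`, `graph_subjects`), the quad layout `(subject, predicate, object, graph)`, and all function names are assumptions made up for illustration. It only shows that each input record is read once while both tasks still receive their own key/value pairs.

```python
from collections import defaultdict
from typing import NamedTuple


class SuperKey(NamedTuple):
    """Composite key: the task a pair belongs to, plus that task's own key."""
    task_id: str
    included_key: str


def combined_map(quad):
    """Emit one pair per (hypothetical) task for a single input quad."""
    subject, predicate, _obj, graph = quad
    yield SuperKey("predicate_count", predicate), 1
    yield SuperKey("graph_subjects", graph), subject


def combined_reduce(key, values):
    """Dispatch on the task id carried by the super key."""
    if key.task_id == "predicate_count":
        return sum(values)
    if key.task_id == "graph_subjects":
        return sorted(set(values))
    raise ValueError(f"unknown task: {key.task_id}")


def run_job(quads):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    records_read = 0
    groups = defaultdict(list)
    for quad in quads:  # the input is scanned exactly once for all tasks
        records_read += 1
        for key, value in combined_map(quad):
            groups[key].append(value)
    results = {key: combined_reduce(key, groups[key]) for key in sorted(groups)}
    return results, records_read
```

Calling `run_job` on three quads returns results for both tasks together with a read count of three. Two sequential jobs would have read the same input six times in total.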

    • Are there better alternatives?

    It may be prudent to write a somewhat framework-y Mapper and Reducer, each of which accepts, as configuration parameters, references to the classes it should defer to for the actual mapping and reducing. This may solve the aforementioned coupling of the code (maybe you’ve already thought of this).
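A rough sketch of that delegation idea, again as plain Python and not as Hadoop code: the configuration key `combined.tasks`, the registry, and the task classes are all hypothetical names. In a real job the class names would presumably be read from the job configuration instead of a dict.

```python
class PredicateCount:
    """Hypothetical task: number of quads per predicate."""
    task_id = "predicate_count"

    def map(self, quad):
        yield quad[1], 1

    def reduce(self, key, values):
        return sum(values)


class DistinctSubjects:
    """Hypothetical task: number of distinct subjects per graph."""
    task_id = "distinct_subjects"

    def map(self, quad):
        yield quad[3], quad[0]

    def reduce(self, key, values):
        return len(set(values))


# Lookup table standing in for class loading by name.
TASK_REGISTRY = {cls.__name__: cls for cls in (PredicateCount, DistinctSubjects)}


def load_tasks(conf):
    """Resolve the class names listed in the configuration into task objects."""
    names = [name.strip() for name in conf["combined.tasks"].split(",")]
    tasks = [TASK_REGISTRY[name]() for name in names]
    return {task.task_id: task for task in tasks}


class DelegatingMapper:
    """Runs every configured task's map logic and tags its output."""

    def __init__(self, tasks):
        self.tasks = tasks

    def map(self, record):
        for task_id, task in self.tasks.items():
            for key, value in task.map(record):
                yield (task_id, key), value


class DelegatingReducer:
    """Hands each group to the task named in the super key."""

    def __init__(self, tasks):
        self.tasks = tasks

    def reduce(self, super_key, values):
        task_id, key = super_key
        return self.tasks[task_id].reduce(key, values)
```

With this shape, adding or removing a task is a configuration change, and each task's logic lives in its own class, so the generic mapper and reducer never need an if/else chain over task ids.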

    • Does it have some terrible drawback?

    The only thing I can think of is that if one task’s map logic fails to complete in a timely manner, the scheduler may start a duplicate attempt on another node for that piece of input data (Hadoop’s speculative execution). In a combined job, that duplicate attempt repeats the map work of every task for that piece of input, not just the slow one. Without knowing more about your process, it’s hard to say whether this would matter much. The same would hold for the reducers.

    • Would I need a custom Partitioner class for this approach?

    Probably, depending on what you’re doing. I think in general, if you’re writing a custom output WritableComparable, you’ll need custom partitioning as well. There may be a library Partitioner that can be configured for your needs, though (such as KeyFieldBasedPartitioner, if you make your output key of type Text and use String field separators instead of rolling your own key class).
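The Text-key variant can be sketched as follows. This is an illustrative Python function loosely modelled on the idea of field-based partitioning. It is not KeyFieldBasedPartitioner's actual implementation, and the tab separator, the CRC32 hash, and the function names are assumptions.

```python
import zlib

FIELD_SEPARATOR = "\t"  # assumed separator between task id and task key


def make_text_key(task_id, included_key):
    """Flatten the super key into a single separator-delimited string."""
    return f"{task_id}{FIELD_SEPARATOR}{included_key}"


def partition(text_key, num_reducers, num_key_fields=2):
    """Choose a reducer index from the first `num_key_fields` fields of the key.

    CRC32 is used instead of Python's built-in hash() because the built-in
    string hash is randomized per process and would not be stable across nodes.
    """
    fields = text_key.split(FIELD_SEPARATOR)[:num_key_fields]
    digest = zlib.crc32(FIELD_SEPARATOR.join(fields).encode("utf-8"))
    return digest % num_reducers
```

The only hard requirement is that equal super keys always reach the same reducer. Partitioning on both fields (the default here) spreads each task's keys across reducers, whereas `num_key_fields=1` sends an entire task to a single reducer, which is simpler to reason about but likely to balance load poorly.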

    HTH. If you can give a little more context, maybe I could offer more advice. Good luck!



© 2021 The Archive Base. All Rights Reserved