The Archive Base

Editorial Team
Asked: May 31, 2026

I am getting started with Hadoop and am building a MapReduce chain for “customers who bought x also bought y”, where y is the product purchased most frequently with x. I am looking for advice on making this task more efficient, by which I mean reducing the amount of data shuffled from the mapper nodes to the reducer nodes. My goal is a little different from other “customers who bought x” scenarios: I only want to store the single most commonly purchased product for each product, not a list of co-purchased products ranked by frequency.

I am following this blog post to guide my approach.

If, as I understand it, one of the big performance limiters in Hadoop is shuffling data from the mapper nodes to the reducer nodes, then I want to keep the amount of shuffled data to a minimum in every phase of the MapReduce chain.

Let’s say my initial data set is a SQL table purchases_products, a join table between a purchase and products that were bought in that purchase. I’ll feed select x.product_id, y.product_id from purchases_products x inner join purchases_products y on x.purchase_id = y.purchase_id and x.product_id != y.product_id into my MapReduce operation.

My MapReduce strategy is to map each (product_id_x, product_id_y) row to the key product_id_x_product_id_y with the value 1, and then sum the values in my reduce step. At the end I can split the keys and store the pairs back to a SQL table.
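
For reference, the strategy above can be sketched roughly as follows. This is a plain-Java simulation, not real Hadoop code: the class and method names (PairCountSketch, countPairs, topPartner) are made up for illustration, and the in-memory list stands in for the rows returned by the SQL query.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the pair-count job; no Hadoop classes are used.
class PairCountSketch {

    // "Map" step: build the composite key "x_y" for one input row.
    static String mapKey(int productX, int productY) {
        return productX + "_" + productY;
    }

    // "Reduce" step: sum the 1s emitted for each composite key.
    static Map<String, Integer> countPairs(List<int[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (int[] row : rows) {
            counts.merge(mapKey(row[0], row[1]), 1, Integer::sum);
        }
        return counts;
    }

    // Follow-up step: split the keys and keep the most frequent y for each x.
    // Ties are broken arbitrarily (whichever pair is visited first wins).
    static Map<Integer, Integer> topPartner(Map<String, Integer> counts) {
        Map<Integer, Integer> bestY = new HashMap<>();
        Map<Integer, Integer> bestCount = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String[] parts = e.getKey().split("_");
            int x = Integer.parseInt(parts[0]);
            int y = Integer.parseInt(parts[1]);
            if (e.getValue() > bestCount.getOrDefault(x, 0)) {
                bestCount.put(x, e.getValue());
                bestY.put(x, y);
            }
        }
        return bestY;
    }

    public static void main(String[] args) {
        List<int[]> rows = List.of(
            new int[] {1, 2}, new int[] {1, 2}, new int[] {1, 3},
            new int[] {2, 1}, new int[] {2, 1}, new int[] {3, 1});
        System.out.println(topPartner(countPairs(rows)));
    }
}
```

The point of the sketch is the data volume: every input row produces one (key, 1) record before any summing happens, while the final result has only one entry per product.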

My problem with this operation is that it shuffles a potentially huge number of rows, even though the result set I want to produce is only count(products) big. Ideally, I’d like a combiner step to narrow the number of rows shuffled to the reducers during this phase, but I don’t see a way to do this reliably.

Is this simply a limitation of the task at hand, or are there Hadoop tricks for organizing the workflow that will help me shrink the data shuffle during the second step? Is my worry about shuffle size appropriate in this case, or not?

Thanks!

1 Answer

  1. Editorial Team
    Added an answer on May 31, 2026 at 4:29 pm

    Depending on how big your product set is (and therefore how many product pairs are possible), you could look into map-side ‘local’ aggregation.

    Maintain a map of product pairs to frequency counts in your mapper: rather than writing each product pair and the value 1 to the context, accumulate the counts in the map. When the map reaches a predefined size, flush it to the output context. You could even use an LRU map to keep the most recently observed pairs (which tend to be the frequent ones) in memory, and write out the ‘expired’ entries when they are forced out.
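
A minimal sketch of the flush-at-threshold idea, in plain Java so it runs without a cluster. Everything named here is an assumption for illustration: BufferingPairCounter is a hypothetical helper, maxEntries is an arbitrary threshold, and the emit callback stands in for the call that would write to the Hadoop output context.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch of map-side aggregation. The emit callback stands in for writing
// a (key, value) record to the output context; no Hadoop classes are used.
class BufferingPairCounter {
    private final int maxEntries;                   // hypothetical flush threshold
    private final BiConsumer<String, Integer> emit; // stand-in for the output context
    private final Map<String, Integer> buffer = new HashMap<>();

    BufferingPairCounter(int maxEntries, BiConsumer<String, Integer> emit) {
        this.maxEntries = maxEntries;
        this.emit = emit;
    }

    // Called once per input row, where a plain mapper would emit (pair, 1).
    void add(int productX, int productY) {
        buffer.merge(productX + "_" + productY, 1, Integer::sum);
        if (buffer.size() >= maxEntries) {
            flush();
        }
    }

    // Emit every partial count and clear the buffer. This must also be
    // called once after the last input row, or buffered counts are lost.
    void flush() {
        buffer.forEach(emit);
        buffer.clear();
    }

    public static void main(String[] args) {
        BufferingPairCounter counter = new BufferingPairCounter(
            1000, (pair, count) -> System.out.println(pair + "\t" + count));
        for (int i = 0; i < 5; i++) {
            counter.add(1, 2);
        }
        counter.add(1, 3);
        counter.flush();   // six input rows leave the mapper as two records
    }
}
```

Because the reducer only sums, it does not matter that a pair may be emitted several times with partial counts; the totals come out the same.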

    For an example adapted for the Word Count example, see http://www.wikidoop.com/wiki/Hadoop/MapReduce/Mapper#Map_Aggregation

    Of course, if you have a huge product set, or the product pairings are close to random, this isn’t going to save you much. You also need to work out how big the map can get before you exhaust the JVM memory available.
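
To keep memory bounded, the LRU variant can be sketched with the JDK's LinkedHashMap in access order, which calls removeEldestEntry after each put. As before this is only an illustration under assumptions: LruPairBuffer, maxEntries and the emit callback are hypothetical names, and the right cap depends on your heap size and key size.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch of a bounded LRU buffer: recently touched pairs stay in memory, and
// the least recently used pair is emitted when the cap is exceeded.
class LruPairBuffer extends LinkedHashMap<String, Integer> {
    private final int maxEntries;                   // hypothetical cap on entries
    private final BiConsumer<String, Integer> emit; // stand-in for the output context

    LruPairBuffer(int maxEntries, BiConsumer<String, Integer> emit) {
        super(16, 0.75f, true);                     // true = access order (LRU)
        this.maxEntries = maxEntries;
        this.emit = emit;
    }

    // Add 1 to the count for this pair; get() marks the entry as recently used.
    void increment(String pair) {
        Integer current = get(pair);
        put(pair, current == null ? 1 : current + 1);
    }

    // Invoked by put() after an insertion; emits the evicted partial count.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
        if (size() > maxEntries) {
            emit.accept(eldest.getKey(), eldest.getValue());
            return true;
        }
        return false;
    }

    // Emit whatever is still buffered; call once after the last input row.
    void flush() {
        forEach(emit);
        clear();
    }

    public static void main(String[] args) {
        LruPairBuffer buffer = new LruPairBuffer(
            2, (pair, count) -> System.out.println(pair + "\t" + count));
        for (String pair : new String[] {"1_2", "1_2", "1_3", "1_2", "1_4"}) {
            buffer.increment(pair);   // inserting "1_4" evicts "1_3"
        }
        buffer.flush();
    }
}
```

With a cap of 2, the hot pair 1_2 stays buffered while the cold pair 1_3 is pushed out, which is the behaviour you want when a few pairs dominate the input.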

    You can also look into reducing the amount of data stored in your output Key / Value objects:

    • Are the product IDs integers? If they are relatively low in value, they may benefit from being written as a VIntWritable rather than an IntWritable.
    • If they are integers, are you writing the product pair key as a string of the concatenated IDs, or using a custom key with two int fields (writing 4+4 bytes rather than a potentially larger number for a string representation)?
    • Are you writing the value ‘1’ out as a VIntWritable?
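
To get a feel for the difference, here is a rough size comparison in plain Java. It does not use Hadoop's Writable classes: the fixed-width case uses DataOutputStream, the string case counts UTF-8 payload bytes only, and varintBytes is a generic base-128 encoding used purely as an illustration, so the exact numbers for VIntWritable or Text may differ. KeySizeSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class KeySizeSketch {

    // Two fixed-width ints, as a custom key with two int fields would write.
    static int fixedPairBytes(int x, int y) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(x);
            out.writeInt(y);
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // not expected for an in-memory stream
        }
        return bytes.size();
    }

    // The same pair as a concatenated string (payload only, no length prefix).
    static int stringPairBytes(int x, int y) {
        return (x + "_" + y).getBytes(StandardCharsets.UTF_8).length;
    }

    // Size of a non-negative int in a generic base-128 varint (7 bits per byte).
    static int varintBytes(int value) {
        int size = 1;
        while ((value >>>= 7) != 0) {
            size++;
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(fixedPairBytes(1234567, 7654321));   // 8
        System.out.println(stringPairBytes(1234567, 7654321));  // 15
        System.out.println(varintBytes(1));                     // 1
        System.out.println(varintBytes(1234567));               // 3
    }
}
```

A few bytes per record looks small, but it is multiplied by every record in the shuffle, so it is worth checking alongside the aggregation changes.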