Suppose I have two key-value data sets–Data Sets A and B, let’s call them.


Asked: June 11, 2026

Suppose I have two key-value data sets–Data Sets A and B, let’s call them. I want to update all the data in Set A with data from Set B where the two match on keys.

Because I’m dealing with such large quantities of data, I’m using Hadoop to MapReduce. My concern is that to do this key matching between A and B, I need to load all of Set A (a lot of data) into the memory of every mapper instance. That seems rather inefficient.

Would there be a recommended way to do this that doesn’t require repeating the work of loading in A every time?

Some pseudocode to clarify what I’m currently doing:

Load in Data Set A # This seems like the expensive step to always be doing
Foreach key/value in Data Set B:
   If key is in Data Set A:
      Update Data Set A


1 Answer

Editorial Team · Answer 1 · 2026-06-11T12:10:51+00:00

According to the documentation, the MapReduce framework includes the following steps:

1. Map
2. Sort/Partition
3. Combine (optional)
4. Reduce

You’ve described one way to perform your join: loading all of Set A into memory in each Mapper. You’re correct that this is inefficient.

Instead, observe that a large join can be partitioned into arbitrarily many smaller joins if both sets are sorted and partitioned by key. MapReduce sorts the output of each Mapper by key in step (2) above. Sorted Map output is then partitioned by key, so that one partition is created per Reducer. For each unique key, the Reducer will receive all values from both Set A and Set B.
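To make the partitioning idea concrete, here is a minimal plain-Python sketch. It is not Hadoop API code, only a simulation of what the framework does between Map and Reduce. The names `NUM_REDUCERS`, `partition_for`, and `shuffle` are hypothetical, and the CRC32 hash is just a stand-in for whatever hash the partitioner uses. The point is that a key is always routed to the same partition, whichever set it came from.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical reducer count, chosen for illustration


def partition_for(key, num_reducers=NUM_REDUCERS):
    # A deterministic hash sends the same key to the same partition every time.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(records, num_reducers=NUM_REDUCERS):
    """Group (key, value) pairs into per-reducer partitions, each sorted by key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return {p: sorted(pairs) for p, pairs in partitions.items()}


# Records from both sets go through the same shuffle.
set_a = [("k1", "a1"), ("k2", "a2"), ("k3", "a3")]
set_b = [("k2", "b2"), ("k3", "b3")]
partitions = shuffle(set_a + set_b)
```

Each partition can then be joined on its own, so no single task ever needs all of Set A in memory.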

To finish your join, the Reducer only needs to output the key together with the updated value from Set B if one exists, or the original value from Set A otherwise. To distinguish between values from Set A and Set B, try setting a flag on each value the Mapper outputs.
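The flagged reduce-side join described here could look roughly like the sketch below. It is written in plain Python to show the logic only, not actual Hadoop Mapper/Reducer classes. The function names (`map_a`, `map_b`, `reduce_join`, `run_join`) and the `"A"`/`"B"` tags are hypothetical. It also assumes each key appears at most once per set, and that keys found only in Set B are dropped, since the question only asks to update existing entries in A.

```python
from itertools import groupby
from operator import itemgetter


def map_a(key, value):
    # Tag records from Set A so the reducer can tell the sources apart.
    return (key, ("A", value))


def map_b(key, value):
    # Same idea for Set B.
    return (key, ("B", value))


def reduce_join(key, tagged_values):
    """Return (key, value), preferring B's value; None if the key is only in B."""
    by_source = dict(tagged_values)  # assumes at most one value per key per set
    if "A" not in by_source:
        return None  # assumption: B-only keys are not added to A
    return (key, by_source.get("B", by_source["A"]))


def run_join(set_a, set_b):
    mapped = [map_a(k, v) for k, v in set_a] + [map_b(k, v) for k, v in set_b]
    mapped.sort(key=itemgetter(0))  # stands in for the framework's sort by key
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        joined = reduce_join(key, (tagged for _, tagged in group))
        if joined is not None:
            results.append(joined)
    return results


updated_a = run_join(
    [("k1", "a1"), ("k2", "a2")],
    [("k2", "b2"), ("k9", "b9")],
)
```

In this example `k1` keeps its original value, `k2` takes the value from Set B, and `k9` is skipped because it never appeared in Set A.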
