I try to implement Hash join in Hadoop. However, Hadoop seems to have already

Question

0

Editorial Team

Asked: May 15, 20262026-05-15T06:00:54+00:00 2026-05-15T06:00:54+00:00

I try to implement Hash join in Hadoop. However, Hadoop seems to have already

0

I try to implement Hash join in Hadoop.

However, Hadoop seems to have already a map-side join and a reduce – side join already implemented.

What is the difference between these techniques and hash join?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-15T06:00:54+00:00

Map-side Join

In a map-side (fragment-replicate) join, you hold one dataset in memory (in say a hash table) and join on the other dataset, record-by-record. In Pig, you’d write

edges_from_list = JOIN a_follows_b BY user_a_id, some_list BY user_id using 'replicated';

taking care that the smaller dataset is on the right. This is extremely efficient, as there is no network overhead and minimal CPU demand.

Reduce Join

In a reduce-side join, you group on the join key using hadoop’s standard merge sort.

<user_id   {A, B, F, ..., Z},  { A, C, G, ..., Q} >

and emit a record for every pair of an element from the first set with an element from the second set:

[A   user_id    A]
[A   user_id    C]
...
[A   user_id    Q]
...
[Z   user_id    Q]

You should design your keys so that the dataset with the fewest records per key comes first — you need to hold the first group in memory and stream the second one past it. In Pig, for a standard join you accomplish this by putting the largest dataset last. (As opposed to the fragment-replicate join, where the in-memory dataset is given last).

Note that for a map-side join the entirety of the smaller dataset must fit in memory. In a standard reduce-side join, only each key’s groups must fit in memory (actually each key’s group except the last one). It’s possible to avoid even this restriction, but it requires care; look for example at the skewed join in Pig.

Merge Join

Finally, if both datasets are stored in total-sorted order on the join key, you can do a merge join on the map side. Same as the reduce-side join, you do a merge sort to cogroup on the join key, and then project (flatten) back out on the pairs.

Because of this, when generating a frequently-read dataset it’s often a good idea to do a total sort in the last pass. Zebra and other databases may also give you total-sorted input for (almost) free.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I try to implement Hash join in Hadoop. However, Hadoop seems to have already

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply