Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?
More specifically, here’s what I exactly need to do:
I am having two sets of data:
-
point information which is stored as (tile_number, point_id:point_info) , this is a 1:n key-value pairs. This means for every tile_number, there might be several point_id:point_info
-
Line information which is stored as (tile_number, line_id:line_info) , this is again a 1:m key-value pairs and for every tile_number, there might be more than one line_id:line_info
As you can see the tile_numbers are the same between the two datasets. now what I really need is to join these two datasets based on each tile_number. In other words for every tile_number, we have n point_id:point_info and m line_id:line_info. What I want to do is to join all pairs of point_id:point_info with all pairs of line_id:line_info for every tile_number
In order to clarify, here’s an example:
For point pairs:
(tile0, point0)
(tile0, point1)
(tile1, point1)
(tile1, point2)
for line pairs:
(tile0, line0)
(tile0, line1)
(tile1, line2)
(tile1, line3)
what I want is as following:
for tile 0:
(tile0, point0:line0)
(tile0, point0:line1)
(tile0, point1:line0)
(tile0, point1:line1)
for tile 1:
(tile1, point1:line2)
(tile1, point1:line3)
(tile1, point2:line2)
(tile1, point2:line3)
Use a mapper that outputs titles as keys and points/lines as values. You have to differentiate between the point output values and line output values. For instance you can use a special character (even though a binary approach would be much better).
So the map output will be something like:
Then, at the reducer, your input will have this structure:
and you will have to take the values separate the points and the lines, do a cross product and output each pair of the cross-product , like this:
If you can already easily differentiate between the point values and the line values (depending on your application specifications) you don’t need the special characters (*,_)
Regarding the cross-product which you have to do in the reducer:
You first iterate through the entire values List, separate them into 2 list:
Then do the cross-product using 2 nested for loops.
Then iterate through the resulting list and for each element output: