I have a many-to-many mapping table between two collections. Each row in the mapping table represents a possible mapping with a weight score.
mapping(id1, id2, weight)
Query: Generate a one-to-one mapping between id1 and id2. Use the lowest weight to remove duplicate mappings. If there is a tie, output any one of the tied mappings.
Example input:
(1, X, 1)
(1, Y, 2)
(2, X, 3)
(2, Y, 1)
(3, Z, 2)
Output:
(1, X)
(2, Y)
(3, Z)
Both 1 and 2 are mapped to both X and Y. We pick (1, X) and (2, Y) because they have the lowest weights (1 each). (3, Z) is the only mapping for 3, so it is kept as is.
I solved it with a Java UDF. It's not perfect, in the sense that it won't maximize the number of one-to-one mappings, but it's good enough.
Pig:
Java UDF:
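For illustration, here is a minimal sketch of one greedy selection that matches the behavior described above. It is plain Java, not the original UDF, and it does not use the Pig UDF interface. The names GreedyOneToOne, Row, and select are hypothetical. It assumes that ids can be handled as strings and that weights are integers: sort the rows by weight, then accept a row only if neither of its ids has already been used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not the original UDF: greedy one-to-one selection.
class GreedyOneToOne {

    // One row of mapping(id1, id2, weight); integer weights are an assumption.
    static final class Row {
        final String id1;
        final String id2;
        final int weight;

        Row(String id1, String id2, int weight) {
            this.id1 = id1;
            this.id2 = id2;
            this.weight = weight;
        }
    }

    // Returns the chosen pairs as "(id1, id2)" strings, in order of acceptance.
    static List<String> select(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        // Lowest weight first; List.sort is stable, so ties keep input order.
        sorted.sort(Comparator.comparingInt((Row r) -> r.weight));

        Set<String> usedId1 = new HashSet<>();
        Set<String> usedId2 = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (Row r : sorted) {
            // Accept a row only if neither side has been matched yet.
            if (!usedId1.contains(r.id1) && !usedId2.contains(r.id2)) {
                usedId1.add(r.id1);
                usedId2.add(r.id2);
                result.add("(" + r.id1 + ", " + r.id2 + ")");
            }
        }
        return result;
    }
}
```

On the example input, select returns (1, X), (2, Y), (3, Z). The greedy choice is also what keeps it from maximizing the number of mappings: for the rows (1, X, 1), (1, Y, 2), (2, X, 3) it picks only (1, X), although (1, Y) and (2, X) together would give two mappings.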