I’ve been looking around for days trying to find a way to use reduced data for further mapping in Hadoop. I’ve got objects of class A as input data and objects of class B as output data. The problem is that the map step generates not only Bs but new As as well.
Here’s what I’d like to achieve:
1.1 input: a list of As
1.2 map result: for each A a list of new As and a list of Bs is generated
1.3 reduce: filtered Bs are saved as output, filtered As are added to the map jobs
2.1 input: a list of As produced by the first map/reduce
2.2 map result: for each A a list of new As and a list of Bs is generated
2.3 ...
3.1 ...
You should get the basic idea.
I’ve read a lot about chaining but I’m not sure how to combine ChainReducer and ChainMapper or even if this would be the right approach.
So here’s my question: how can I split the mapped data while reducing, so that one part is saved as output and the other part becomes new input data?
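Try using MultipleOutputs. Its Javadoc describes the usage pattern in two parts:
Job submission: declare one named output per record type with MultipleOutputs.addNamedOutput(...) before submitting the job.
Reducer: create a MultipleOutputs instance in setup(), send each record to its named output with write(...), and call close() in cleanup().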
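To make the overall control flow concrete, here is a minimal, Hadoop-free sketch of the iteration from the question: each round maps the current As, sends the Bs to the final output and feeds the surviving As into the next round. Everything in it is an assumption made for illustration: the class `IterativeSplitSketch`, the holder type `Emitted`, the toy `map` rule (halving an integer) and the duplicate filter are hypothetical stand-ins for the real A/B logic. In an actual job, the two `List`s written in the inner loop would presumably be two named outputs of MultipleOutputs, and the driver would point the next job's input at the directory holding the As.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IterativeSplitSketch {

    /** What one map call emits for a single A: new As plus Bs (hypothetical holder type). */
    static final class Emitted {
        final List<Integer> newAs = new ArrayList<>();
        final List<String> bs = new ArrayList<>();
    }

    /** Toy map step (assumption): each A yields one B, and one smaller A while a > 1. */
    static Emitted map(int a) {
        Emitted e = new Emitted();
        e.bs.add("B" + a);
        if (a > 1) {
            e.newAs.add(a / 2);
        }
        return e;
    }

    /** Runs rounds until no As are left or maxRounds is hit; returns all Bs in production order. */
    public static List<String> run(List<Integer> initialAs, int maxRounds) {
        List<String> finalOutput = new ArrayList<>();
        List<Integer> currentAs = new ArrayList<>(initialAs);
        for (int round = 0; round < maxRounds && !currentAs.isEmpty(); round++) {
            List<Integer> nextAs = new ArrayList<>();
            for (int a : currentAs) {
                Emitted e = map(a);
                // The split: Bs are final output, As become input of the next round.
                finalOutput.addAll(e.bs);
                for (int next : e.newAs) {
                    if (!nextAs.contains(next)) { // toy filter: drop duplicate As
                        nextAs.add(next);
                    }
                }
            }
            currentAs = nextAs; // corresponds to chaining the next job onto this round's As
        }
        return finalOutput;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(5, 2), 10)); // prints [B5, B2, B2, B1, B1]
    }
}
```

The `maxRounds` guard is there because nothing in the scheme itself guarantees that the set of As ever becomes empty; a real driver would likely want a similar stop condition.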