I am using Python and have to work on following scenario using Hadoop Streaming:
a) Map1->Reduce1->Map2->Reduce2
b) I dont want to store intermediate files
c) I dont want to install packages like Cascading, Yelp, Oozie. I have kept them as last option.
I already went through the same kind of discussion on SO and elsewhere but could not find an answer wrt Python. Can you please suggest.
Any reason why? Based on the response, a better solution could be provided.
Intermediates files cannot be avoided, because the o/p of the previous Hadoop job cannot be streamed as i/p to the next job. Create a script like this