Does anyone know of any Apache Pig documentation that lists all the operators (GROUP BY, STREAM, etc.) along with what Pig does for each one, i.e. what kind and number of MapReduce job(s) the operator results in?
I am specifically interested in streaming: how does it map to MapReduce job(s)?
This is far from a complete list, but I think the following articles/sections are worth reading:
Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience
(Section 4. Compilation to MapReduce)
http://infolab.stanford.edu/~olston/publications/vldb09.pdf
Pig Latin: A Not-So-Foreign Language for Data Processing
(Section 4.2 Map-Reduce Plan Compilation)
http://infolab.stanford.edu/~olston/publications/sigmod08.pdf
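Furthermore, you can always run EXPLAIN or ILLUSTRATE on your script
to see what happens behind the scenes: EXPLAIN prints the logical, physical, and MapReduce plans for a relation, while ILLUSTRATE walks sample data through each step.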
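
For the streaming case specifically, a minimal sketch along these lines may help. The input path `input.txt` and the `cut -f 1` command are placeholders I made up for illustration, not anything from your script. As far as I understand, STREAM is a non-blocking operator (like FOREACH), so it should not trigger a MapReduce job of its own; it gets pipelined into the map or reduce phase of whichever job surrounds it, and the MapReduce plan section of the EXPLAIN output should show where it landed.

```pig
-- Hypothetical input file and streaming command, for illustration only.
A = LOAD 'input.txt' AS (line:chararray);

-- Pipe every record through an external command.
B = STREAM A THROUGH `cut -f 1`;

-- GROUP is a blocking operator, so it introduces a shuffle (map -> reduce boundary).
C = GROUP B BY $0;
D = FOREACH C GENERATE group, COUNT(B);

-- Print the logical, physical, and MapReduce plans for D.
EXPLAIN D;
```

Here I would expect the stream operator to appear in the map plan, since it sits before the GROUP. If you move the STREAM after the GROUP/FOREACH, it should show up in the reduce plan instead, but check the EXPLAIN output for your Pig version rather than taking my word for it.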