In the following code, how much does renaming fields after a join hurt the computation time of the script? Is it optimized in Pig? Or does it really go through every record?
-- tables A: (f1, f2, id) and B: (g1, g2, id) to be joined by id
C = JOIN A BY id, B BY id;
C = FOREACH C GENERATE A::f1 AS f1, A::f2 AS f2, B::id AS id, B::g1 AS g1, B::g2 AS g2;
Does the FOREACH command go through every record of C? If yes, is there a way to optimize?
Thanks.
Don’t worry about optimizing this. There may be a slight overhead in renaming the fields, but it won’t trigger an additional Map/Reduce job. The field projection will occur in the reducer after your JOIN. Consider the two pieces of code and the Map Reduce plans given by explain below.
Without Renaming
With Renaming
The difference is in the Reduce plans. Without renaming:
versus with renaming:
In short, there will be other things you can optimize in your script before worrying about renaming. Since you’ll be going through every record anyway because of the join, renaming will just be a cheap extra step.
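To check this on your own data, a minimal sketch of the comparison might look like the script below. The file names 'a.tsv' and 'b.tsv', the field types, and the aliases C1 and C2 are assumptions made for illustration; only the JOIN and FOREACH statements come from the question. Running EXPLAIN on each alias should let you compare the Map Reduce plans side by side and confirm whether the renaming adds a job or just a projection step.

```pig
-- Hypothetical inputs: file names and field types are assumptions, not from the question.
A = LOAD 'a.tsv' AS (f1:chararray, f2:chararray, id:int);
B = LOAD 'b.tsv' AS (g1:chararray, g2:chararray, id:int);

-- Variant 1: join only, fields keep their disambiguated names (A::f1, B::g1, ...).
C1 = JOIN A BY id, B BY id;

-- Variant 2: same join, followed by the renaming projection from the question.
C2 = FOREACH C1 GENERATE A::f1 AS f1, A::f2 AS f2, B::id AS id, B::g1 AS g1, B::g2 AS g2;

-- Print the logical, physical, and Map Reduce plans for each variant.
EXPLAIN C1;
EXPLAIN C2;
```

Separate aliases are used here only so both plans can be inspected in one script; reusing C as in the question works the same way.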