I’m running a Pig script, and it all goes very quickly until I get to the FOREACH ... GENERATE FLATTEN(...) line.
Is there a reason that line should run so slowly? (It causes the entire script to time out on a fairly powerful cluster.)
extended = FOREACH kRecords GENERATE *, NORMALIZE(query) AS query_norm:chararray;
-- DESCRIBE extended;
-- extended: {query: chararray,url: chararray,query_norm: chararray}
-- GROUP by both query and url
grouped = GROUP extended BY (query_norm, url);
-- DESCRIBE grouped;
-- grouped: {group: (query_norm: chararray,url: chararray),extended: {(query: chararray,url: chararray,query_norm: chararray)}}
-- Remove multiple items per record (but at the expense of duplicating records)
-- THE LINE BELOW IS THE SLOW ONE!!!
flattened = FOREACH grouped GENERATE FLATTEN(extended.query_norm), FLATTEN(extended.url);
-- THE LINE ABOVE IS THE SLOW ONE!!!
-- Remove duplicates
result = DISTINCT flattened;
Thanks,
Barry
When two FLATTEN(...) operators appear in the same GENERATE, you get the Cartesian product of the two bags. So if a bag produced by the GROUP has N elements, two FLATTEN(...) operators on that same bag generate N*N rows for each group, which can heavily tax CPU, disk, and network. See the following example:
CODE:
INPUT:
OUTPUT:
Note how the 2 records of (1,a) and the 2 records of (1,b) each produced 4 output records, while the single record of (1,c) produced just 1 output record.
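A minimal sketch of the effect, not the original example: the file name pairs.txt, the relation names, and the field names num and letter are all hypothetical, and the input is assumed to hold the five records (1,a), (1,a), (1,b), (1,b), (1,c). The last statement shows one possible way to avoid the blow-up by flattening the group key instead of the bag.

```pig
-- Hypothetical tab-separated input with five records:
-- (1,a), (1,a), (1,b), (1,b), (1,c)
pairs   = LOAD 'pairs.txt' AS (num:int, letter:chararray);
grouped = GROUP pairs BY (num, letter);

-- Two FLATTENs over the same bag: a group holding N records yields N*N rows.
-- Here (1,a) and (1,b) hold 2 records each, so 4 rows each;
-- (1,c) holds 1 record, so 1 row.
crossed = FOREACH grouped GENERATE FLATTEN(pairs.num), FLATTEN(pairs.letter);

-- Flattening the group key yields exactly one row per group,
-- so no cross product is built and the rows are already unique.
keys = FOREACH grouped GENERATE FLATTEN(group) AS (num, letter);
```

Applied to the script in the question, where grouped is keyed by (query_norm, url), something like FOREACH grouped GENERATE FLATTEN(group) should give the same distinct pairs without the N*N intermediate rows, and the trailing DISTINCT would then likely be unnecessary.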