I’m running a PIG script, and it all goes very quickly, until I get

Asked: June 6, 2026 (2026-06-06T21:39:28+00:00)

I’m running a Pig script, and it all goes very quickly until I get to the FOREACH ... GENERATE FLATTEN(...) line.

Is there a reason that line should run so slowly? It causes the entire script to time out on a fairly powerful cluster.

extended = FOREACH kRecords GENERATE *, NORMALIZE(query) AS query_norm:chararray;
-- DESCRIBE extended;
-- extended: {query: chararray,url: chararray,query_norm: chararray}

-- GROUP by both query and url
grouped = GROUP extended BY (query_norm, url);
-- DESCRIBE grouped;
-- grouped: {group: (query_norm: chararray,url: chararray),extended: {(query: chararray,url: chararray,query_norm: chararray)}}

-- Remove multiple items per record (but at the expense of duplicating records)
-- THE LINE BELOW IS THE SLOW ONE!!!
flattened = FOREACH grouped GENERATE FLATTEN(extended.query_norm), FLATTEN(extended.url);
-- THE LINE ABOVE IS THE SLOW ONE!!!

-- Remove duplicates
result = DISTINCT flattened;

Thanks,
Barry

1 Answer

Editorial Team · Answer 1 · 2026-06-06T21:39:31+00:00

When two FLATTEN(...) operators appear in the same GENERATE, Pig produces the Cartesian product of the two bags. So if the bag produced by the GROUP has N elements, flattening two projections of that same bag generates N*N rows for each group, which can heavily tax CPU, disk, and network. See the following example:

CODE:

inpt = load '/pig_fun/input/group.txt' as (c1, c2);
grp = group inpt by (c1, c2);
flt = foreach grp generate FLATTEN(inpt.c1), FLATTEN(inpt.c2);

INPUT (tab-separated; rows as implied by the counts described after the output):

1	a
1	a
1	b
1	b
1	c

OUTPUT:

(1,a)
(1,a)
(1,a)
(1,a)
(1,b)
(1,b)
(1,b)
(1,b)
(1,c)

Note how the two (1,a) records and the two (1,b) records each produced four output rows (2*2), while the single (1,c) record produced just one.
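
In the question's script the grouping key is already the (query_norm, url) pair, so one way to avoid the product is to flatten the group key instead of the two bag projections. The following is a minimal sketch, assuming the relation and field names from the question (extended, query_norm, url) and assuming that only the distinct pairs are wanted; the aliases result_from_group, pairs, and result_from_distinct are hypothetical names chosen for illustration.

```pig
-- Sketch, assuming `extended` has the schema shown in the question:
-- {query: chararray, url: chararray, query_norm: chararray}

-- Option 1: flatten the group key, giving one row per distinct (query_norm, url).
grouped = GROUP extended BY (query_norm, url);
result_from_group = FOREACH grouped GENERATE FLATTEN(group) AS (query_norm, url);

-- Option 2: skip the GROUP and de-duplicate the projected pairs directly.
pairs = FOREACH extended GENERATE query_norm, url;
result_from_distinct = DISTINCT pairs;
```

Neither form generates N*N rows per group: option 1 emits one row per group, and option 2 emits one row per input record before DISTINCT removes the repeats.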
