I want to run a Pig script by splitting out two tuples (or whatever

Question

0

Asked: June 14, 20262026-06-14T14:03:53+00:00 2026-06-14T14:03:53+00:00

I want to run a Pig script by splitting out two tuples (or whatever

0

I want to run a Pig script by splitting out two tuples (or whatever it’s called in Pig), based off of criteria in col2, and after manipulating col2, into another column, compare the two manipulated tuples and do an additional exclude.

REGISTER /home/user1/piggybank.jar;

log = LOAD '../user2/hadoop_file.txt' AS (col1, col2);

--log = LIMIT log 1000000;
isnt_filtered = FILTER log BY (NOT col2 == 'Some value');
isnt_generated = FOREACH isnt_filtered GENERATE col2, col1, RANDOM() * 1000000 AS random, com.some.valueManipulation(col1) AS isnt_manipulated;

is_filtered = FILTER log BY (col2 == 'Some value');
is_generated = FOREACH is_filtered GENERATE com.some.calculation(col1) AS is_manipulated;
is_distinct = DISTINCT is_generated;

Splitting and manipulating is the easy part. This is where it gets complicated. . .

merge_filtered = FOREACH is_generated {FILTER isnt_generated BY (NOT isnt_manipulated == is_generated.is_manipulated)};

If I can figure out this line(s), the rest would fall in place.

merge_ordered = ORDER merge_filtered BY random, col2, col1;
merge_limited = LIMIT merge_ordered 400000;

STORE merge_limited into 'file';

Here’s an example of the I/O:

col1                col2            manipulated
This                qWerty          W
Is                  qweRty          R
An                  qwertY          Y
Example             qwErty          E
Of                  qwerTy          T
Example             Qwerty          Q
Data                qWerty          W


isnt
E
Y


col1                col2
This                qWerty
Is                  qweRty
Of                  qwerTy
Example             Qwerty
Data                qWerty

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-14T14:03:55+00:00

I’m still not sure quite what you need, but I believe you can reproduce your input and output with the following (untested):

data = LOAD 'input' AS (col1:chararray, col2:chararray);
exclude = LOAD 'exclude' AS (excl:chararray);

m = FOREACH data GENERATE col1, col2, YourUDF(col2) AS manipulated;
test = COGROUP m BY manipulated, exclude BY excl;

-- Here you can choose IsEmpty or NOT IsEmpty according to whether you want to exclude or include
final = FOREACH (FILTER test BY IsEmpty(exclude)) GENERATE FLATTEN(m);

With the COGROUP, you group all tuples in each relation by the grouping key. If the bag of tuples from exclude is empty, it means that the grouping key was not present in the exclude list, so you keep tuples from m with that key. Conversely, if the grouping key was present in exclude, that bag will not be empty and the tuples from m with that key will be filtered out.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I want to run a Pig script by splitting out two tuples (or whatever

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply