I want to run a Pig script that splits the data into two sets of tuples (or whatever they’re called in Pig) based on criteria in col2, manipulates col2 into another column, and then compares the two manipulated sets to do an additional exclude.
REGISTER /home/user1/piggybank.jar;
log = LOAD '../user2/hadoop_file.txt' AS (col1, col2);
--log = LIMIT log 1000000;
isnt_filtered = FILTER log BY (NOT col2 == 'Some value');
isnt_generated = FOREACH isnt_filtered GENERATE col2, col1, RANDOM() * 1000000 AS random, com.some.valueManipulation(col1) AS isnt_manipulated;
is_filtered = FILTER log BY (col2 == 'Some value');
is_generated = FOREACH is_filtered GENERATE com.some.calculation(col1) AS is_manipulated;
is_distinct = DISTINCT is_generated;
Splitting and manipulating is the easy part. This is where it gets complicated...
merge_filtered = FOREACH is_generated {FILTER isnt_generated BY (NOT isnt_manipulated == is_generated.is_manipulated)};
If I can figure out this line(s), the rest would fall in place.
merge_ordered = ORDER merge_filtered BY random, col2, col1;
merge_limited = LIMIT merge_ordered 400000;
STORE merge_limited into 'file';
Here’s an example of the I/O:
col1     col2    manipulated
This     qWerty  W
Is       qweRty  R
An       qwertY  Y
Example  qwErty  E
Of       qwerTy  T
Example  Qwerty  Q
Data     qWerty  W

isnt
E
Y

col1     col2
This     qWerty
Is       qweRty
Of       qwerTy
Example  Qwerty
Data     qWerty
I’m still not sure quite what you need, but I believe you can reproduce your input and output with the following (untested):
With the COGROUP, you group all tuples in each relation by the grouping key. If the bag of tuples from exclude is empty, it means that the grouping key was not present in the exclude list, so you keep tuples from m with that key. Conversely, if the grouping key was present in exclude, that bag will not be empty and the tuples from m with that key will be filtered out.
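As a rough, untested sketch of that COGROUP idea: the relation names m and exclude follow the explanation above, while data, grouped, and kept are names I made up for illustration. The file path, the 'Some value' test, and the two UDFs are taken from the question, and I'm assuming both UDFs return values that can be compared directly as grouping keys.

```pig
-- Sketch only, not tested. m and exclude match the explanation above;
-- data, grouped and kept are hypothetical names.
REGISTER /home/user1/piggybank.jar;

data = LOAD '../user2/hadoop_file.txt' AS (col1, col2);

-- Exclude list: manipulated values from the rows where col2 matches.
is_filtered = FILTER data BY (col2 == 'Some value');
exclude = FOREACH is_filtered GENERATE com.some.calculation(col1) AS manipulated;

-- Candidate rows: everything else, each with its own manipulated value.
isnt_filtered = FILTER data BY (NOT col2 == 'Some value');
m = FOREACH isnt_filtered GENERATE col1, col2, RANDOM() * 1000000 AS random,
        com.some.valueManipulation(col1) AS manipulated;

-- One output tuple per key, holding a bag of matching tuples from each relation.
grouped = COGROUP m BY manipulated, exclude BY manipulated;

-- Keep only the keys that never appear in the exclude list...
kept = FILTER grouped BY IsEmpty(exclude);

-- ...and unnest the surviving tuples from m.
merge_filtered = FOREACH kept GENERATE FLATTEN(m);
```

From there, the ORDER, LIMIT, and STORE steps in the question should be able to run against merge_filtered. If you want the opposite behavior (keep only rows whose key is in the exclude list), filtering on NOT IsEmpty(exclude) instead should do it.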