I have two files that I need to combine together using Apache PIG

Question

Asked: June 12, 2026

I have two files that I need to combine using Apache Pig. The first file contains a list of book titles, one title per line, like this:

Ted Dunning,  Mahout in Action
Leo Tolstoy,  War and Peace
Douglas Adams, The hitchhiker guide to the galaxy.
James Sununu,  galaxy III for Dummies
Tom McArthur,  The War we went to

The second file is a list of words and their IDs, like this:

ted, 12
tom, 13
douglas, 14
galaxy, 15
war, 16
leo, 17
peace, 18

I need to join these two files to produce the output like this:

For the line ‘Leo Tolstoy, War and Peace’ it should produce

17:1,16:1,18:1

For the line ‘Tom McArthur, The War we went to’ it should produce

13:1,16:1

In other words, I need to perform the join using the word as the key. So far I’ve written the following code in Pig:

titles = LOAD 'Titles' AS ( title : chararray );  
termIDs = LOAD  'TermIDs' AS (  term:chararray,id:int);

A = SAMPLE titles 0.01;
X = FOREACH A GENERATE STRSPLIT(title,'[ _\\[\\]\\/,\\.\\(\\)]+');

This loads both files, and X contains a list of tuples, each holding the terms that occur on the corresponding line, like this:

((ted,dunning,mahout,in,action))
((leo,tolstoy,war,and,peace))

Perhaps because it is late on a Saturday night, I can’t figure out how to do the JOIN step without writing a UDF or using streaming. Is it even possible using only Pig primitives?

1 Answer

Editorial Team · Answer 1 · 2026-06-12T06:36:37+00:00

You can FLATTEN the result of TOKENIZE so that every element of the bag becomes its own row. You can then join the X relation with termIDs.

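-- one output row per (title, term) pair
X = foreach A generate title, flatten(TOKENIZE(title)) as term;
-- keep only the terms that have an ID
J = join X by (term),  termIDs by (term);
G = group J by title;
-- the grouped bag is named J, so the IDs are projected from it
Result = foreach G generate group as title, J.id;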

The above code was typed on my mobile phone, so it was not debugged.

UPDATE 1:

For cases where STRSPLIT is preferable to TOKENIZE, you can combine FLATTEN and TOBAG to achieve the same effect as TOKENIZE, namely getting a bag of words from the tuple returned by STRSPLIT.

SPLT = foreach A generate title, FLATTEN(STRSPLIT(title,'[ _\\[\\]\\/,\\.\\(\\)]+'));
X_tmp = foreach SPLT generate $0 as title, FLATTEN(TOBAG($1..$20)) as term; -- pivots the row
X = filter X_tmp by term is not null; -- this removes the extra bag rows when title was split in less than 20 terms
J = join X by (term),  termIDs by (term) using 'replicated';
G = group J by title;
Result = foreach G generate group as title, J.id;

If any title contains more than 20 terms, increase the upper bound of the range passed to TOBAG.
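The question asks for an occurrence count next to each ID (17:1,16:1,18:1), and the sample titles are mixed case while the term list is lower case. The following is a minimal, untested sketch of one way to cover both points. It assumes titles and termIDs are loaded as in the question and that the term file parses into clean (term, id) pairs. The aliases P, C, GC and the field name n are illustrative names, not part of the original answer.

```pig
-- Assumes titles and termIDs are already loaded as in the question.
-- Lower-case each title before tokenizing so that "Leo" can match "leo".
X = FOREACH titles GENERATE title, FLATTEN(TOKENIZE(LOWER(title))) AS term;
J = JOIN X BY term, termIDs BY term;

-- Keep only the columns needed downstream and give them short names.
P = FOREACH J GENERATE X::title AS title, termIDs::id AS id;

-- Count how often each term ID occurs within each title.
GP = GROUP P BY (title, id);
C = FOREACH GP GENERATE FLATTEN(group) AS (title, id), COUNT(P) AS n;

-- Collect the (id, n) pairs of every title into one bag per title.
GC = GROUP C BY title;
Result = FOREACH GC GENERATE group AS title, C.(id, n);
```

Each row of Result should hold a title and a bag of (id, count) pairs. Turning that bag into the exact id:count string from the question is left out here, and the order of the pairs inside the bag is not guaranteed.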
