I have two files that I need to combine using Apache Pig. The first file contains a list of book titles, one title per line, like this:
Ted Dunning, Mahout in Action
Leo Tolstoy, War and Peace
Douglas Adams, The hitchhiker guide to the galaxy.
James Sununu, galaxy III for Dummies
Tom McArthur, The War we went to
The second file is a list of words and their IDs, like this:
ted, 12
tom, 13
douglas, 14
galaxy, 15
war, 16
leo, 17
peace, 18
I need to join these two files to produce output like this:
For the line ‘Leo Tolstoy, War and Peace’ it should produce
17:1,16:1,18:1
For the line ‘Tom McArthur, The War we went to’ it should produce
13:1,16:1
In other words, I need to perform the join using the word as the key. So far I’ve written the following code in Pig:
titles = LOAD 'Titles' AS ( title : chararray );
termIDs = LOAD 'TermIDs' AS ( term:chararray,id:int);
A = SAMPLE titles 0.01;
X = FOREACH A GENERATE STRSPLIT(title,'[ _\\[\\]\\/,\\.\\(\\)]+');
This loads both files, and X contains one tuple per line holding the terms that occur on that line, like this:
((ted,dunning,mahout,in,action))
((leo,tolstoy,war,and,peace))
It is late on a Saturday night and I can’t figure out how to do the JOIN step without writing a UDF or using streaming. Is it even possible using only Pig primitives?
You can FLATTEN the result of TOKENIZE, so that every element of the bag becomes its own row; you can then join the X relation with termIDs.
The above code was typed on my mobile phone, so it was not debugged.
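A minimal sketch of the FLATTEN/TOKENIZE/JOIN idea, equally untested. It assumes the term file is comma-separated exactly as shown in the question (hence PigStorage(',') and the TRIM calls), and the aliases ids, words, joined, pairs, grouped and counts are names made up for this example:

```pig
titles  = LOAD 'Titles' AS (title:chararray);
-- Assumption: lines look like "ted, 12", so split on ',' and trim the stray space
termIDs = LOAD 'TermIDs' USING PigStorage(',') AS (term:chararray, id:chararray);
ids     = FOREACH termIDs GENERATE LOWER(TRIM(term)) AS term, (int)TRIM(id) AS id;

-- TOKENIZE returns a bag of words; FLATTEN turns it into one row per (title, word)
words   = FOREACH titles GENERATE title, FLATTEN(TOKENIZE(LOWER(title))) AS term;

-- Inner join on the word; words without an ID simply drop out
joined  = JOIN words BY term, ids BY term;

-- Count how often each ID occurs within each title
pairs   = FOREACH joined GENERATE words::title AS title, ids::id AS id;
grouped = GROUP pairs BY (title, id);
counts  = FOREACH grouped GENERATE FLATTEN(group) AS (title, id), COUNT(pairs) AS n;

DUMP counts;
```

This yields one (title, id, count) row per matching word rather than the final 17:1,16:1,18:1 string; a further GROUP BY title would be needed to collect the pairs for each title. Note also that TOKENIZE does not split on periods by default, so a trailing word such as "galaxy." would not match "galaxy".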
UPDATE 1:
When STRSPLIT is preferable to TOKENIZE, you can combine FLATTEN and TOBAG to achieve the same effect as TOKENIZE, that is, to get a bag of words from the tuple returned by STRSPLIT.
If any title exceeds 20 terms, then increase the number in the TOBAG accordingly.
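If maintaining a fixed number of positions in TOBAG is awkward, a different route may be available: I believe newer Pig releases (0.14 and later, as far as I know) ship a built-in STRSPLITTOBAG that splits on a regex and returns a bag directly. The sketch below swaps that in for the FLATTEN/TOBAG combination; it is untested, reuses the titles and termIDs aliases and the regex from the question, and assumes the two-argument form is accepted the same way it is for STRSPLIT:

```pig
-- Assumption: STRSPLITTOBAG exists as a built-in in the Pig version in use
X = FOREACH titles GENERATE title,
        FLATTEN(STRSPLITTOBAG(LOWER(title), '[ _\\[\\]\\/,\\.\\(\\)]+')) AS term;

-- One row per (title, word), so the join key is simply the word
J = JOIN X BY term, termIDs BY term;
```

Since the bag grows with the number of terms, no per-title limit has to be chosen up front.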