I have a Pig question about what seems to need two levels of grouping. As an example, say I have input data like this (schema on the first line):
email_id:chararray from:chararray to:bag{recipients:tuple(recipient:chararray)}
e1 user1@example.com {(friend1@example.com),(friend2@example.com),(friend3@myusers.com)}
e2 user1@example.com {(friend1@example.com),(friend4@example.com)}
e3 user1@example.com {(friend5@example.com)}
e4 user2@example.com {(friend2@example.com),(friend4@example.com)}
So each line is an email from one user (“from”) to one or more users (“to”).
Ultimately I want, for each sender, a list of everyone they have sent email to along with the number of emails sent to each recipient, sorted by that count from highest to lowest. For example:
user1@example.com {(friend1@example.com, 2), (friend2@example.com, 1), (friend3@myusers.com, 1), (friend4@example.com, 1), (friend5@example.com, 1)}
user2@example.com {(friend2@example.com, 1), (friend4@example.com, 1)}
Any ideas on the best way to tackle this in Pig would be appreciated!
Here is one version of the script:
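A minimal sketch of one way to do it, under a few assumptions: the data sits in a tab-delimited file called emails.txt (a hypothetical path), and the fields are renamed to sender and recipients in the LOAD schema purely to avoid any clash with keywords. The idea is to flatten the recipient bag into (sender, recipient) pairs, group once by the pair to count, then group again by sender and sort inside a nested FOREACH.

```pig
-- Hypothetical input path; fields assumed tab-delimited, bag in Pig's text format.
emails = LOAD 'emails.txt' USING PigStorage('\t')
    AS (email_id:chararray,
        sender:chararray,
        recipients:bag{t:tuple(recipient:chararray)});

-- One row per (sender, recipient) occurrence.
pairs = FOREACH emails GENERATE sender, FLATTEN(recipients) AS recipient;

-- First grouping: count emails per (sender, recipient) pair.
by_pair = GROUP pairs BY (sender, recipient);
counts = FOREACH by_pair GENERATE
    FLATTEN(group) AS (sender, recipient),
    COUNT(pairs) AS n;

-- Second grouping: collect the counts per sender and sort them
-- from highest to lowest inside a nested FOREACH.
by_sender = GROUP counts BY sender;
result = FOREACH by_sender {
    ordered = ORDER counts BY n DESC, recipient ASC;
    GENERATE group AS sender, ordered.(recipient, n);
};

DUMP result;
```

The secondary sort on recipient is only there to make ties come out in a stable order; drop it if you do not care. If the top-level order of senders matters too, an ORDER result BY sender before the DUMP should handle that.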
Hope it helps.