I’ve been stuck on this question for a while. I have a data file that looks like this:
2012/01/01 Name1 "Category1,Category2,Category3"
2012/01/01 Name2 "Category2,Category3"
2012/01/01 Name3 "Category1,Category5"
Each item is associated with a comma-separated list of categories. I would like to be able to group by category name, to get output like this:
Category1 Name1, Name3
Category2 Name1, Name2
...
Category5 Name3
(even more specifically, I don’t need the names of the items – just the counts of number of items in that category would do)
I ended up writing a UDF to take the comma-separated category field, and convert it to a Pig bag. My data schema is now something like this:
{date: chararray, name: chararray, categories: {t: (category:chararray)}}
I am stuck on the next step – actually performing a grouping by nested bag value. I have tried variations of nested FOREACH statement without any luck. For example:
x = FOREACH myData
{
categoryNames = FOREACH categories GENERATE category;
GENERATE myData.Name, categoryNames;
}
My thought was that this kind of syntax could generate tuples of (Name, category), which I can run a GROUP over. However, the actual output is the whole bag, taking me back to square 1. I am out of ideas on how to proceed – help/feedback would be most appreciated. Thanks!
Assuming each name is unique in your data file, you could FLATTEN the bag of category, then GROUP by category and COUNT the number of names by category.
e.g.