Suppose I have a tab-delimited file containing user activity data formatted like this:
timestamp user_id page_id action_id
I want to write a Hadoop job to count user actions on each page, so the output file should look like this:
user_id page_id number_of_actions
I need something like a composite key here, containing user_id and page_id. Is there a generic way to do this with Hadoop? I couldn’t find anything helpful. So far I’m emitting the key like this in the mapper:
context.write(new Text(user_id + "\t" + page_id), one);
It works, but I feel that it’s not the best solution.
Just compose your own Writable (to be used as a key, it has to be a WritableComparable). In your example a solution could look like this:
Although I think your IDs could be a long, here you have the String version. Basically it is just the normal serialization over the Writable interface; note that it needs the default constructor, so you should always provide one.
The compareTo logic tells how to sort the dataset and also tells the reducer which elements are equal so they can be grouped. ComparisonChain is a nice util of Guava.
Don’t forget to override equals and hashCode! The default partitioner will determine the reducer by the hashCode of the key.
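A minimal sketch of such a composite key, assuming String IDs as described above. The class name UserPageKey and its field names are hypothetical. To keep the sketch compilable without Hadoop on the classpath it only implements Comparable; in a real job you would declare it as implements WritableComparable<UserPageKey> (from org.apache.hadoop.io), whose write and readFields signatures are the ones used here. Plain compareTo calls stand in for Guava's ComparisonChain to avoid the extra dependency.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

// Hypothetical composite key holding user_id and page_id.
// In a Hadoop job: "implements WritableComparable<UserPageKey>".
class UserPageKey implements Comparable<UserPageKey> {
    private String userId;
    private String pageId;

    // Default constructor: Hadoop creates the key reflectively
    // and then fills it in through readFields().
    public UserPageKey() {
        this("", "");
    }

    public UserPageKey(String userId, String pageId) {
        this.userId = userId;
        this.pageId = pageId;
    }

    // Serialize both fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(userId);
        out.writeUTF(pageId);
    }

    // Deserialize in exactly the order used by write().
    public void readFields(DataInput in) throws IOException {
        userId = in.readUTF();
        pageId = in.readUTF();
    }

    // Sort by user first, then by page; a result of 0 means "same group".
    @Override
    public int compareTo(UserPageKey other) {
        int byUser = userId.compareTo(other.userId);
        return byUser != 0 ? byUser : pageId.compareTo(other.pageId);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UserPageKey)) return false;
        UserPageKey k = (UserPageKey) o;
        return userId.equals(k.userId) && pageId.equals(k.pageId);
    }

    // Must be consistent with equals(): hash-based partitioning relies on it.
    @Override
    public int hashCode() {
        return Objects.hash(userId, pageId);
    }

    // Tab-separated so a text output format writes user_id and page_id as columns.
    @Override
    public String toString() {
        return userId + "\t" + pageId;
    }
}
```

With a key like this, the mapper line from the question would become something like context.write(new UserPageKey(user_id, page_id), one); and the job would also need job.setMapOutputKeyClass(UserPageKey.class). Because toString() joins the two IDs with a tab, TextOutputFormat should produce the user_id, page_id, number_of_actions layout asked for.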