This is my first time using map/reduce. I want to write a program that processes a large log file. For example, if I were processing a log file whose records consist of {Student, College, GPA}, and wanted to sort all students by college, what would be the ‘map’ part and what would be the ‘reduce’ part? I am having some difficulty with the concept, despite having gone over a number of tutorials and examples.
Thanks!
Technically speaking, Hadoop MapReduce treats everything as key-value pairs; you just need to define what the keys are and what the values are. The signatures of map and reduce are
    map: (K1, V1) → list(K2, V2)
    reduce: (K2, list(V2)) → list(K3, V3)
with sorting taking place on the K2 keys in the intermediate shuffle phase between map and reduce.
If your inputs are of the form
    {Student, College, GPA}
then your mapper should do nothing more than move the College value into the key:
    (College, {Student, GPA})
With college as the new key, Hadoop will sort by college for you. Your reducer, then, is just a plain old “identity reducer.”
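To make the flow concrete, here is a minimal local sketch in plain Python (not Hadoop code) of the three steps: a mapper that re-keys each record by college, a sort standing in for the shuffle phase, and an identity reducer. The function names and the sample records are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_record(record):
    """Map: move the college into the key position."""
    student, college, gpa = record
    yield college, (student, gpa)


def identity_reduce(key, values):
    """Reduce: pass every (key, value) pair through unchanged."""
    for value in values:
        yield key, value


def run_job(records):
    """Simulate map -> shuffle/sort -> reduce in memory."""
    # Map phase: every input record becomes a (college, (student, gpa)) pair.
    intermediate = [pair for record in records for pair in map_record(record)]
    # Shuffle phase: the framework sorts and groups by key; sorted() stands in here.
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [value for _, value in group]
        output.extend(identity_reduce(key, values))
    return output


# Hypothetical sample data, not taken from any real log file.
records = [
    ("Alice", "MIT", 3.9),
    ("Bob", "Caltech", 3.5),
    ("Carol", "MIT", 3.7),
]
sorted_by_college = run_job(records)
```

Note that all the real work happens in the sort between the two phases; the mapper only decides what to sort on, and the reducer does nothing at all.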
If you are carrying out a sorting operation in practice (that is, this isn’t a homework problem), then check out Hive or Pig. These systems drastically simplify these kinds of tasks, and sorting on a particular column becomes quite trivial. However, it is always educational to write, say, a Hadoop Streaming job for tasks like the one you identified here, to give you a better understanding of mappers and reducers.
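As a rough idea of what the mapper for such a streaming job might look like, here is a sketch in Python. Hadoop Streaming feeds input lines to the mapper on stdin and, by default, treats everything up to the first tab of each output line as the key. The comma-separated `student,college,gpa` layout and the names `map_line` and `run_mapper` are assumptions for illustration; adapt the parsing to your actual log format.

```python
import sys


def map_line(line, sep=","):
    """Turn 'student,college,gpa' into 'college<TAB>student,gpa'.

    Returns None for lines that do not have exactly three fields.
    The comma-separated layout is an assumption about the log format.
    """
    fields = [field.strip() for field in line.strip().split(sep)]
    if len(fields) != 3:
        return None
    student, college, gpa = fields
    # The text before the first tab is what the framework sorts on.
    return f"{college}\t{student}{sep}{gpa}"


def run_mapper(lines, out=sys.stdout):
    """Write one re-keyed line per valid input line, skipping malformed ones."""
    for line in lines:
        mapped = map_line(line)
        if mapped is not None:
            out.write(mapped + "\n")
```

In an actual streaming script you would finish with a call to `run_mapper(sys.stdin)`. Since the reducer only has to pass lines through, a pass-through command such as `cat` can probably serve as the identity reducer.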