I am trying to write a MapReduce application in which the Mapper passes a set of values to the Reducer as follows:
Hello
World
Hello
Hello
World
Hi
These values need to be grouped and counted first, after which some further processing is done. The code I wrote is:
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    List<String> records = new ArrayList<String>();
    /* Collects all the records from the mapper into the list. */
    for (Text value : values) {
        records.add(value.toString());
    }
    /* Groups the values. */
    Map<String, Integer> groupedData = groupAndCount(records);
    Set<String> groupKeys = groupedData.keySet();
    /* Writes the grouped data. */
    for (String groupKey : groupKeys) {
        System.out.println(groupKey + ": " + groupedData.get(groupKey));
        context.write(NullWritable.get(), new Text(groupKey + groupedData.get(groupKey)));
    }
}
public Map<String, Integer> groupAndCount(List<String> records) {
    Map<String, Integer> groupedData = new HashMap<String, Integer>();
    String currentRecord = "";
    Collections.sort(records);
    for (String record : records) {
        System.out.println(record);
        if (!currentRecord.equals(record)) {
            currentRecord = record;
            groupedData.put(currentRecord, 1);
        } else {
            int currentCount = groupedData.get(currentRecord);
            groupedData.put(currentRecord, ++currentCount);
        }
    }
    return groupedData;
}
But in the output I get a count of 1 for every value. The System.out.println statements print something like:
Hello
World
Hello: 1
World: 1
Hello
Hello: 1
Hello
World
Hello: 1
World: 1
Hi
Hi: 1
I cannot understand what the issue is, or why the Reducer does not receive all the records at once and pass them to the groupAndCount method in a single call.
As you note in your comment, if each value has a different corresponding key then they will not be reduced in the same reduce call, and you’ll get the output you’re currently seeing.
Fundamental to Hadoop reducers is the notion that values are collected and reduced per key: each reduce call receives only the values that share one key. I suggest you re-read some of the Hadoop getting started documentation, especially the Word Count example, which appears to be roughly what you are trying to achieve with your code.
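To make the per-key grouping concrete, here is a minimal plain-Java sketch that imitates the shuffle and reduce phases. It does not use any Hadoop classes, so it only approximates what the framework does. The names ShuffleSketch, Pair, withUniqueKeys and withWordAsKey are made up for illustration, and the unique keys "0" to "5" are an assumption standing in for whatever distinct keys your Mapper currently emits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java imitation of shuffle + reduce; hypothetical, not Hadoop API. */
final class ShuffleSketch {

    /** Hypothetical stand-in for one (key, value) pair emitted by a mapper. */
    static final class Pair {
        final String key;
        final String value;

        Pair(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Imitates the shuffle: gathers the values of all pairs under their key, keys sorted. */
    static Map<String, List<String>> shuffle(List<Pair> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Pair p : mapOutput) {
            grouped.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value);
        }
        return grouped;
    }

    /** Imitates one reduce call: groups and counts only the values it was handed. */
    static List<String> reduce(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + ": " + e.getValue());
        }
        return out;
    }

    /** Runs the shuffle, then one reduce call per distinct key. */
    static List<String> run(List<Pair> mapOutput) {
        List<String> output = new ArrayList<>();
        for (List<String> values : shuffle(mapOutput).values()) {
            output.addAll(reduce(values));
        }
        return output;
    }

    /** Assumed current behaviour: every record gets its own key. */
    static List<Pair> withUniqueKeys(List<String> words) {
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            pairs.add(new Pair(String.valueOf(i), words.get(i)));
        }
        return pairs;
    }

    /** Word Count style: the word itself is the key. */
    static List<Pair> withWordAsKey(List<String> words) {
        List<Pair> pairs = new ArrayList<>();
        for (String w : words) {
            pairs.add(new Pair(w, w));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Hello", "World", "Hello", "Hello", "World", "Hi");
        // Six distinct keys -> six reduce calls, each seeing a single value.
        System.out.println(run(withUniqueKeys(words)));
        // prints [Hello: 1, World: 1, Hello: 1, Hello: 1, World: 1, Hi: 1]
        // Three distinct keys -> three reduce calls, each seeing all its occurrences.
        System.out.println(run(withWordAsKey(words)));
        // prints [Hello: 3, Hi: 1, World: 2]
    }
}
```

In a real job the equivalent change would be for the Mapper to emit the word itself as the key, as the Word Count example does, so that a single reduce call receives every occurrence of that word.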