In an attempt to learn Hadoop, I am practicing the unsolved programming questions from the book “Hadoop in Action”.
Dataset Sample:
3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,
3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,
3070803,1963,1096,,"US","IL",,1,,2,6,63,,9,,0.3704,,,,,,,
3070804,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,,,,,
3070805,1963,1096,,"US","CA",,1,,2,6,63,,1,,0,,,,,,,
3070806,1963,1096,,"US","PA",,1,,2,6,63,,0,,,,,,,,,
3070807,1963,1096,,"US","OH",,1,,623,3,39,,3,,0.4444,,,,,,,
3070808,1963,1096,,"US","IA",,1,,623,3,39,,4,,0.375,,,,,,,
3070809,1963,1096,,"US","AZ",,1,,4,6,65,,0,,,,,,,,,
3070810,1963,1096,,"US","IL",,1,,4,6,65,,3,,0.4444,,,,,,,
Map Function
public static class MapClass extends MapReduceBase implements Mapper<Text, Text, IntWritable, Text> {
    private int maxClaimCount = 0;
    private Text record = new Text();

    public void map(Text key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
        String claim = value.toString().split(",")[7];
        //if (!claim.isEmpty() && claim.matches("\\d")) {
        if (!claim.isEmpty()) {
            int claimCount = Integer.parseInt(claim);
            if (claimCount > maxClaimCount) {
                maxClaimCount = claimCount;
                record = value;
                output.collect(new IntWritable(claimCount), value);
            }
            // output.collect(new IntWritable(claimCount), value);
        }
    }
}
Reduce Function
public static class Reduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
        output.collect(key, values.next());
    }
}
Command to Run:
hadoop jar ~/Desktop/wc.jar com/hadoop/patent/TopKRecords -Dmapred.map.tasks=7 ~/input ~/output
Requirement:
– Based on the ninth column value, find the top-K records (say, K = 7) in the dataset.
Question:
– Since only the top 7 records are needed, I run seven map tasks and keep track of the highest claim count seen so far in maxClaimCount and record.
– I do not know how to collect only the maximum record, so that each map task emits just one output.
How do I do that?
This is an updated answer. The existing comments do not apply to it, as they are based on the original (incorrect) answer.
The mapper should only output
output.collect(new IntWritable(claimCount), value);
without any comparison. The results will be sorted by claim count and passed to the reducer.
In the reducer, use a priority queue to pick the top 7 results.
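To illustrate the priority-queue step, here is a minimal plain-Java sketch with no Hadoop dependencies. It keeps a min-heap bounded to K entries, so the smallest retained claim count is evicted whenever a larger one arrives. The names TopKSketch, Entry, and topK, as well as the column parameter, are hypothetical and chosen only for this example. In an actual job the same logic would presumably sit in the reducer, with the queue held as a field across reduce() calls and the job configured with a single reducer so that all records pass through one place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of the original job: selects the K records
// with the largest integer value in a given CSV column.
final class TopKSketch {

    // Pairs a parsed claim count with the full record it came from.
    static final class Entry {
        final int claimCount;
        final String record;

        Entry(int claimCount, String record) {
            this.claimCount = claimCount;
            this.record = record;
        }
    }

    // Returns up to k records, largest claim count first.
    // Records whose column is missing, empty, or non-numeric are skipped.
    static List<String> topK(Iterable<String> records, int k, int column) {
        // Min-heap: the head is always the smallest count currently retained.
        PriorityQueue<Entry> heap =
                new PriorityQueue<>(Comparator.comparingInt((Entry e) -> e.claimCount));

        for (String record : records) {
            String[] fields = record.split(",");
            if (fields.length <= column || fields[column].isEmpty()) {
                continue;
            }
            int count;
            try {
                count = Integer.parseInt(fields[column]);
            } catch (NumberFormatException e) {
                continue; // e.g. a quoted or otherwise non-integer field
            }
            heap.add(new Entry(count, record));
            if (heap.size() > k) {
                heap.poll(); // evict the smallest so at most k entries remain
            }
        }

        // The heap drains in ascending order; reverse to get largest first.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().record);
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("a,5", "b,1", "c,9", "d,", "e,7");
        for (String record : topK(sample, 2, 1)) {
            System.out.println(record);
        }
    }
}
```

With the question's data, a call such as TopKSketch.topK(lines, 7, 7) would return up to seven records ranked by the field at index 7, which is the column the question's mapper reads. When several records share the same count at the cut-off, which of them survives is not specified by this sketch.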