What I have in the output is:
word , file
----- ------
wordx Doc2, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1
What I want is:
word , file
----- ------
wordx Doc2, Doc1
public static class LineIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private static final Text word = new Text();
    private static final Text location = new Text();

    public void map(LongWritable key, Text val,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // The file name of the current split is the "location" of every word in it.
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        location.set(fileName);

        // Emit one (word, fileName) pair per token on the line.
        String line = val.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, location);
        }
    }
}
public static class LineIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Join every file name received for this word with ", ".
        boolean first = true;
        StringBuilder toReturn = new StringBuilder();
        while (values.hasNext()) {
            if (!first) {
                toReturn.append(", ");
            }
            first = false;
            toReturn.append(values.next().toString());
        }
        output.collect(key, new Text(toReturn.toString()));
    }
}
For the best performance, where should I skip the recurring file name: in map, in reduce, or in both?
PS: I am a beginner at writing MapReduce tasks and am also trying to work out the programming logic behind my question.
You will only be able to remove duplicates in the Reducer, because that is the only place where all the values emitted for a given word are brought together. To do so, you can use a Set, which does not allow duplicates.
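As a rough sketch of the Set idea, here is the dedup step on its own, in plain Java so it can be run without Hadoop. The class name DistinctJoiner and the method joinDistinct are hypothetical, made up for this illustration, and the values are plain Strings rather than Text. I am assuming a LinkedHashSet so the file names keep the order in which they were first seen; a HashSet would also remove the duplicates but without a predictable order.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper (not part of Hadoop): only the dedup step of reduce().
class DistinctJoiner {

    // Keeps each distinct value once, in first-seen order, and joins them with ", ".
    static String joinDistinct(Iterator<String> values) {
        Set<String> seen = new LinkedHashSet<>();
        while (values.hasNext()) {
            // In the real reducer this would be values.next().toString().
            // Hadoop reuses the same Text instance for every value, so store
            // a copy (the String) in the Set, never the Text object itself.
            seen.add(values.next());
        }
        return String.join(", ", seen);
    }
}
```

Inside LineIndexReducer.reduce, the same loop would fill the Set from values.next().toString(), and the joined result would then be passed to output.collect(key, new Text(...)) as before.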
Edit: Now adds a copy of each value to the Set, as per Chris’ comment.