what i have in output is: word , file —– —— wordx Doc2, Doc1,

Editorial Team

Asked: June 3, 2026

What I have in the output is:

word , file
---- , ----
wordx  Doc2, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1

What I want is:

word , file
---- , ----
wordx  Doc2, Doc1

public static class LineIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text val,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        location.set(fileName);

        String line = val.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, location);
        }
    }
}

public static class LineIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        boolean first = true;
        StringBuilder toReturn = new StringBuilder();
        while (values.hasNext()) {
            if (!first) {
                toReturn.append(", ");
            }
            first = false;
            toReturn.append(values.next().toString());
        }

        output.collect(key, new Text(toReturn.toString()));
    }
}

For the best performance, where should I skip the recurring file name: in the map, in the reduce, or in both?
PS: I am a beginner at writing MapReduce jobs and am also trying to work out the programming logic behind my question.

1 Answer

Editorial Team · Answer 1 · 2026-06-03T03:24:09+00:00

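You can only remove duplicates completely in the Reducer, because that is the one place where all values for a given word come together. Skipping repeats inside a single map() call, or in a combiner, can reduce the amount of data shuffled, but the Reducer still has to do the final deduplication. To do so, you can use a Set, which does not allow duplicates.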

public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // Text overrides equals() and hashCode(), so a HashSet deduplicates by content
    Set<Text> outputValues = new HashSet<Text>();

    while (values.hasNext()) {
        // copy the value: Hadoop reuses the same Text instance across iterations
        Text value = new Text(values.next());

        // adding to the Set drops duplicates
        outputValues.add(value);
    }

    boolean first = true;
    StringBuilder toReturn = new StringBuilder();
    Iterator<Text> outputIter = outputValues.iterator();
    while (outputIter.hasNext()) {
        if (!first) {
            toReturn.append(", ");
        }
        first = false;
        toReturn.append(outputIter.next().toString());
    }

    output.collect(key, new Text(toReturn.toString()));
}
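
The Set-based approach can be tried outside Hadoop first. The following is a minimal sketch, not the reducer itself: plain Strings stand in for Text values, and the class name DedupSketch and method joinDistinct are made up for illustration. It uses a LinkedHashSet rather than a HashSet, because a HashSet does not promise any iteration order, while a LinkedHashSet keeps the order in which values were first seen. Note that the order in which values reach a real reducer is not guaranteed in general, so this only preserves whatever order happens to arrive.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: Strings stand in for Text values.
class DedupSketch {

    // Joins the distinct values from the iterator, keeping first-seen order.
    static String joinDistinct(Iterator<String> values) {
        Set<String> seen = new LinkedHashSet<>();
        while (values.hasNext()) {
            seen.add(values.next()); // add() ignores values already present
        }
        return String.join(", ", seen);
    }

    public static void main(String[] args) {
        Iterator<String> values = Arrays.asList(
                "Doc2", "Doc1", "Doc1", "Doc1").iterator();
        System.out.println(joinDistinct(values)); // prints "Doc2, Doc1"
    }
}
```

In the reducer, the same idea would apply to copies of the Text values, with the joined string wrapped in a new Text before calling output.collect.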

Edit: Adds copy of value to Set as per Chris’ comment.
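
Why the copy matters can be shown without Hadoop. This is a sketch under stated assumptions: MutableValue is a made-up stand-in for a reusable Writable such as Text, and reusingIterator imitates an iterator that hands back the same object on every call, which is how Hadoop's value iterator is generally understood to behave. Storing the reference keeps only the last value; storing a copy keeps each one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins, not Hadoop classes.
class ReuseSketch {

    // Mimics a mutable, reusable value holder such as Text.
    static final class MutableValue {
        String content = "";
        MutableValue() {}
        MutableValue(MutableValue other) { this.content = other.content; } // copy constructor
        @Override public String toString() { return content; }
    }

    // Returns an iterator that refills and returns the SAME object each time.
    static Iterator<MutableValue> reusingIterator(List<String> data) {
        final MutableValue shared = new MutableValue();
        final Iterator<String> source = data.iterator();
        return new Iterator<MutableValue>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public MutableValue next() {
                shared.content = source.next();
                return shared;
            }
        };
    }

    // Collects values either by reference (copy == false) or by copy.
    static List<String> collect(List<String> data, boolean copy) {
        List<MutableValue> kept = new ArrayList<>();
        Iterator<MutableValue> it = reusingIterator(data);
        while (it.hasNext()) {
            MutableValue v = it.next();
            kept.add(copy ? new MutableValue(v) : v);
        }
        List<String> out = new ArrayList<>();
        for (MutableValue v : kept) {
            out.add(v.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("Doc2", "Doc1");
        System.out.println(collect(data, false)); // prints "[Doc1, Doc1]"
        System.out.println(collect(data, true));  // prints "[Doc2, Doc1]"
    }
}
```

Without the copy, every stored entry points at the one shared object, so all of them show whatever was read last.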
