I wanted to know: Does hadoop mapreduce re-process the entire dataset if the same

Question


Asked: May 27, 2026


I wanted to know:
Does Hadoop MapReduce re-process the entire dataset if the same job is submitted twice?
For example: the word count example counts the occurrences of each word in each file in an input folder.
If I were to add a file to that folder and re-run the word count MapReduce job, will the initial files be re-read, re-mapped and re-reduced?

If so, is there a way to configure Hadoop to process ONLY the new files and add the result to a “summary” from previous MapReduce runs?

Any thoughts/help will be appreciated.


1 Answer

Editorial Team · Answer 1 · 2026-05-27T20:55:19+00:00

If I were to add a file to that folder, and re-run the word count mapreduce job, will the initial files be re-read, re-mapped and re-reduced?

Yes. Hadoop reprocesses the entire input each time the job is run; it keeps no record of which files an earlier run already handled. The mapper output and other temporary data are deleted once the job has completed successfully, so nothing from the previous run is left to reuse.

If so, is there a way to configure hadoop to process ONLY the new files and add it to a “summary” from previous mapreduce runs.

Hadoop as-is doesn’t support such a scenario, but you could write a custom InputFormat that checks for unprocessed or new files, and a custom OutputFormat that adds the data to the summary from the previous run. Alternatively, once the job has run, put the new files to be processed in a different input folder and let the job process only the files in that folder.
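To make the idea concrete, here is a minimal sketch in plain Java of the two pieces described above: picking out only the files that an earlier run has not seen, and folding the new counts into the previous summary. It deliberately uses no Hadoop classes. The names IncrementalWordCount, selectNewFiles, countWords and mergeIntoSummary are made up for illustration, and the "manifest" of processed files is assumed to be something you maintain yourself between runs. In a real job the same logic would have to live in your custom InputFormat/OutputFormat or in a small follow-up merge job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; none of these names are part of the Hadoop API.
class IncrementalWordCount {

    // Return the input files that are not yet listed in the manifest of processed files.
    static List<String> selectNewFiles(List<String> inputFiles, Set<String> processed) {
        List<String> fresh = new ArrayList<>();
        for (String file : inputFiles) {
            if (!processed.contains(file)) {
                fresh.add(file);
            }
        }
        return fresh;
    }

    // Count the words in one piece of text (a stand-in for the word count job itself).
    static Map<String, Long> countWords(String text) {
        Map<String, Long> counts = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Add the counts from the new run to the summary of earlier runs, without mutating either map.
    static Map<String, Long> mergeIntoSummary(Map<String, Long> summary, Map<String, Long> delta) {
        Map<String, Long> merged = new HashMap<>(summary);
        delta.forEach((word, n) -> merged.merge(word, n, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical state left over from a previous run.
        Set<String> processed = new HashSet<>(Set.of("a.txt", "b.txt"));
        Map<String, Long> summary = Map.of("hadoop", 2L, "job", 1L);

        // Only c.txt is new, so only its contents are counted.
        List<String> fresh = selectNewFiles(List.of("a.txt", "b.txt", "c.txt"), processed);
        Map<String, Long> delta = countWords("hadoop job job");

        System.out.println("new files: " + fresh);
        System.out.println("summary:   " + mergeIntoSummary(summary, delta));

        // Record the new files so the next run skips them.
        processed.addAll(fresh);
    }
}
```

Merging works here because word counts are additive. For aggregates that are not (a median, say), the previous summary alone would not be enough and the old input would have to be read again.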

Check this article on creating custom input/output formats.

I am not sure of the exact requirements, but you could also consider frameworks that process streams of data, such as HStreaming, S4, Twitter Storm and others.
