I wanted to know:
Does hadoop mapreduce re-process the entire dataset if the same job is submitted twice?
For example: the word count example counts the occurrence of each word in each file in an input folder.
If I were to add a file to that folder and re-run the word count mapreduce job, would the initial files be re-read, re-mapped and re-reduced?
If so, is there a way to configure hadoop to process ONLY the new files and add their results to a “summary” from previous mapreduce runs?
Any thoughts or help would be appreciated.
Hadoop will reprocess the entire data set when the job is run again. The mapper output and the other temporary data are deleted once the job has completed successfully.
Hadoop as-is doesn’t support such a scenario, but you could write a custom InputFormat that checks for unprocessed or new files and a custom OutputFormat that adds the data to the summary from the previous run. Alternatively, once the job has run, the new files can be put in a different input folder so that the next job processes only the files in that folder.
Check this article on creating custom input/output formats.
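To make the incremental idea above more concrete, here is a minimal sketch of the bookkeeping it needs: working out which files have not been processed yet, and folding the counts from the new files into the previous summary. It is plain Java and deliberately uses no Hadoop classes. The class name IncrementalWordCount, the methods selectNewFiles and mergeIntoSummary, and the idea of keeping a set of already-processed file names are all assumptions made for illustration, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: illustrates the bookkeeping only, not a Hadoop InputFormat/OutputFormat.
class IncrementalWordCount {

    // Returns the input files that are not yet listed in the set of processed files.
    static List<String> selectNewFiles(Collection<String> inputFiles, Set<String> processed) {
        List<String> fresh = new ArrayList<>();
        for (String file : inputFiles) {
            if (!processed.contains(file)) {
                fresh.add(file);
            }
        }
        return fresh;
    }

    // Adds the counts from the latest run (new files only) to the summary of earlier runs.
    // The inputs are left unmodified; a new sorted map is returned.
    static Map<String, Long> mergeIntoSummary(Map<String, Long> summary, Map<String, Long> delta) {
        Map<String, Long> merged = new TreeMap<>(summary);
        for (Map.Entry<String, Long> entry : delta.entrySet()) {
            merged.merge(entry.getKey(), entry.getValue(), Long::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Assumed state: a.txt and b.txt were counted by an earlier run.
        Set<String> processed = new HashSet<>(Arrays.asList("a.txt", "b.txt"));
        List<String> fresh = selectNewFiles(Arrays.asList("a.txt", "b.txt", "c.txt"), processed);
        System.out.println("Files to process: " + fresh);

        Map<String, Long> summary = new HashMap<>();
        summary.put("hadoop", 2L);

        // Counts that a job over the new files alone might have produced.
        Map<String, Long> delta = new HashMap<>();
        delta.put("hadoop", 1L);
        delta.put("storm", 4L);
        System.out.println("Updated summary: " + mergeIntoSummary(summary, delta));
    }
}
```

In an actual job, the selected files would become the job's input paths, and both the list of processed files and the summary would have to be persisted between runs (for example on HDFS). How best to do that depends on your setup.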
I am not sure of the exact requirements, but you could also consider frameworks that process streams of data, such as HStreaming, S4, Twitter Storm and others.