In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?
The reduce phase has three steps: shuffle, sort, and reduce. Shuffle is where the reducer collects data from each mapper. This can happen while mappers are still generating data, since it is only a data transfer. Sort and reduce, on the other hand, can only start once all the mappers are done. You can tell which step a reducer is in by looking at its completion percentage: 0-33% means it is shuffling, 34-66% is sort, and 67-100% is reduce. This is why your reducers will sometimes seem “stuck” at 33%: they are waiting for mappers to finish.
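As a rough illustration of the percentage ranges above, here is a minimal Java sketch that maps a reducer's displayed completion percentage to the step it is probably in. The class and method names (ReducerPhase, phaseFor) are made up for this example and are not part of Hadoop; the cut-offs are simply the 33/66 boundaries quoted above.

```java
// Hypothetical helper, not a Hadoop API: translates the integer completion
// percentage shown for a reducer into the step described in the answer.
final class ReducerPhase {

    static String phaseFor(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in 0..100: " + percent);
        }
        if (percent <= 33) {
            return "shuffle"; // copying map output; may be waiting on mappers
        }
        if (percent <= 66) {
            return "sort";    // merging the fetched map output
        }
        return "reduce";      // running the user's reduce function
    }

    public static void main(String[] args) {
        // Print the step for a few sample percentages.
        for (int p : new int[] {0, 33, 34, 66, 67, 100}) {
            System.out.println(p + "% -> " + phaseFor(p));
        }
    }
}
```

A reducer that sits at 33% for a long time is therefore still in the shuffle step, which fits the "waiting for mappers" explanation.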
Reducers start shuffling once a configurable percentage of the mappers have finished. You can change this parameter to make reducers start sooner or later.
Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which is a good thing if your network is the bottleneck.
Why is starting the reducers early a bad thing? Because they “hog up” reduce slots while only copying data and waiting for mappers to finish. A job that starts later and could actually use those reduce slots can’t get them.
You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. In new versions of Hadoop (at least 2.4.1) the parameter is called mapreduce.job.reduce.slowstart.completedmaps (thanks user yegor256).
Typically, I like to keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn’t hog up reducers when they aren’t doing anything but copying data. If you only ever have one job running at a time, a value of 0.1 would probably be appropriate.
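To make the effect of the different values concrete, here is a small Java sketch of the threshold check. It is a simplified model and not Hadoop's scheduler code: the names (SlowstartGate, mapsRequired, reducersMayStart) are hypothetical, and rounding the required map count up with a ceiling is an assumption made for this example.

```java
// Simplified model of the slowstart threshold; not Hadoop's actual code.
final class SlowstartGate {

    // Number of maps that must complete before reducers become eligible.
    // Assumption: the fraction is rounded up to a whole number of maps.
    static int mapsRequired(double slowstart, int totalMaps) {
        if (slowstart < 0.0 || slowstart > 1.0) {
            throw new IllegalArgumentException("slowstart must be in [0, 1]: " + slowstart);
        }
        if (totalMaps < 0) {
            throw new IllegalArgumentException("totalMaps must be non-negative: " + totalMaps);
        }
        return (int) Math.ceil(slowstart * totalMaps);
    }

    // True once enough maps have finished for reducers to begin shuffling.
    static boolean reducersMayStart(double slowstart, int totalMaps, int completedMaps) {
        return completedMaps >= mapsRequired(slowstart, totalMaps);
    }

    public static void main(String[] args) {
        int totalMaps = 200; // example job size, chosen arbitrarily
        for (double s : new double[] {0.0, 0.5, 0.9, 1.0}) {
            System.out.printf("slowstart=%.2f -> reducers eligible after %d of %d maps%n",
                    s, mapsRequired(s, totalMaps), totalMaps);
        }
    }
}
```

Under this model a value of 0.0 lets reducers start before any map finishes, while 1.0 holds them back until every map is done, which matches the behavior described above.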