I’ve written a Hadoop Map Reduce job. When I run it locally, I notice that if I don’t specify any reduce tasks there are some temporary files written to the output directory. If I specify reducers no temporary files are written. Is this normal behavior? I would expect to see the temporary files written otherwise it would mean that the mapper is trying to do everything in memory and then transfer to the reducer in memory. This strikes me as implausible.
Any insights into how/when/where the mapper writes intermediate output to the file system would be appreciated.
Thanks
Map tasks write their output to the local disk, not to HDFS. Map output
is intermediate output: it’s processed by reduce tasks to produce the final output, and
once the job is complete the map output can be thrown away. So storing it in HDFS,
with replication, would be overkill.
But if we set number of reducers to 0 then map output is stored on HDFS as final output. There is no reduce phase so output of the mapper is the output of the whole job.
Additionally here is how to look into intermediate files even if reducer is specified.