Is there a way to set the replication factor for the output of a specific MapReduce job to something different from the rest of the cluster (say 1)? I'd like my main data set to keep 3 replicas (as it does currently), but the output of some of my jobs moves out of the cluster quickly and is eventually discarded, so it needs no replication and I could use the space.
I could use setrep but I think I can only do that after the fact.
When you upload a file, you can override the DFS default replication factor by passing
The same option should also work when you pass it while invoking a job.
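The option itself did not survive in the text above. As a rough sketch, assuming it was the standard HDFS property `dfs.replication` set through Hadoop's generic `-D` option, the two invocations could be assembled like this. The helper names, jar name, class name, and paths are all hypothetical, and the script only builds and prints the commands rather than running them.

```python
import shlex


def upload_command(local_path, hdfs_path, replication=1):
    """Build argv for an upload that overrides the default replication factor.

    Assumption: the override is the dfs.replication property, passed as a
    generic -D option before the -put subcommand.
    """
    return ["hadoop", "fs", "-D", f"dfs.replication={replication}",
            "-put", local_path, hdfs_path]


def job_command(jar, main_class, job_args, replication=1):
    """Build argv for a job launch with the same override.

    Assumption: the job's driver parses generic options (for example by
    running through ToolRunner); otherwise -D is treated as a plain argument.
    """
    return ["hadoop", "jar", jar, main_class,
            "-D", f"dfs.replication={replication}", *job_args]


if __name__ == "__main__":
    # Hypothetical file, jar, class, and paths, for illustration only.
    print(shlex.join(upload_command("results.txt", "/tmp/out/results.txt")))
    print(shlex.join(job_command("myjob.jar", "com.example.MyJob",
                                 ["/data/in", "/data/out"])))
```

If the driver does not go through generic option parsing, the equivalent would be to set the same property on the job's configuration in code. Files that were already written can still be changed afterwards with setrep, as mentioned in the question.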