Given Hadoop 0.21.0, what assumptions does the framework make regarding the number of open file descriptors relative to each individual map and reduce operation? Specifically, what suboperations cause Hadoop to open a new file descriptor during job execution or spill to disk?
(This is deliberately ignoring use of MultipleOutputs, as it very clearly screws with the guarantees provided by the system.)
My rationale here is simple: I’d like to ensure each job I write for Hadoop guarantees a finite number of required file descriptors for each mapper or reducer. Hadoop cheerfully abstracts this away from the programmer, which would normally be A Good Thing, if not for the other shoe dropping during server management.
I’d originally asked this question on Server Fault from the cluster management side of things. Since I’m also responsible for programming, this question is equally pertinent here.
Here’s a post that offers some insight into the problem:
This implies that, for normal behavior, the number of mappers is exactly equivalent to the number of open file descriptors.
MultipleOutputs obviously skews this number by the number of mappers multiplied by the number of available partitions. Reducers then proceed as normal, generating one file (and thus one file descriptor) per reduce operation.
The problem then becomes: during a spill operation, most of these files are held open by each mapper as output is cheerfully marshalled by split. Hence the available file descriptors problem.
Thus, the currently assumed maximum file descriptor limit should be:
And that, as we say, is that.
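For anyone who wants to check an estimate like this against a running task rather than trust the arithmetic, here is a minimal sketch that reads the JVM's own open descriptor count. It assumes a Unix-like host and a JVM that exposes com.sun.management.UnixOperatingSystemMXBean; the class name FdProbe and its two helper methods are hypothetical names made up for illustration, not part of Hadoop.

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper (not a Hadoop class): reports descriptor usage of the current JVM.
public class FdProbe {

    // Number of file descriptors currently open in this JVM,
    // or -1 if the platform does not expose the Unix-specific bean.
    public static long openFds() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            return ((UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1;
    }

    // Per-process descriptor limit as seen by the JVM, or -1 if unavailable.
    public static long maxFds() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            return ((UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("open=" + openFds() + " max=" + maxFds());
    }
}
```

Logging FdProbe.openFds() at the start and end of a map or reduce task (where exactly to hook it in is left as an assumption) should show whether the observed count stays within the bound you expect, including during spills.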