Given Hadoop 0.21.0, what assumptions does the framework make regarding the number of open

Question

0

Asked: May 18, 20262026-05-18T12:38:15+00:00 2026-05-18T12:38:15+00:00

Given Hadoop 0.21.0, what assumptions does the framework make regarding the number of open

0

Given Hadoop 0.21.0, what assumptions does the framework make regarding the number of open file descriptors relative to each individual map and reduce operation? Specifically, what suboperations cause Hadoop to open a new file descriptor during job execution or spill to disk?

(This is deliberately ignoring use of MultipleOutputs, as it very clearly screws with the guarantees provided by the system.)

My rationale here is simple: I’d like to ensure each job I write for Hadoop guarantees a finite number of required file descriptors for each mapper or reducer. Hadoop cheerfully abstracts this away from the programmer, which would normally be A Good Thing, if not for the other shoe dropping during server management.

I’d originally asked this question on Server Fault from the cluster management side of things. Since I’m also responsible for programming, this question is equally pertinent here.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-18T12:38:15+00:00

Here’s a post that offers some insight into the problem:

This happens because more small files are created when you use MultipleOutputs class.
Say you have 50 mappers then assuming that you don’t have skewed data, Test1 will always generate exactly 50 files but Test2 will generate somewhere between 50 to 1000 files (50Mappers x 20TotalPartitionsPossible) and this causes a performance hit in I/O. In my benchmark, 199 output files were generated for Test1 and 4569 output files were generated for Test2.

This implies that, for normal behavior, the number of mappers is exactly equivalent to the number of open file descriptors. MultipleOutputs obviously skews this number by the number of mappers multiplied by the number of available partitions. Reducers then proceed as normal, generating one file (and thus, one file descriptor) per reduce operation.

The problem then becomes: during a spill operation, most of these files are being held open by each mapper as output is cheerfully martialled by split. Hence the available file descriptors problem.

Thus, the currently-assumed, maximum file descriptor limit should be:

Map phase: number of mappers * total partitions possible

Reduce phase: number of reduce operations * total partitions possible

And that, as we say, is that.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Given Hadoop 0.21.0, what assumptions does the framework make regarding the number of open

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply